Why are you caching both the RDD and the table?
I try to cache all the data to avoid bad performance on the first
query. Is that right?
Which stage of the job is slow?
The query is run many times on one sqlContext, and each query execution
takes 1 second.
2015-04-23 11:33 GMT+03:00 ayan guha guha.a...@gmail.com:
Quick questions: why are you caching both the RDD and the table?
Which stage of job is slow?
On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com
wrote:
Hi,
I have a Spark SQL performance issue. My code contains a simple JavaBean:
public class Person implements Externalizable {
private int id;
private String name;
private double salary;
}
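As far as I know, `sqlContext.createDataFrame(rdds, Person.class)` infers the schema by reflecting on JavaBean accessors, so the bean needs public getters/setters, and `Externalizable` additionally requires a public no-arg constructor. A minimal sketch of a bean satisfying both, using the field names above (the read/write order is my own choice for illustration):

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Person implements Externalizable {
    private int id;
    private String name;
    private double salary;

    // Externalizable deserialization requires a public no-arg constructor.
    public Person() {}

    // Public getters/setters let Spark SQL infer the schema via bean reflection.
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public double getSalary() { return salary; }
    public void setSalary(double salary) { this.salary = salary; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Write the fields in a fixed order; readExternal must mirror it.
        out.writeInt(id);
        out.writeUTF(name);
        out.writeDouble(salary);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        id = in.readInt();
        name = in.readUTF();
        salary = in.readDouble();
    }
}
```

If the bean exposes only private fields without accessors, the inferred schema can come out empty, which is worth ruling out before looking at cluster tuning.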
I apply a schema to the RDD and register a table.
JavaRDD<Person> rdds = ...
rdds.cache();
DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
dataFrame.registerTempTable("person");
sqlContext.cacheTable("person");
Then I run a SQL query.
sqlContext.sql("SELECT id, name, salary FROM person WHERE salary <= YYY
AND salary >= XXX").collectAsList()
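On the caching question above: caching both levels stores the data twice, since `rdds.cache()` keeps the raw Java objects while `cacheTable("person")` keeps the same rows again in Spark SQL's in-memory columnar format. A sketch of the leaner variant, assuming the goal is only to avoid a slow first query: cache the table alone and warm it with one cheap action (the app name and the `COUNT(*)` warm-up query are my own illustrative choices, and this is written against the Spark 1.x `SQLContext`/`DataFrame` API used in the thread):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CacheOnlyTable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Build the RDD as before, but do NOT call rdds.cache():
        // cacheTable() below keeps the rows in the columnar cache,
        // so an RDD-level cache would hold a second copy of the data.
        JavaRDD<Person> rdds = sc.parallelize(Arrays.asList(/* ... Person instances ... */));

        DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
        dataFrame.registerTempTable("person");
        sqlContext.cacheTable("person");

        // cacheTable is lazy (as far as I know), so run one cheap action
        // up front; otherwise the first real query pays the caching cost.
        sqlContext.sql("SELECT COUNT(*) FROM person").collect();
    }
}
```

With the cache warmed this way, the 1-minute first run should not recur on later queries; if every query takes a minute, the cache is probably not being hit at all, which points back at schema inference or memory pressure.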
I launch a standalone cluster with 4 workers. Each node runs on a
machine with 8 CPUs and 15 GB of memory. When I run the query on this
environment over an RDD containing 1 million persons, it takes 1 minute.
Can somebody tell me how to tune the performance?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org