How Spark SQL supports primary and secondary indexes

2015-04-29 Thread Nikolay Tikhonov
Hi all,

I executed a simple SQL query and got unacceptable performance.

I do the following steps:

1. Apply a schema to an RDD and register a table.


2. Run an SQL query that returns several entries (a combined sketch of both steps follows):
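A minimal, self-contained sketch of what the two steps look like with the Spark 1.3 Java API, reusing the Person bean (id, name, salary) from the "Spark SQL performance issue" thread below; the Person constructor, the sample rows, and the salary bounds are illustrative assumptions, not code from the original message:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class IndexQuestionSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("index-question").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);

        // Step 1: apply a schema to an RDD of beans and register it as a temp table.
        // Person is assumed to be a JavaBean with getters and an (id, name, salary)
        // constructor.
        JavaRDD<Person> persons = sc.parallelize(Arrays.asList(
                new Person(1, "Alice", 150.0),
                new Person(2, "Bob", 250.0)));
        DataFrame df = sqlContext.createDataFrame(persons, Person.class);
        df.registerTempTable("person");

        // Step 2: run an SQL query that returns several entries.
        List<Row> rows = sqlContext.sql(
                "SELECT id, name, salary FROM person WHERE salary >= 100 AND salary <= 200")
                .collectAsList();
        System.out.println(rows);

        sc.stop();
    }
}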


The running time for this query is 0.2s (the table contains 10 entries). I think
that Spark SQL does a full in-memory scan, and an index might improve performance.
How can I add indexes?
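Not from the original message: one way to check whether the query really does a full scan of the cached relation is to print its plan. A small sketch, continuing from the code above and assuming the Spark 1.3 DataFrame API:

// With sqlContext.cacheTable("person") in effect, the physical plan typically
// shows an in-memory columnar table scan of the cached relation.
DataFrame query = sqlContext.sql(
        "SELECT id, name, salary FROM person WHERE salary >= 100 AND salary <= 200");
query.explain(true);  // true = also print the logical plans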






Re: How Spark SQL supports primary and secondary indexes

2015-04-29 Thread Nikolay Tikhonov
I'm running this query with different parameters on the same RDD and get 0.2s
for each query.
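For reference, a sketch of what running the same query with different parameters and timing each run might look like. The bounds, step size, and timing code are illustrative assumptions; sqlContext and the person table come from the sketch in the first message:

// Run the same range query with shifting bounds and time each execution.
for (double lower = 0; lower < 1000; lower += 100) {
    double upper = lower + 100;
    long start = System.nanoTime();
    List<Row> rows = sqlContext.sql(
            "SELECT id, name, salary FROM person WHERE salary >= " + lower
            + " AND salary <= " + upper).collectAsList();
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println(rows.size() + " rows in " + elapsedMs + " ms");
}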






Re: Spark SQL performance issue.

2015-04-23 Thread Nikolay Tikhonov
 why are you caching both the RDD and the table?
I try to cache all the data to avoid bad performance on the first query. Is
that right?

 Which stage of the job is slow?
The query is run many times on one sqlContext, and each execution takes 1 second.
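A sketch of the alternative being discussed: cache only the SQL table and warm it up once, instead of also calling cache() on the underlying RDD. This is not code from the thread (rdds and Person are the names used in the quoted code below), and whether cacheTable materializes the cache eagerly or on first use depends on the Spark version, so the warm-up query is a hedge:

// Cache only the table; the raw JavaRDD does not need a separate cache() if it
// is only ever accessed through the DataFrame.
DataFrame people = sqlContext.createDataFrame(rdds, Person.class);
people.registerTempTable("person");
sqlContext.cacheTable("person");

// Warm-up: run one cheap query so the in-memory columnar cache is built before
// the first query whose latency matters.
sqlContext.sql("SELECT COUNT(*) FROM person").collectAsList();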

2015-04-23 11:33 GMT+03:00 ayan guha guha.a...@gmail.com:

 Quick questions: why are you caching both the RDD and the table?
 Which stage of the job is slow?
 On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com
 wrote:

 Hi,
 I have a Spark SQL performance issue. My code contains a simple JavaBean:

 public class Person implements Externalizable {
 private int id;
 private String name;
 private double salary;
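 // getters/setters (used by createDataFrame's bean introspection) and Externalizable methods elided: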
 
 }


 Apply a schema to an RDD and register a table.

 JavaRDD<Person> rdds = ...
 rdds.cache();

 DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
 dataFrame.registerTempTable("person");

 sqlContext.cacheTable("person");


 Run an SQL query.

 sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY"
     + " AND salary <= XXX").collectAsList();


 I launched a standalone cluster with 4 workers. Each node runs on a machine
 with 8 CPUs and 15 GB of memory. When I run the query in this environment
 over an RDD containing 1 million persons, it takes 1 minute. Can somebody
 tell me how to tune the performance?


