Re: Spark SQL performance issue.

Arush Kharbanda Thu, 23 Apr 2015 02:40:37 -0700

Hi

Can you share your Web UI, showing the task-level breakdown? I can see several
things that can be improved.


1. JavaRDD<Person> rdds = ...; rdds.cache(); -> This caching is not needed, as
you are not reading the RDD in any action; the queries run against the cached
table.

2. Instead of collecting the result as a list, save it as a text file if you
can. That avoids moving the results to the driver.
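
A minimal sketch of both suggestions, assuming Spark 1.3.x (the DataFrame Java API used in the original post) is on the classpath. The class name SalaryQuerySketch, the helper salaryRangeQuery, the method names, and the outputPath parameter are hypothetical, and the Person bean here is a shortened stand-in for the poster's class, so treat this as an illustration rather than a drop-in fix:

```java
import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Sketch only: assumes Spark 1.3.x on the classpath.
final class SalaryQuerySketch {

    // Stand-in for the Person bean from the original post (which implements
    // Externalizable); plain Serializable is used here to keep the sketch short.
    public static class Person implements Serializable {
        private int id;
        private String name;
        private double salary;

        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public double getSalary() { return salary; }
        public void setSalary(double salary) { this.salary = salary; }
    }

    // Hypothetical helper: builds the range query from the post with concrete bounds.
    static String salaryRangeQuery(double min, double max) {
        return "SELECT id, name, salary FROM person"
                + " WHERE salary >= " + min + " AND salary <= " + max;
    }

    // Registers and caches the table once. The RDD itself is not cached,
    // because every query goes through the cached table.
    static void registerPersons(SQLContext sqlContext, JavaRDD<Person> persons) {
        DataFrame dataFrame = sqlContext.createDataFrame(persons, Person.class);
        dataFrame.registerTempTable("person");
        sqlContext.cacheTable("person");
    }

    // Runs the query and writes the rows out from the executors instead of
    // pulling them to the driver with collectAsList().
    static void runAndSave(SQLContext sqlContext, double min, double max,
                           String outputPath) {
        DataFrame result = sqlContext.sql(salaryRangeQuery(min, max));
        // The output directory must not already exist.
        result.javaRDD().saveAsTextFile(outputPath);
    }
}
```

Call registerPersons once per sqlContext and runAndSave for each query. If the result is small and really is needed on the driver, collectAsList() is still fine; the point is only to avoid it for large results.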

Thanks
Arush

On Thu, Apr 23, 2015 at 2:47 PM, Nikolay Tikhonov
<tikhonovnico...@gmail.com> wrote:

> > Why are you caching both the RDD and the table?
> I am trying to cache all the data to avoid bad performance on the first
> query. Is that right?
>
> > Which stage of the job is slow?
> The query is run many times on one sqlContext, and each query execution
> takes 1 second.
>
> 2015-04-23 11:33 GMT+03:00 ayan guha <guha.a...@gmail.com>:
>
>> Quick questions: why are you caching both the RDD and the table?
>> Which stage of the job is slow?
>> On 23 Apr 2015 17:12, "Nikolay Tikhonov" <tikhonovnico...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I have a Spark SQL performance issue. My code contains a simple JavaBean:
>>>
>>>     public class Person implements Externalizable {
>>>         private int id;
>>>         private String name;
>>>         private double salary;
>>>         ....................
>>>     }
>>>
>>>
>>> Apply a schema to an RDD and register the table.
>>>
>>>     JavaRDD<Person> rdds = ...
>>>     rdds.cache();
>>>
>>>     DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
>>>     dataFrame.registerTempTable("person");
>>>
>>>     sqlContext.cacheTable("person");
>>>
>>>
>>> Run the SQL query.
>>>
>>>     sqlContext.sql("SELECT id, name, salary FROM person"
>>>         + " WHERE salary >= YYY AND salary <= XXX").collectAsList();
>>>
>>>
>>> I launched a standalone cluster with 4 workers. Each node runs on a
>>> machine with 8 CPUs and 15 GB of memory. When I run the query in this
>>> environment over an RDD that contains 1 million persons, it takes 1 minute.
>>> Can somebody tell me how to tune the performance?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
>


-- 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
