Re: SparkSQL performance

Renato Marroquín Mogrovejo Tue, 21 Apr 2015 04:38:29 -0700

Thanks for the hints, guys! Much appreciated!
Even if I just do something like:


"Select * from tableX where attribute1 < 5"

I see similar behaviour.

@Michael
Could you point me to any sample code that uses Spark's Rows? We are at a
phase where we can actually swap our JavaBeans for something that
provides better performance than what we are seeing now. Would you
recommend using an Avro representation then?
Thanks again!


Renato M.

2015-04-21 1:18 GMT+02:00 Michael Armbrust <mich...@databricks.com>:

> There is a cost to converting from JavaBeans to Rows and this code path
> has not been optimized.  That is likely what you are seeing.
>
> On Mon, Apr 20, 2015 at 3:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> SparkSQL optimizes better by column pruning and predicate pushdown,
>> primarily. Here you are not taking advantage of either.
>>
>> I am curious to know what goes into your filter function, as you are not
>> using a filter on the SQL side.
>>
>> Best
>> Ayan
>> On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <
>> renatoj.marroq...@gmail.com> wrote:
>>
>>> Does anybody have an idea? a clue? a hint?
>>> Thanks!
>>>
>>>
>>> Renato M.
>>>
>>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <
>>> renatoj.marroq...@gmail.com>:
>>>
>>>> Hi all,
>>>>
>>>> I have a simple query "Select * from tableX where attribute1 between 0
>>>> and 5" that I run over a Kryo file with four partitions that ends up being
>>>> around 3.5 million rows in our case.
>>>> If I run this query by doing a simple map().filter() it takes around
>>>> 9.6 seconds, but when I apply a schema, register the table in a SQLContext,
>>>> and then run the query, it takes around 16 seconds. This is using Spark
>>>> 1.2.1 with Scala 2.10.0.
>>>> I am wondering why there is such a big gap in performance if it is just
>>>> a filter. Internally, the relation files are mapped to a JavaBean. Could
>>>> this different data representation (JavaBeans vs. SparkSQL's internal
>>>> representation) lead to such a difference? Is there anything I could do to
>>>> make the performance get closer to the "hard-coded" option?
>>>> Thanks in advance for any suggestions or ideas.
>>>>
>>>>
>>>> Renato M.
>>>>
>>>
>>>
>
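
As a rough illustration of the conversion cost Michael mentions above, here is a minimal plain-Java sketch with no Spark dependency. It is not Spark's actual code path. The names BeanToRowSketch, Record, filterBeans, toRows and filterRows are hypothetical and exist only for this example. The assumption is that a generic bean-to-row conversion looks roughly like this: every getter is invoked reflectively and every value is boxed into an untyped row before the predicate ever runs, whereas the hand-written filter reads the typed field directly.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class BeanToRowSketch {

    /** Hypothetical bean standing in for one record of tableX. */
    public static class Record {
        private final int attribute1;
        private final String name;

        public Record(int attribute1, String name) {
            this.attribute1 = attribute1;
            this.name = name;
        }

        public int getAttribute1() { return attribute1; }
        public String getName() { return name; }
    }

    /** Direct path: test the typed field on each bean, no conversion. */
    public static List<Record> filterBeans(List<Record> beans, int upperBound) {
        List<Record> kept = new ArrayList<>();
        for (Record bean : beans) {
            if (bean.getAttribute1() < upperBound) {
                kept.add(bean);
            }
        }
        return kept;
    }

    /** Generic path, step 1: copy every bean into an untyped Object[] row. */
    public static List<Object[]> toRows(List<Record> beans, String... getterNames) {
        try {
            // Look the getters up once, then invoke them for every bean.
            Method[] getters = new Method[getterNames.length];
            for (int i = 0; i < getterNames.length; i++) {
                getters[i] = Record.class.getMethod(getterNames[i]);
            }
            List<Object[]> rows = new ArrayList<>(beans.size());
            for (Record bean : beans) {
                Object[] row = new Object[getters.length];
                for (int i = 0; i < getters.length; i++) {
                    row[i] = getters[i].invoke(bean); // reflective call, boxes the int
                }
                rows.add(row);
            }
            return rows;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("bean conversion failed", e);
        }
    }

    /** Generic path, step 2: evaluate the predicate against a column index. */
    public static List<Object[]> filterRows(List<Object[]> rows, int column, int upperBound) {
        List<Object[]> kept = new ArrayList<>();
        for (Object[] row : rows) {
            if ((Integer) row[column] < upperBound) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Record> beans = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            beans.add(new Record(i, "r" + i));
        }
        List<Record> direct = filterBeans(beans, 5);
        List<Object[]> viaRows = filterRows(toRows(beans, "getAttribute1", "getName"), 0, 5);
        // Both paths keep the same records; only the work done per record differs.
        System.out.println(direct.size() + " " + viaRows.size());
    }
}
```

Both paths return the same five records for the ten generated beans. The second path does an extra reflective call and allocation per column per record, which is the kind of overhead that building rows directly, rather than going through beans, would avoid.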
