There is a cost to converting from JavaBeans to Rows, and that code path has not been optimized. That is likely what you are seeing.
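
To make the comparison concrete, here is a minimal Scala sketch of the two code paths against the Spark 1.2 API. The Record case class, data, and table name are placeholders standing in for your JavaBean setup, not taken from your code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

// Hypothetical stand-in for the JavaBean used in the original job.
case class Record(attribute1: Int, payload: String)

object FilterComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-comparison"))
    val sqlContext = new SQLContext(sc)

    // Placeholder data; the real job reads ~3.5M rows from a Kryo file.
    val records = sc.parallelize(1 to 3500000).map(i => Record(i % 10, "row-" + i))

    // "Hard-coded" path: the predicate runs directly on the deserialized objects.
    val direct = records.filter(r => r.attribute1 >= 0 && r.attribute1 <= 5).count()

    // SQL path (Spark 1.2 API): every object must first be converted to a Row
    // so Catalyst can evaluate the predicate. That per-row conversion is the
    // unoptimized cost mentioned above.
    val schema = StructType(Seq(
      StructField("attribute1", IntegerType, nullable = false),
      StructField("payload", StringType, nullable = false)))
    val rowRDD = records.map(r => Row(r.attribute1, r.payload)) // bean -> Row
    sqlContext.applySchema(rowRDD, schema).registerTempTable("tableX")
    val viaSql = sqlContext
      .sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")
      .count()

    println("direct=" + direct + " sql=" + viaSql)
    sc.stop()
  }
}

Timing each count() should reproduce the shape of the gap: the SQL run pays for a full extra pass that materializes a Row per input object before the filter even sees it.
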
On Mon, Apr 20, 2015 at 3:55 PM, ayan guha <guha.a...@gmail.com> wrote:

> Spark SQL optimizes better primarily through column pruning and predicate
> pushdown. Here you are not taking advantage of either.
>
> I am curious to know what goes into your filter function, as you are not
> using a filter on the SQL side.
>
> Best
> Ayan
>
> On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <
> renatoj.marroq...@gmail.com> wrote:
>
>> Does anybody have an idea? A clue? A hint?
>> Thanks!
>>
>> Renato M.
>>
>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com>:
>>
>>> Hi all,
>>>
>>> I have a simple query "Select * from tableX where attribute1 between 0
>>> and 5" that I run over a Kryo file with four partitions, which ends up
>>> being around 3.5 million rows in our case.
>>> If I run this query as a simple map().filter(), it takes around 9.6
>>> seconds, but when I apply a schema, register the table in a SQLContext,
>>> and then run the query, it takes around 16 seconds. This is using Spark
>>> 1.2.1 with Scala 2.10.0.
>>> I am wondering why there is such a big gap in performance if it is just
>>> a filter. Internally, the relation files are mapped to a JavaBean. Could
>>> this different data representation (JavaBeans vs. Spark SQL's internal
>>> representation) lead to such a difference? Is there anything I could do
>>> to bring the performance closer to the "hard-coded" option?
>>> Thanks in advance for any suggestions or ideas.
>>>
>>> Renato M.
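
Following up on Ayan's point above about pruning: a query can only benefit from column pruning if it asks for fewer columns than the table carries. A rough sketch, reusing the hypothetical sqlContext and tableX from the first example:

// SELECT * gives the optimizer nothing to prune: every column of every
// matching row is materialized.
val everything = sqlContext
  .sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")

// Projecting only the needed column lets Catalyst drop the rest early.
// Over a row-based RDD this mainly saves copying; over a columnar source
// (Parquet, or a table cached with sqlContext.cacheTable("tableX")) the
// pruned columns are never read at all.
val pruned = sqlContext
  .sql("SELECT attribute1 FROM tableX WHERE attribute1 BETWEEN 0 AND 5")

println(pruned.count())

Predicate pushdown is similar: it only helps when the underlying source (e.g. Parquet) can evaluate the filter itself, which a Kryo file read through plain RDDs cannot.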