https://github.com/databricks/spark-avro
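For reference, a minimal sketch of the spark-avro usage the link above describes. The input path ("episodes.avro") and the title/doctor fields are hypothetical placeholders, and the avroFile() helper assumes a spark-avro build matching your Spark version is on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // brings avroFile() into scope on SQLContext

object AvroScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-scan"))
    val sqlContext = new SQLContext(sc)

    // The schema is read from the Avro file itself; no JavaBeans involved
    val episodes = sqlContext.avroFile("episodes.avro")
    episodes.registerTempTable("episodes")

    sqlContext.sql("SELECT title FROM episodes WHERE doctor < 5")
      .collect()
      .foreach(println)
  }
}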
On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:

> Thanks Michael!
> I have tried applying my schema programmatically, but I didn't get any
> improvement in performance :(
> Could you point me to some code examples using Avro, please?
> Many thanks again!
>
>
> Renato M.
>
> 2015-04-21 20:45 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>
>> Here is an example using rows directly:
>>
>> https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema
>>
>> Avro or Parquet input would likely give you the best performance.
>>
>> On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com> wrote:
>>
>>> Thanks for the hints, guys! Much appreciated!
>>> Even if I just do something like:
>>>
>>> "Select * from tableX where attribute1 < 5"
>>>
>>> I see similar behaviour.
>>>
>>> @Michael
>>> Could you point me to any sample code that uses Spark's Rows? We are at
>>> a phase where we can actually change our JavaBeans for something that
>>> provides better performance than what we are seeing now. Would you
>>> recommend using the Avro representation then?
>>> Thanks again!
>>>
>>>
>>> Renato M.
>>>
>>> 2015-04-21 1:18 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>>>
>>>> There is a cost to converting from JavaBeans to Rows, and this code
>>>> path has not been optimized. That is likely what you are seeing.
>>>>
>>>> On Mon, Apr 20, 2015 at 3:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> SparkSQL optimizes primarily through column pruning and predicate
>>>>> pushdown, and here you are taking advantage of neither.
>>>>>
>>>>> I am curious to know what goes on in your filter function, as you are
>>>>> not using a filter on the SQL side.
>>>>>
>>>>> Best
>>>>> Ayan
>>>>> On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <
>>>>> renatoj.marroq...@gmail.com> wrote:
>>>>>
>>>>>> Does anybody have an idea? A clue? A hint?
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> Renato M.
>>>>>>
>>>>>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <
>>>>>> renatoj.marroq...@gmail.com>:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a simple query, "Select * from tableX where attribute1
>>>>>>> between 0 and 5", that I run over a Kryo file with four partitions,
>>>>>>> which ends up being around 3.5 million rows in our case.
>>>>>>> If I run this query as a simple map().filter(), it takes around 9.6
>>>>>>> seconds, but when I apply a schema, register the table in a
>>>>>>> SQLContext, and then run the query, it takes around 16 seconds. This
>>>>>>> is using Spark 1.2.1 with Scala 2.10.0.
>>>>>>> I am wondering why there is such a big performance gap if it is just
>>>>>>> a filter. Internally, the relation's files are mapped to JavaBeans.
>>>>>>> Could this different data representation (JavaBeans vs. SparkSQL's
>>>>>>> internal representation) lead to such a difference? Is there anything
>>>>>>> I could do to bring the performance closer to the "hard-coded" option?
>>>>>>> Thanks in advance for any suggestions or ideas.
>>>>>>>
>>>>>>>
>>>>>>> Renato M.
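To make the thread's suggestions concrete, here is a minimal sketch of the "programmatically specifying the schema" approach Michael links to, applied to the tableX query above. The one-column schema, the "tableX.txt" input, and the line-parsing logic are assumptions for illustration; the code uses the Spark 1.3 API, and on Spark 1.2.x (which Renato is running) the equivalent of createDataFrame is sqlContext.applySchema:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object RowSchemaScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("row-schema-scan"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical text input: one integer attribute per line
    val schema = StructType(Seq(StructField("attribute1", IntegerType, nullable = false)))
    val rows = sc.textFile("tableX.txt").map(line => Row(line.trim.toInt))

    // Building Rows directly skips the JavaBean-to-Row conversion
    // (use sqlContext.applySchema(rows, schema) on Spark 1.2.x)
    val table = sqlContext.createDataFrame(rows, schema)
    table.registerTempTable("tableX")

    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")
      .collect()
  }
}

Because the table is built from Spark SQL's own Row representation, this should avoid the unoptimized JavaBean-to-Row conversion Michael identifies as the likely source of the 9.6s-vs-16s gap.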