Thanks for the hints, guys! Much appreciated!
Even if I just do something like:

"Select * from tableX where attribute1 < 5"

I see similar behaviour.

@Michael Could you point me to any sample code that uses Spark's Rows?
We are at a phase where we can actually change our JavaBeans for something
that provides better performance than what we are seeing now. Would you
recommend using an Avro representation then?
Thanks again!


Renato M.

2015-04-21 1:18 GMT+02:00 Michael Armbrust <mich...@databricks.com>:

> There is a cost to converting from JavaBeans to Rows and this code path
> has not been optimized. That is likely what you are seeing.
>
> On Mon, Apr 20, 2015 at 3:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> SparkSQL optimizes better by column pruning and predicate pushdown,
>> primarily. Here you are not taking advantage of either.
>>
>> I am curious to know what goes in your filter function, as you are not
>> using a filter on the SQL side.
>>
>> Best
>> Ayan
>> On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <
>> renatoj.marroq...@gmail.com> wrote:
>>
>>> Does anybody have an idea? A clue? A hint?
>>> Thanks!
>>>
>>>
>>> Renato M.
>>>
>>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <
>>> renatoj.marroq...@gmail.com>:
>>>
>>>> Hi all,
>>>>
>>>> I have a simple query "Select * from tableX where attribute1 between 0
>>>> and 5" that I run over a Kryo file with four partitions that ends up being
>>>> around 3.5 million rows in our case.
>>>> If I run this query by doing a simple map().filter() it takes around
>>>> 9.6 seconds, but when I apply a schema, register the table into a SqlContext,
>>>> and then run the query, it takes around 16 seconds. This is using Spark
>>>> 1.2.1 with Scala 2.10.0.
>>>> I am wondering why there is such a big gap in performance if it is just
>>>> a filter. Internally, the relation files are mapped to a JavaBean. Could
>>>> this different data representation (JavaBeans vs SparkSQL internal
>>>> representation) lead to such a difference?
>>>> Is there anything I could do to make the
>>>> performance get closer to the "hard-coded" option?
>>>> Thanks in advance for any suggestions or ideas.
>>>>
>>>>
>>>> Renato M.