There is a cost to converting from JavaBeans to Rows, and that code path has not been optimized. That is likely what you are seeing.
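
To make the comparison concrete, here is a minimal Scala sketch of the two code paths against the Spark 1.2 API. The Record case class, data, and table name are placeholders standing in for your JavaBean setup, not taken from your code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

// Hypothetical stand-in for the JavaBean used in the original job.
case class Record(attribute1: Int, payload: String)

object FilterComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-comparison"))
    val sqlContext = new SQLContext(sc)

    // Placeholder data; the real job reads ~3.5M rows from a Kryo file.
    val records = sc.parallelize(1 to 3500000).map(i => Record(i % 10, "row-" + i))

    // "Hard-coded" path: the predicate runs directly on the deserialized objects.
    val direct = records.filter(r => r.attribute1 >= 0 && r.attribute1 <= 5).count()

    // SQL path (Spark 1.2 API): every object must first be converted to a Row
    // so Catalyst can evaluate the predicate. That per-row conversion is the
    // unoptimized cost mentioned above.
    val schema = StructType(Seq(
      StructField("attribute1", IntegerType, nullable = false),
      StructField("payload", StringType, nullable = false)))
    val rowRDD = records.map(r => Row(r.attribute1, r.payload)) // bean -> Row
    sqlContext.applySchema(rowRDD, schema).registerTempTable("tableX")
    val viaSql = sqlContext
      .sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")
      .count()

    println("direct=" + direct + " sql=" + viaSql)
    sc.stop()
  }
}

Timing each count() should reproduce the shape of the gap: the SQL run pays for a full extra pass that materializes a Row per input object before the filter even sees it.
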
On Mon, Apr 20, 2015 at 3:55 PM, ayan guha <guha.a...@gmail.com> wrote:

> Spark SQL optimizes better primarily through column pruning and predicate
> pushdown. Here you are not taking advantage of either.
>
> I am curious to know what goes into your filter function, as you are not
> using a filter on the SQL side.
>
> Best
> Ayan
>
> On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <
> renatoj.marroq...@gmail.com> wrote:
>
>> Does anybody have an idea? A clue? A hint?
>> Thanks!
>>
>> Renato M.
>>
>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com>:
>>
>>> Hi all,
>>>
>>> I have a simple query "Select * from tableX where attribute1 between 0
>>> and 5" that I run over a Kryo file with four partitions, which ends up
>>> being around 3.5 million rows in our case.
>>> If I run this query as a simple map().filter(), it takes around 9.6
>>> seconds, but when I apply a schema, register the table in a SQLContext,
>>> and then run the query, it takes around 16 seconds. This is using Spark
>>> 1.2.1 with Scala 2.10.0.
>>> I am wondering why there is such a big gap in performance if it is just
>>> a filter. Internally, the relation files are mapped to a JavaBean. Could
>>> this different data representation (JavaBeans vs. Spark SQL's internal
>>> representation) lead to such a difference? Is there anything I could do
>>> to bring the performance closer to the "hard-coded" option?
>>> Thanks in advance for any suggestions or ideas.
>>>
>>> Renato M.
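
Following up on Ayan's point above about pruning: a query can only benefit from column pruning if it asks for fewer columns than the table carries. A rough sketch, reusing the hypothetical sqlContext and tableX from the first example:

// SELECT * gives the optimizer nothing to prune: every column of every
// matching row is materialized.
val everything = sqlContext
  .sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")

// Projecting only the needed column lets Catalyst drop the rest early.
// Over a row-based RDD this mainly saves copying; over a columnar source
// (Parquet, or a table cached with sqlContext.cacheTable("tableX")) the
// pruned columns are never read at all.
val pruned = sqlContext
  .sql("SELECT attribute1 FROM tableX WHERE attribute1 BETWEEN 0 AND 5")

println(pruned.count())

Predicate pushdown is similar: it only helps when the underlying source (e.g. Parquet) can evaluate the filter itself, which a Kryo file read through plain RDDs cannot.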