Spark SQL gets most of its advantage from column pruning and predicate pushdown. Here you are not taking advantage of either.
I am curious to know what goes into your filter function, as you are not using a filter on the SQL side.

Best
Ayan

On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <renatoj.marroq...@gmail.com> wrote:

> Does anybody have an idea? A clue? A hint?
> Thanks!
>
> Renato M.
>
> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com>:
>
>> Hi all,
>>
>> I have a simple query "Select * from tableX where attribute1 between 0
>> and 5" that I run over a Kryo file with four partitions, which ends up
>> being around 3.5 million rows in our case.
>> If I run this query as a simple map().filter(), it takes around ~9.6
>> seconds, but when I apply a schema, register the table in a SQLContext,
>> and then run the query, it takes around ~16 seconds. This is on Spark
>> 1.2.1 with Scala 2.10.0.
>> I am wondering why there is such a big performance gap if it is just a
>> filter. Internally, the relation files are mapped to a JavaBean. Could
>> this different data representation (JavaBeans vs. the Spark SQL internal
>> representation) lead to such a difference? Is there anything I could do
>> to make the performance get closer to the "hard-coded" option?
>> Thanks in advance for any suggestions or ideas.
>>
>> Renato M.
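For reference, a minimal sketch of the two paths being compared, using the Spark 1.2.x Scala API. The Record case class, field names, and the generated input are hypothetical stand-ins for the Kryo-backed JavaBeans described in the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type standing in for the JavaBean from the thread.
    case class Record(attribute1: Int, attribute2: String)

    object FilterComparison {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("filter-comparison"))
        val sqlContext = new SQLContext(sc)
        // Spark 1.2.x implicit: RDD of case classes -> SchemaRDD.
        import sqlContext.createSchemaRDD

        // Stand-in for the 3.5M-row, four-partition input from the thread.
        val records = sc.parallelize(0 until 3500000, 4)
          .map(i => Record(i % 100, s"row-$i"))

        // "Hard-coded" variant: a plain RDD filter, no SQL involved.
        val hardCoded = records
          .filter(r => r.attribute1 >= 0 && r.attribute1 <= 5)
          .count()

        // SQL variant: register the RDD as a table and run the same predicate.
        records.registerTempTable("tableX")
        val viaSql = sqlContext.sql(
          "SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count()

        println(s"hard-coded: $hardCoded, sql: $viaSql")
        sc.stop()
      }
    }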
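And a sketch of what Ayan's point implies: pruning and pushdown only pay off when the projection and predicate are visible to the optimizer and the source can exploit them, for example a Parquet file rather than an RDD of beans. The path and column name below are hypothetical:

    // Parquet is a columnar source, so Spark SQL can skip unread columns
    // and push the BETWEEN predicate down to the reader (Spark 1.2.x API).
    val parquetData = sqlContext.parquetFile("hdfs:///data/tableX.parquet")
    parquetData.registerTempTable("tableX")

    // Selecting only the needed column (instead of *) and filtering in SQL
    // lets the optimizer apply both column pruning and predicate pushdown.
    val pruned = sqlContext.sql(
      "SELECT attribute1 FROM tableX WHERE attribute1 BETWEEN 0 AND 5")

With a SELECT * over an in-memory RDD and the filter done in user code, as in the original question, neither optimization can apply, which is consistent with the SQL path adding overhead rather than removing it.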