Here is an example using rows directly: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema
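[For a concrete sense of that approach, a minimal Scala sketch along the lines of the linked guide, reusing the thread's tableX/attribute1 names illustratively. This uses the 1.3.0 API; on 1.2.x the type classes live directly in org.apache.spark.sql, the method is applySchema, and the result is a SchemaRDD rather than a DataFrame.]

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext sc

    // Build Rows directly instead of mapping each record onto a JavaBean.
    val rowRDD = sc.textFile("/path/to/input")  // input path illustrative
      .map(_.split(","))
      .map(fields => Row(fields(0).trim.toInt, fields(1)))

    // Specify the schema programmatically.
    val schema = StructType(Seq(
      StructField("attribute1", IntegerType, nullable = false),
      StructField("attribute2", StringType, nullable = true)))

    // On 1.2.x this would be sqlContext.applySchema(rowRDD, schema).
    val table = sqlContext.createDataFrame(rowRDD, schema)
    table.registerTempTable("tableX")

    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").collect()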
Avro or parquet input would likely give you the best performance.

On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com> wrote:

> Thanks for the hints, guys! Much appreciated!
> Even if I just do something like:
>
> "Select * from tableX where attribute1 < 5"
>
> I see similar behaviour.
>
> @Michael
> Could you point me to any sample code that uses Spark's Rows? We are at a
> phase where we can actually change our JavaBeans for something that
> provides better performance than what we are seeing now. Would you
> recommend using the Avro representation then?
> Thanks again!
>
> Renato M.
>
> 2015-04-21 1:18 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>
>> There is a cost to converting from JavaBeans to Rows, and this code path
>> has not been optimized. That is likely what you are seeing.
>>
>> On Mon, Apr 20, 2015 at 3:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Spark SQL optimizes better primarily through column pruning and
>>> predicate pushdown. Here you are not taking advantage of either.
>>>
>>> I am curious to know what goes into your filter function, as you are
>>> not using a filter on the SQL side.
>>>
>>> Best
>>> Ayan
>>>
>>> On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <renatoj.marroq...@gmail.com> wrote:
>>>
>>>> Does anybody have an idea? A clue? A hint?
>>>> Thanks!
>>>>
>>>> Renato M.
>>>>
>>>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com>:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a simple query, "Select * from tableX where attribute1 between
>>>>> 0 and 5", that I run over a Kryo file with four partitions, which
>>>>> ends up being around 3.5 million rows in our case.
>>>>> If I run this query with a simple map().filter() it takes around ~9.6
>>>>> seconds, but when I apply a schema, register the table into a
>>>>> SqlContext, and then run the query, it takes around ~16 seconds. This
>>>>> is using Spark 1.2.1 with Scala 2.10.0.
>>>>> I am wondering why there is such a big performance gap if it is just
>>>>> a filter. Internally, the relation files are mapped to a JavaBean.
>>>>> Could this difference in data representation (JavaBeans vs. Spark
>>>>> SQL's internal representation) lead to such a gap? Is there anything
>>>>> I could do to make the performance get closer to the "hard-coded"
>>>>> option?
>>>>> Thanks in advance for any suggestions or ideas.
>>>>>
>>>>> Renato M.
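[To ground the numbers in the original question, a sketch of the two paths being compared. The case class and input path are stand-ins: the thread's actual JavaBean and Kryo input format are not shown, and on 1.2.x the implicit RDD-to-SchemaRDD conversion comes from sqlContext.createSchemaRDD.]

    import org.apache.spark.sql.SQLContext

    // Stand-in for the JavaBean described in the thread (fields illustrative).
    case class TableX(attribute1: Int, attribute2: String)

    val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext sc
    import sqlContext.createSchemaRDD    // 1.2.x implicit: RDD[Product] -> SchemaRDD

    val rdd = sc.objectFile[TableX]("/path/to/input")  // illustrative; the real input is a Kryo file

    // "Hard-coded" path (~9.6 s in the thread): filter directly on the RDD.
    val direct = rdd.filter(r => r.attribute1 >= 0 && r.attribute1 <= 5)

    // SQL path (~16 s in the thread): register and query; this pays the
    // object-to-Row conversion cost Michael describes.
    rdd.registerTempTable("tableX")
    val viaSql = sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")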
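[And on the Avro/parquet suggestion and Ayan's point about column pruning and predicate pushdown, a sketch of what a parquet-backed version could look like. Paths and table names are illustrative; depending on version, pushdown may need spark.sql.parquet.filterPushdown enabled.]

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext sc

    // One-time conversion: write the existing table out as parquet
    // (assumes a SchemaRDD/DataFrame named tableX from an earlier step).
    // tableX.saveAsParquetFile("/data/tableX.parquet")

    // Query the parquet copy; the BETWEEN predicate can be pushed down to the
    // reader, and a narrower SELECT list would let only those columns be read.
    val parquetTable = sqlContext.parquetFile("/data/tableX.parquet")
    parquetTable.registerTempTable("tableX_parquet")
    sqlContext.sql("SELECT * FROM tableX_parquet WHERE attribute1 BETWEEN 0 AND 5")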