Thanks Michael! I have tried applying my schema programmatically, but I didn't see any performance improvement :( Could you point me to some code examples using Avro, please? Many thanks again!
Renato M.

2015-04-21 20:45 GMT+02:00 Michael Armbrust <mich...@databricks.com>:

> Here is an example using rows directly:
>
> https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema
>
> Avro or Parquet input would likely give you the best performance.
>
> On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo <
> renatoj.marroq...@gmail.com> wrote:
>
>> Thanks for the hints, guys! Much appreciated!
>> Even if I just do something like:
>>
>> "SELECT * FROM tableX WHERE attribute1 < 5"
>>
>> I see similar behaviour.
>>
>> @Michael
>> Could you point me to any sample code that uses Spark's Rows? We are at a
>> phase where we can actually change our JavaBeans for something that
>> provides better performance than what we are seeing now. Would you
>> recommend using the Avro representation then?
>> Thanks again!
>>
>> Renato M.
>>
>> 2015-04-21 1:18 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>>
>>> There is a cost to converting from JavaBeans to Rows, and this code path
>>> has not been optimized. That is likely what you are seeing.
>>>
>>> On Mon, Apr 20, 2015 at 3:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Spark SQL optimizes primarily through column pruning and predicate
>>>> pushdown. Here you are not taking advantage of either.
>>>>
>>>> I am curious to know what goes into your filter function, as you are
>>>> not using a filter on the SQL side.
>>>>
>>>> Best
>>>> Ayan
>>>>
>>>> On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <
>>>> renatoj.marroq...@gmail.com> wrote:
>>>>
>>>>> Does anybody have an idea? A clue? A hint?
>>>>> Thanks!
>>>>>
>>>>> Renato M.
>>>>>
>>>>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <
>>>>> renatoj.marroq...@gmail.com>:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a simple query, "SELECT * FROM tableX WHERE attribute1 BETWEEN
>>>>>> 0 AND 5", that I run over a Kryo file with four partitions, which ends
>>>>>> up being around 3.5 million rows in our case.
>>>>>> If I run this query as a simple map().filter(), it takes around ~9.6
>>>>>> seconds, but when I apply a schema, register the table in a SQLContext,
>>>>>> and then run the query, it takes around ~16 seconds. This is using
>>>>>> Spark 1.2.1 with Scala 2.10.0.
>>>>>> I am wondering why there is such a big gap in performance if it is
>>>>>> just a filter. Internally, the relation files are mapped to a JavaBean.
>>>>>> Could this difference in data representation (JavaBeans vs. Spark SQL's
>>>>>> internal representation) lead to such a difference? Is there anything I
>>>>>> could do to make the performance get closer to the "hard-coded" option?
>>>>>> Thanks in advance for any suggestions or ideas.
>>>>>>
>>>>>> Renato M.
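
For reference, a minimal sketch of the Row-based, programmatic-schema approach described at Michael's link, written against the Spark 1.3 Scala API; the column names, types, and input path here are assumptions based on the query in the thread, not the actual dataset:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types._

    // Assumes an existing SparkContext `sc`; schema and path are made up.
    val sqlContext = new SQLContext(sc)

    // Declare the schema explicitly instead of reflecting over a JavaBean.
    val schema = StructType(Seq(
      StructField("attribute1", IntegerType, nullable = false),
      StructField("attribute2", StringType, nullable = true)))

    // Map the raw records straight into Rows that match the schema,
    // skipping the JavaBean layer (and its conversion cost) entirely.
    val rowRDD = sc.textFile("tableX.csv")
      .map(_.split(","))
      .map(p => Row(p(0).trim.toInt, p(1)))

    val df = sqlContext.createDataFrame(rowRDD, schema) // applySchema(...) on 1.2.x
    df.registerTempTable("tableX")
    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")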
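And a sketch of moving the same table onto a columnar source, which is where the column pruning and predicate pushdown Ayan mentions actually pay off. saveAsParquetFile/parquetFile are core Spark 1.3 calls; the Avro line assumes the separate com.databricks:spark-avro package is on the classpath, and all paths are made up:

    // One-off conversion: persist the existing DataFrame `df` as Parquet.
    df.saveAsParquetFile("tableX.parquet")

    // From then on, read the columnar file directly; only the columns and
    // row groups the query needs are scanned.
    val parquetDF = sqlContext.parquetFile("tableX.parquet")
    parquetDF.registerTempTable("tableX")
    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")

    // Avro input looks similar through the external data-source API:
    // val avroDF = sqlContext.load("tableX.avro", "com.databricks.spark.avro")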