https://github.com/databricks/spark-avro
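For reference, a minimal sketch of the spark-avro usage the link above describes. The input path ("episodes.avro") and the title/doctor fields are hypothetical placeholders, and the avroFile() helper assumes a spark-avro build matching your Spark version is on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // brings avroFile() into scope on SQLContext

object AvroScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-scan"))
    val sqlContext = new SQLContext(sc)

    // The schema is read from the Avro file itself; no JavaBeans involved
    val episodes = sqlContext.avroFile("episodes.avro")
    episodes.registerTempTable("episodes")

    sqlContext.sql("SELECT title FROM episodes WHERE doctor < 5")
      .collect()
      .foreach(println)
  }
}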
On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:

> Thanks Michael!
> I have tried applying my schema programmatically, but I didn't get any
> improvement in performance :(
> Could you point me to some code examples using Avro, please?
> Many thanks again!
>
>
> Renato M.
>
> 2015-04-21 20:45 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>
>> Here is an example using rows directly:
>>
>> https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema
>>
>> Avro or Parquet input would likely give you the best performance.
>>
>> On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com> wrote:
>>
>>> Thanks for the hints, guys! Much appreciated!
>>> Even if I just do something like:
>>>
>>> "Select * from tableX where attribute1 < 5"
>>>
>>> I see similar behaviour.
>>>
>>> @Michael
>>> Could you point me to any sample code that uses Spark's Rows? We are at
>>> a phase where we can actually change our JavaBeans for something that
>>> provides better performance than what we are seeing now. Would you
>>> recommend using the Avro representation then?
>>> Thanks again!
>>>
>>>
>>> Renato M.
>>>
>>> 2015-04-21 1:18 GMT+02:00 Michael Armbrust <mich...@databricks.com>:
>>>
>>>> There is a cost to converting from JavaBeans to Rows, and this code
>>>> path has not been optimized. That is likely what you are seeing.
>>>>
>>>> On Mon, Apr 20, 2015 at 3:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> SparkSQL optimizes primarily through column pruning and predicate
>>>>> pushdown, and here you are taking advantage of neither.
>>>>>
>>>>> I am curious to know what goes on in your filter function, as you are
>>>>> not using a filter on the SQL side.
>>>>>
>>>>> Best
>>>>> Ayan
>>>>> On 21 Apr 2015 08:05, "Renato Marroquín Mogrovejo" <
>>>>> renatoj.marroq...@gmail.com> wrote:
>>>>>
>>>>>> Does anybody have an idea? A clue? A hint?
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> Renato M.
>>>>>>
>>>>>> 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo <
>>>>>> renatoj.marroq...@gmail.com>:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a simple query, "Select * from tableX where attribute1
>>>>>>> between 0 and 5", that I run over a Kryo file with four partitions,
>>>>>>> which ends up being around 3.5 million rows in our case.
>>>>>>> If I run this query as a simple map().filter(), it takes around 9.6
>>>>>>> seconds, but when I apply a schema, register the table in a
>>>>>>> SQLContext, and then run the query, it takes around 16 seconds. This
>>>>>>> is using Spark 1.2.1 with Scala 2.10.0.
>>>>>>> I am wondering why there is such a big performance gap if it is just
>>>>>>> a filter. Internally, the relation's files are mapped to JavaBeans.
>>>>>>> Could this different data representation (JavaBeans vs. SparkSQL's
>>>>>>> internal representation) lead to such a difference? Is there anything
>>>>>>> I could do to bring the performance closer to the "hard-coded" option?
>>>>>>> Thanks in advance for any suggestions or ideas.
>>>>>>>
>>>>>>>
>>>>>>> Renato M.
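To make the thread's suggestions concrete, here is a minimal sketch of the "programmatically specifying the schema" approach Michael links to, applied to the tableX query above. The one-column schema, the "tableX.txt" input, and the line-parsing logic are assumptions for illustration; the code uses the Spark 1.3 API, and on Spark 1.2.x (which Renato is running) the equivalent of createDataFrame is sqlContext.applySchema:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object RowSchemaScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("row-schema-scan"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical text input: one integer attribute per line
    val schema = StructType(Seq(StructField("attribute1", IntegerType, nullable = false)))
    val rows = sc.textFile("tableX.txt").map(line => Row(line.trim.toInt))

    // Building Rows directly skips the JavaBean-to-Row conversion
    // (use sqlContext.applySchema(rows, schema) on Spark 1.2.x)
    val table = sqlContext.createDataFrame(rows, schema)
    table.registerTempTable("tableX")

    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5")
      .collect()
  }
}

Because the table is built from Spark SQL's own Row representation, this should avoid the unoptimized JavaBean-to-Row conversion Michael identifies as the likely source of the 9.6s-vs-16s gap.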