Re: SparkSQL performance

2015-04-22 Thread Michael Armbrust
https://github.com/databricks/spark-avro
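
A minimal sketch of what loading Avro through that package could look like, assuming
the spark-avro 1.0.x API against Spark 1.2 and a spark-shell session; the path and the
attribute1 column are placeholders, not anything taken from the package itself:

    // `sc` is the spark-shell SparkContext; the spark-avro jar must be on the classpath.
    import org.apache.spark.sql.SQLContext
    import com.databricks.spark.avro._   // adds avroFile() to SQLContext (spark-avro 1.0.x)

    val sqlContext = new SQLContext(sc)

    // The schema comes from the Avro file's own metadata, so no JavaBean is involved.
    val events = sqlContext.avroFile("hdfs:///data/tableX.avro")
    events.registerTempTable("tableX")

    // The query now runs over Spark SQL's own row representation.
    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count()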

On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com wrote:

 Thanks Michael!
 I have tried applying my schema programmatically but I didn't get any
 improvement in performance :(
 Could you point me to some code examples using Avro please?
 Many thanks again!


 Renato M.

 2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com:

 Here is an example using rows directly:

 https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

 Avro or parquet input would likely give you the best performance.

 On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Thanks for the hints guys! Much appreciated!
 Even if I just do something like:

 Select * from tableX where attribute1 < 5

 I see similar behaviour.

 @Michael
 Could you point me to any sample code that uses Spark's Rows? We are at
 a phase where we can actually change our JavaBeans for something that
 provides better performance than what we are seeing now. Would you
 recommend using the Avro representation then?
 Thanks again!


 Renato M.

 2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes primarily through column pruning and predicate
 pushdown. Here you are not taking advantage of either.

 I am curious to know what goes into your filter function, as you are not
 using a filter on the SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between
 0 and 5" that I run over a Kryo file with four partitions that ends up
 being around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds, but when I apply the schema, register the table into a
 SqlContext, and then run the query, it takes around ~16 seconds. This is
 using Spark 1.2.1 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is
 just a filter. Internally, the relation files are mapped to a JavaBean.
 Could this difference in data representation (JavaBeans vs. SparkSQL's
 internal representation) lead to such a gap? Is there anything I could do
 to make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.









Re: SparkSQL performance

2015-04-21 Thread Michael Armbrust
Here is an example using rows directly:
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

Avro or parquet input would likely give you the best performance.
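
A rough sketch of the linked programmatic-schema approach, written against the
Spark 1.2 API used elsewhere in this thread (applySchema; Spark 1.3 renames it
createDataFrame). The input path and field names are placeholders:

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)   // `sc` is the spark-shell SparkContext

    // Describe the schema explicitly instead of going through a JavaBean.
    val schema = StructType(Seq(
      StructField("attribute1", IntegerType, nullable = false),
      StructField("attribute2", StringType, nullable = true)))

    // Build an RDD[Row] directly from the raw records, skipping the bean layer.
    val rowRDD = sc.textFile("hdfs:///data/tableX.csv")
      .map(_.split(","))
      .map(p => Row(p(0).trim.toInt, p(1)))

    val tableX = sqlContext.applySchema(rowRDD, schema)
    tableX.registerTempTable("tableX")

    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count()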

On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com wrote:

 Thanks for the hints guys! Much appreciated!
 Even if I just do something like:

 Select * from tableX where attribute1 < 5

 I see similar behaviour.

 @Michael
 Could you point me to any sample code that uses Spark's Rows? We are at
 a phase where we can actually change our JavaBeans for something that
 provides better performance than what we are seeing now. Would you
 recommend using the Avro representation then?
 Thanks again!


 Renato M.

 2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes primarily through column pruning and predicate
 pushdown. Here you are not taking advantage of either.

 I am curious to know what goes into your filter function, as you are not
 using a filter on the SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between
 0 and 5" that I run over a Kryo file with four partitions that ends up
 being around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds, but when I apply the schema, register the table into a
 SqlContext, and then run the query, it takes around ~16 seconds. This is
 using Spark 1.2.1 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is
 just a filter. Internally, the relation files are mapped to a JavaBean.
 Could this difference in data representation (JavaBeans vs. SparkSQL's
 internal representation) lead to such a gap? Is there anything I could do
 to make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.







Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks Michael!
I have tried applying my schema programmatically but I didn't get any
improvement in performance :(
Could you point me to some code examples using Avro please?
Many thanks again!


Renato M.

2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com:

 Here is an example using rows directly:

 https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

 Avro or parquet input would likely give you the best performance.

 On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Thanks for the hints guys! Much appreciated!
 Even if I just do something like:

 Select * from tableX where attribute1 < 5

 I see similar behaviour.

 @Michael
 Could you point me to any sample code that uses Spark's Rows? We are at
 a phase where we can actually change our JavaBeans for something that
 provides better performance than what we are seeing now. Would you
 recommend using the Avro representation then?
 Thanks again!


 Renato M.

 2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes primarily through column pruning and predicate
 pushdown. Here you are not taking advantage of either.

 I am curious to know what goes into your filter function, as you are not
 using a filter on the SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between
 0 and 5" that I run over a Kryo file with four partitions that ends up
 being around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds, but when I apply the schema, register the table into a
 SqlContext, and then run the query, it takes around ~16 seconds. This is
 using Spark 1.2.1 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is
 just a filter. Internally, the relation files are mapped to a JavaBean.
 Could this difference in data representation (JavaBeans vs. SparkSQL's
 internal representation) lead to such a gap? Is there anything I could do
 to make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.








Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks for the hints guys! Much appreciated!
Even if I just do something like:

Select * from tableX where attribute1 < 5

I see similar behaviour.

@Michael
Could you point me to any sample code that uses Spark's Rows? We are at a
phase where we can actually change our JavaBeans for something that
provides better performance than what we are seeing now. Would you
recommend using the Avro representation then?
Thanks again!


Renato M.

2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes primarily through column pruning and predicate
 pushdown. Here you are not taking advantage of either.

 I am curious to know what goes into your filter function, as you are not
 using a filter on the SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between
 0 and 5" that I run over a Kryo file with four partitions that ends up
 being around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds, but when I apply the schema, register the table into a
 SqlContext, and then run the query, it takes around ~16 seconds. This is
 using Spark 1.2.1 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is
 just a filter. Internally, the relation files are mapped to a JavaBean.
 Could this difference in data representation (JavaBeans vs. SparkSQL's
 internal representation) lead to such a gap? Is there anything I could do
 to make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.






Re: SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Does anybody have an idea? a clue? a hint?
Thanks!


Renato M.

2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between
 0 and 5" that I run over a Kryo file with four partitions that ends up
 being around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds, but when I apply the schema, register the table into a
 SqlContext, and then run the query, it takes around ~16 seconds. This is
 using Spark 1.2.1 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is
 just a filter. Internally, the relation files are mapped to a JavaBean.
 Could this difference in data representation (JavaBeans vs. SparkSQL's
 internal representation) lead to such a gap? Is there anything I could do
 to make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.
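
For reference, a sketch of roughly the two code paths being compared in the quoted
question. This is only an approximation: Record stands in for the actual JavaBean,
objectFile() for however the Kryo-backed file is really loaded, and count() for the
real action:

    import org.apache.spark.sql.SQLContext

    case class Record(attribute1: Int, attribute2: String)

    // Placeholder load; `sc` is the spark-shell SparkContext.
    val records = sc.objectFile[Record]("hdfs:///data/tableX")

    // Hand-coded path (~9.6 s in the report): a plain RDD filter over the objects.
    records.filter(r => r.attribute1 >= 0 && r.attribute1 <= 5).count()

    // Spark SQL path (~16 s in the report): each record is converted to a Row, the
    // table is registered, and the query is parsed and planned before the same
    // filter runs; that conversion and planning overhead is the gap discussed here.
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD (Spark 1.2)
    records.registerTempTable("tableX")
    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count()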



Re: SparkSQL performance

2015-04-20 Thread ayan guha
SparkSQL optimizes primarily through column pruning and predicate
pushdown. Here you are not taking advantage of either.

I am curious to know what goes into your filter function, as you are not
using a filter on the SQL side.

Best
Ayan
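
A rough illustration of the column-pruning and predicate-pushdown point, as a sketch
only: it assumes a Parquet copy of the data, placeholder column names, and the
Spark 1.2-era API:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // `sc` is the spark-shell SparkContext

    // Parquet is a columnar source, so Spark SQL can read only the referenced
    // columns (column pruning) and evaluate the WHERE clause close to the scan
    // (predicate pushdown).
    val tableX = sqlContext.parquetFile("hdfs:///data/tableX.parquet")
    tableX.registerTempTable("tableX")

    // Benefits from both optimizations: only two columns are read and the
    // predicate is applied near the scan.
    sqlContext.sql(
      "SELECT attribute1, attribute2 FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count()

    // SELECT * still reads every column, so column pruning buys nothing here; and
    // when the table is backed by an RDD of JavaBeans (as in this thread) rather
    // than Parquet, there is no scan to push the predicate into either, which is
    // why it cannot beat a hand-written filter on the same RDD.
    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count()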
On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between
 0 and 5" that I run over a Kryo file with four partitions that ends up
 being around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds, but when I apply the schema, register the table into a
 SqlContext, and then run the query, it takes around ~16 seconds. This is
 using Spark 1.2.1 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is
 just a filter. Internally, the relation files are mapped to a JavaBean.
 Could this difference in data representation (JavaBeans vs. SparkSQL's
 internal representation) lead to such a gap? Is there anything I could do
 to make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.





Re: SparkSQL performance

2015-04-20 Thread Michael Armbrust
There is a cost to converting from JavaBeans to Rows and this code path has
not been optimized.  That is likely what you are seeing.

On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes primarily through column pruning and predicate
 pushdown. Here you are not taking advantage of either.

 I am curious to know what goes into your filter function, as you are not
 using a filter on the SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query "Select * from tableX where attribute1 between
 0 and 5" that I run over a Kryo file with four partitions that ends up
 being around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds, but when I apply the schema, register the table into a
 SqlContext, and then run the query, it takes around ~16 seconds. This is
 using Spark 1.2.1 with Scala 2.10.0.
 I am wondering why there is such a big gap in performance if it is
 just a filter. Internally, the relation files are mapped to a JavaBean.
 Could this difference in data representation (JavaBeans vs. SparkSQL's
 internal representation) lead to such a gap? Is there anything I could do
 to make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.





Re: SparkSQL Performance Tuning Options

2015-01-27 Thread Cheng Lian


On 1/27/15 5:55 PM, Cheng Lian wrote:


On 1/27/15 11:38 AM, Manoj Samel wrote:

Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.

The use case is a Spark YARN app that will start and serve as a query 
server for multiple users, i.e. always up and running. At startup, there 
is an option to cache data and also pre-compute some result sets, hash 
maps etc. that would likely be asked for by client APIs. I.e. there is 
some opportunity to use startup time to precompute/cache - but the query 
response time requirement on the large data set is very stringent


Hoping to use SparkSQL (but a combination of SQL and RDD APIs is also 
OK).


* Does SparkSQL execution use underlying partition information? 
(Data is from HDFS)
No. For example, if the underlying data has already been partitioned 
by some key, Spark SQL doesn't know it, and can't leverage that 
information to avoid a shuffle when doing aggregation on that key. 
However, partitioning the data ahead of time does help minimize 
shuffle network IO. There's a JIRA ticket to make Spark SQL aware of 
the underlying data distribution.


Maybe you are asking about locality? If that's the case, I just want to 
add that Spark SQL does understand the locality information of the 
underlying data. It's obtained from the Hadoop InputFormat.


* Are there any ways to give hints to the SparkSQL execution about 
any precomputed/pre-cached RDDs?
Instead of caching the raw RDD, it's recommended to transform the raw 
RDD into a SchemaRDD and then cache that, so that the in-memory columnar 
storage can be used. Also, Spark SQL recognizes cached SchemaRDDs 
automatically. (A rough sketch of this appears at the end of this message.)
* Packages spark.sql.execution, spark.sql.execution.joins and other 
sql.xxx packages - is using these for tuning the query plan recommended? 
I would like to keep this as-needed if possible.
Not sure whether I understood this question. Are you trying to use 
internal APIs to do customized optimizations?
* Features not in the current release but scheduled for an upcoming 
release would also be good to know about.


Thanks,

PS: This is not a small topic so if someone prefers to start a 
offline thread on details, I can do that and summarize the 
conclusions back to this thread.
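
A rough sketch of the caching advice above (give the RDD a schema, register it, and
cache the table so the in-memory columnar format is used), using Spark 1.2 APIs and
placeholder paths, table and column names:

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)   // `sc` is the application's SparkContext

    // Give the raw RDD a schema first, so that caching uses the in-memory
    // columnar format instead of storing deserialized objects.
    val schema = StructType(Seq(
      StructField("key", StringType, nullable = false),
      StructField("value", DoubleType, nullable = true)))
    val rowRDD = sc.textFile("hdfs:///data/facts.csv")
      .map(_.split(","))
      .map(p => Row(p(0), p(1).trim.toDouble))

    sqlContext.applySchema(rowRDD, schema).registerTempTable("facts")

    // cacheTable marks the table for columnar in-memory caching at startup; later
    // queries that reference "facts" pick up the cached data automatically.
    sqlContext.cacheTable("facts")
    sqlContext.sql("SELECT key, SUM(value) FROM facts GROUP BY key").collect()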








-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
I did some simple experiments with Impala and Spark, and Impala came out ahead. 
But it’s also less flexible, couldn’t handle irregular schemas, didn't support 
JSON, and so on.

On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote:

 I agree. My personal experience with Spark core is that it performs really 
 well once you tune it properly. 
 
 As far as I understand, SparkSQL under the hood performs many of these 
 optimizations (order of Spark operations) and uses a more efficient storage 
 format. Is this assumption correct? 
 
 Has anyone done any comparison of SparkSQL with Impala? The fact that many 
 of the queries don't even finish in the benchmark is quite surprising and 
 hard to believe. 
 
 A few months ago there were a few emails about Spark not being able to handle 
 large volumes (TBs) of data. That myth was busted recently when the folks at 
 Databricks published their sorting record results. 
  
 
 Thanks
 -Soumya
 
 
 
 
  
 
 On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote:
 We have seen all kinds of results published that often contradict each other. 
 My take is that the authors often know more tricks about how to tune their 
 own/familiar products than the others. So the product in focus is tuned for 
 ideal performance while the competitors are not. The authors are not 
 necessarily biased but as a consequence the results are.
 
 Ideally it’s critical for the user community to be informed of all the 
 in-depth tuning tricks of all products. However, realistically, there is a 
 big gap in terms of documentation. Hope the Spark folks will make a 
 difference. :-)
 
 Du
 
 
 From: Soumya Simanta soumya.sima...@gmail.com
 Date: Friday, October 31, 2014 at 4:04 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: SparkSQL performance
 
 I was really surprised to see the results here, esp. SparkSQL not completing
 http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
 
 I was under the impression that SparkSQL performs really well because it can 
 optimize the RDD operations and load only the columns that are required. This 
 essentially means in most cases SparkSQL should be as fast as Spark is. 
 
 I would be very interested to hear what others in the group have to say about 
 this. 
 
 Thanks
 -Soumya
 
 
 



Re: SparkSQL performance

2014-10-31 Thread Du Li
We have seen all kinds of results published that often contradict each other. 
My take is that the authors often know more tricks about how to tune their 
own/familiar products than the others. So the product in focus is tuned for 
ideal performance while the competitors are not. The authors are not 
necessarily biased but as a consequence the results are.

Ideally it’s critical for the user community to be informed of all the in-depth 
tuning tricks of all products. However, realistically, there is a big gap in 
terms of documentation. Hope the Spark folks will make a difference. :-)

Du


From: Soumya Simanta soumya.sima...@gmail.com
Date: Friday, October 31, 2014 at 4:04 PM
To: user@spark.apache.org user@spark.apache.org
Subject: SparkSQL performance

I was really surprised to see the results here, esp. SparkSQL not completing
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it can 
optimize the RDD operations and load only the columns that are required. This 
essentially means in most cases SparkSQL should be as fast as Spark is.

I would be very interested to hear what others in the group have to say about 
this.

Thanks
-Soumya




Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
I agree. My personal experience with Spark core is that it performs really
well once you tune it properly.

As far as I understand, SparkSQL under the hood performs many of these
optimizations (order of Spark operations) and uses a more efficient storage
format. Is this assumption correct?

Has anyone done any comparison of SparkSQL with Impala? The fact that many
of the queries don't even finish in the benchmark is quite surprising and
hard to believe.

A few months ago there were a few emails about Spark not being able to
handle large volumes (TBs) of data. That myth was busted recently when the
folks at Databricks published their sorting record results.


Thanks
-Soumya






On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote:

   We have seen all kinds of results published that often contradict each
 other. My take is that the authors often know more tricks about how to tune
 their own/familiar products than the others. So the product in focus is
 tuned for ideal performance while the competitors are not. The authors are
 not necessarily biased but as a consequence the results are.

  Ideally it’s critical for the user community to be informed of all the
 in-depth tuning tricks of all products. However, realistically, there is a
 big gap in terms of documentation. Hope the Spark folks will make a
 difference. :-)

  Du


   From: Soumya Simanta soumya.sima...@gmail.com
 Date: Friday, October 31, 2014 at 4:04 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: SparkSQL performance

   I was really surprised to see the results here, esp. SparkSQL not
 completing
 http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

  I was under the impression that SparkSQL performs really well because it
 can optimize the RDD operations and load only the columns that are
 required. This essentially means in most cases SparkSQL should be as fast
 as Spark is.

  I would be very interested to hear what others in the group have to say
 about this.

  Thanks
 -Soumya