Re: Announcing Spark SQL

2014-03-29 Thread Michael Armbrust
On Fri, Mar 28, 2014 at 9:53 PM, Rohit Rai ro...@tuplejump.com wrote:

 Upon discussion with a couple of our clients, it seems the reason they
 would prefer using Hive is that they have already invested a lot in it,
 mostly in UDFs and HiveQL.
 1. Are there any plans to develop the SQL parser to handle more complex
 queries like HiveQL? Can we just plug in a custom parser instead of
 bringing in the whole Hive dependencies?


We definitely want to have a more complete SQL parser without having to
pull in all of Hive. I think there are a couple of ways to do this.

1. Use a SQL-92 parser from something like Optiq, or write our own.
2. I haven't fully investigated the published Hive artifacts, but if there
is some way to depend on only the parser, that would be great. If someone
has resources to investigate using the Hive parser without needing to
depend on all of Hive, this is a place where we would certainly welcome
contributions. We could then consider making HiveQL an option in a
standard SQLContext.
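
As a concrete illustration of option 1, here is a toy fragment in Scala's
parser combinators (the same style of approach Catalyst's own SqlParser
takes); the grammar below is invented for the sketch and handles only the
simplest SELECT:

    import scala.util.parsing.combinator.JavaTokenParsers

    // Toy grammar: parses only `SELECT col, ... FROM table`.
    // Keywords are matched case-sensitively to keep the sketch short.
    object MiniSqlParser extends JavaTokenParsers {
      case class Select(projection: List[String], table: String)

      def select: Parser[Select] =
        ("SELECT" ~> repsep(ident, ",")) ~ ("FROM" ~> ident) ^^ {
          case cols ~ table => Select(cols, table)
        }

      def apply(sql: String): ParseResult[Select] = parseAll(select, sql)
    }

    // MiniSqlParser("SELECT name, age FROM people")
    //   => [1.29] parsed: Select(List(name, age),people)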


 2. Is there any way we can support UDFs in Catalyst without using Hive? It
 will be fine if we don't support Hive UDFs as-is and need a minor porting
 effort.


All of the execution support for native Scala UDFs is already there, and in
fact when you use the DSL where clause
(http://people.apache.org/~pwendell/catalyst-docs/api/sql/core/index.html#org.apache.spark.sql.SchemaRDD)
you are using this machinery.  For Spark 1.1 we will find a more general
way to expose this to users.
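
For a concrete picture, a minimal sketch against the Spark 1.0-era
SchemaRDD DSL (assuming a SparkContext named sc, as in the shell; the
Symbol-based where overload takes an ordinary Scala closure as its
predicate):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Person] => SchemaRDD

    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 17)))

    // The closure below is a native Scala UDF: Catalyst plans and executes
    // it just like a built-in predicate.
    val adults = people.where('age)((age: Int) => age >= 18)
    adults.collect().foreach(println)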


Re: Announcing Spark SQL

2014-03-28 Thread Rohit Rai
Thanks Patrick,

I was thinking about that... Upon analysis, I realized it would be
something similar to the way HiveContext uses its custom Catalog. I will
review it again, along the lines of implementing a SchemaRDD backed by
Cassandra. Thanks for the pointer.

Upon discussion with a couple of our clients, it seems the reason they
would prefer using Hive is that they have already invested a lot in it,
mostly in UDFs and HiveQL.
1. Are there any plans to develop the SQL parser to handle more complex
queries like HiveQL? Can we just plug in a custom parser instead of
bringing in the whole Hive dependencies?
2. Is there any way we can support UDFs in Catalyst without using Hive? It
will be fine if we don't support Hive UDFs as-is and need a minor porting
effort.


Regards,
Rohit


Founder & CEO, Tuplejump, Inc.
www.tuplejump.com
The Data Engineering Platform


On Fri, Mar 28, 2014 at 12:48 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Rohit,

 I think external tables based on Cassandra or other datastores will work
 out-of-the-box if you build Catalyst with Hive support.

 Michael may have feelings about this, but I'd guess the longer-term design
 for having schema support for Cassandra/HBase etc. likely wouldn't rely on
 Hive external tables, because it's an unnecessary layer of indirection.

 Spark should be able to directly load a SchemaRDD from Cassandra by just
 letting the user give relevant information about the Cassandra schema. And
 it should let you write back to Cassandra by giving a mapping of fields to
 the respective Cassandra columns. I think all of this would be fairly easy
 to implement on SchemaRDD, and it will likely make it into Spark 1.1.

 - Patrick




Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On 27 March 2014 09:47, andy petrella andy.petre...@gmail.com wrote:

 I hijack the thread, but my 2c is that this feature is also important to
enable ad-hoc queries, which are done at runtime. It doesn't remove the
interest of such a macro for precompiled jobs, of course, but it may not be
the first use case envisioned for this Spark SQL.


I'm not sure I see what you call ad-hoc queries... Any sample?

 Again, only my 0.2c (ok, I divided by 10 after writing my thoughts ^^)

 Andy

 On Thu, Mar 27, 2014 at 9:16 AM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:

 Hi,
 Quite interesting!

 Suggestion: why not go even fancier and parse SQL queries at compile time
with a macro? ;)

 Pascal
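
(For what it's worth, a toy illustration of that idea as a Scala 2.10 def
macro; the names are invented and the check is a stand-in for real SQL
parsing, as it only rejects literals that don't start with SELECT:)

    import scala.language.experimental.macros
    import scala.reflect.macros.Context

    object SqlMacros {
      // Validates the query at compile time; a real version would run a
      // full SQL parser over the literal instead of this prefix check.
      def sql(query: String): String = macro sqlImpl

      def sqlImpl(c: Context)(query: c.Expr[String]): c.Expr[String] = {
        import c.universe._
        query.tree match {
          case Literal(Constant(q: String)) =>
            if (!q.trim.toUpperCase.startsWith("SELECT"))
              c.abort(c.enclosingPosition, s"not a SELECT statement: $q")
            query
          case _ =>
            c.abort(c.enclosingPosition, "sql() requires a string literal")
        }
      }
    }

    // SqlMacros.sql("SELECT name FROM people")  // compiles
    // SqlMacros.sql("DELETE FROM people")       // fails at compile time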





Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On Thu, Mar 27, 2014 at 10:22 AM, andy petrella andy.petre...@gmail.com wrote:

 I just mean queries sent at runtime ^^, like for any RDBMS.
 In our project we have such a requirement: a layer to play with the data
 (a custom, low-level service layer of a lambda architecture), and
 something like this is interesting.


Ok that's what I thought! But for these runtime queries, is a macro useful
for you?








Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
nope (what I said :-P)


On Thu, Mar 27, 2014 at 11:05 AM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:




 On Thu, Mar 27, 2014 at 10:22 AM, andy petrella 
 andy.petre...@gmail.com wrote:

 I just mean queries sent at runtime ^^, like for any RDBMS.
 In our project we have such a requirement: a layer to play with the data
 (a custom, low-level service layer of a lambda architecture), and
 something like this is interesting.


 Ok that's what I thought! But for these runtime queries, is a macro useful
 for you?









Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On Thu, Mar 27, 2014 at 11:08 AM, andy petrella andy.petre...@gmail.com wrote:

 nope (what I said :-P)


That's also my answer to my own question :D

but I didn't understand this in your sentence: "my 2c is that this feature
is also important to enable ad-hoc queries which are done at runtime."










Re: Announcing Spark SQL

2014-03-27 Thread yana
Does Shark not suit your needs? That's what we use at the moment, and it's
been good.


Sent from my Samsung Galaxy S®4

-------- Original message --------
From: andy petrella andy.petre...@gmail.com 
Date: 03/27/2014 6:08 AM (GMT-05:00) 
To: user@spark.apache.org 
Subject: Re: Announcing Spark SQL 

nope (what I said :-P)







Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
Yes it could, of course. I didn't say that there is no tool to do it,
though ;-).

Andy


On Thu, Mar 27, 2014 at 12:49 PM, yana yana.kadiy...@gmail.com wrote:

 Does Shark not suit your needs? That's what we use at the moment, and
 it's been good.








Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
When there is something new, it's also cool to let the imagination fly far
away ;)


On Thu, Mar 27, 2014 at 2:20 PM, andy petrella andy.petre...@gmail.com wrote:

 Yes it could, of course. I didn't say that there is no tool to do it,
 though ;-).

 Andy










Re: Announcing Spark SQL

2014-03-27 Thread Patrick Wendell
Hey Rohit,

I think external tables based on Cassandra or other datastores will work
out-of-the-box if you build Catalyst with Hive support.
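
(For the shape of this, a hedged sketch: the storage handler class in the
DDL below is an assumption about the cassandra-hive handler, not a
verified name, and sc is assumed to be an existing SparkContext.)

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Register a Cassandra-backed external table through Hive's metastore;
    // the handler class name is illustrative only.
    hiveContext.hql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS users (id STRING, name STRING)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    """)

    hiveContext.hql("SELECT name FROM users LIMIT 10").collect().foreach(println)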

Michael may have feelings about this, but I'd guess the longer-term design
for having schema support for Cassandra/HBase etc. likely wouldn't rely on
Hive external tables, because it's an unnecessary layer of indirection.

Spark should be able to directly load a SchemaRDD from Cassandra by just
letting the user give relevant information about the Cassandra schema. And
it should let you write back to Cassandra by giving a mapping of fields to
the respective Cassandra columns. I think all of this would be fairly easy
to implement on SchemaRDD, and it will likely make it into Spark 1.1.
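
A sketch of what that could look like, modeled on the programmatic-schema
(applySchema) API that later shipped in Spark 1.1; the Cassandra reader
below is a placeholder stub, so treat every name as an illustration rather
than a shipped Cassandra integration:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._

    // Placeholder for a real Cassandra reader (e.g. one built on Cassandra's
    // Hadoop input format); stand-in data keeps the sketch self-contained.
    def rowsFromCassandra(sc: SparkContext): RDD[(String, String)] =
      sc.parallelize(Seq(("1", "Alice"), ("2", "Bob")))

    val sqlContext = new SQLContext(sc)

    // The "relevant information about the Cassandra schema", given by the user:
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = false),
      StructField("name", StringType, nullable = true)))

    val rowRDD = rowsFromCassandra(sc).map { case (id, name) => Row(id, name) }
    val users = sqlContext.applySchema(rowRDD, schema)  // SchemaRDD with that schema
    users.registerTempTable("users")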

- Patrick


On Wed, Mar 26, 2014 at 10:59 PM, Rohit Rai ro...@tuplejump.com wrote:

 Great work guys! Have been looking forward to this . . .

 In the blog it mentions support for reading from HBase/Avro... What will
 be the recommended approach for this? Will it be writing custom wrappers
 for SQLContext like in HiveContext or using Hive's EXTERNAL TABLE support?

 I ask this because a few days back (based on your pull request on GitHub)
 I started analyzing what it would take to support Spark SQL on Cassandra.
 One obvious approach would be to use Hive external table support with our
 cassandra-hive handler. But the second approach sounds tempting, as it
 will give more fidelity.

 Regards,
 Rohit

 Founder & CEO, Tuplejump, Inc.
 www.tuplejump.com
 The Data Engineering Platform


 On Thu, Mar 27, 2014 at 9:12 AM, Michael Armbrust 
 mich...@databricks.com wrote:

 Any plans to make the SQL typesafe using something like Slick (
 http://slick.typesafe.com/)


 I would really like to do something like that, and maybe we will in a
 couple of months. However, in the near term, I think the top priorities are
 going to be performance and stability.

 Michael





Re: Announcing Spark SQL

2014-03-26 Thread Nicholas Chammas
This is so, so COOL. YES. I'm excited about using this once I'm a bit more
comfortable with Spark.

Nice work, people!


On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote:

 Hey Everyone,

 This already went out to the dev list, but I wanted to put a pointer here
 as well to a new feature we are pretty excited about for Spark 1.0.


 http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html

 Michael



RE: Announcing Spark SQL

2014-03-26 Thread Bingham, Skyler
Fantastic! Although I think they missed an obvious name choice: SparkQL
(pronounced "sparkle") :)

Skyler

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, March 26, 2014 3:58 PM
To: user@spark.apache.org
Subject: Announcing Spark SQL

Hey Everyone,

This already went out to the dev list, but I wanted to put a pointer here as 
well to a new feature we are pretty excited about for Spark 1.0.

http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html

Michael


Re: Announcing Spark SQL

2014-03-26 Thread Matei Zaharia
Congrats Michael & co for putting this together — this is probably the neatest 
piece of technology added to Spark in the past few months, and it will greatly 
change what users can do as more data sources are added.

Matei


On Mar 26, 2014, at 3:22 PM, Ognen Duzlevski og...@plainvanillagames.com 
wrote:

 Wow!
 Ognen
 
 On 3/26/14, 4:58 PM, Michael Armbrust wrote:
 Hey Everyone,
 
 This already went out to the dev list, but I wanted to put a pointer here as 
 well to a new feature we are pretty excited about for Spark 1.0.
 
 http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
 
 Michael
 



Re: Announcing Spark SQL

2014-03-26 Thread Christopher Nguyen
+1 Michael, Reynold et al. This is key to some of the things we're doing.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao (http://adatao.com)
linkedin.com/in/ctnguyen



On Wed, Mar 26, 2014 at 2:58 PM, Michael Armbrust mich...@databricks.com wrote:

 Hey Everyone,

 This already went out to the dev list, but I wanted to put a pointer here
 as well to a new feature we are pretty excited about for Spark 1.0.


 http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html

 Michael



Re: Announcing Spark SQL

2014-03-26 Thread Soumya Simanta
Very nice.
Any plans to make the SQL typesafe using something like Slick (
http://slick.typesafe.com/)

Thanks !



On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote:

 Hey Everyone,

 This already went out to the dev list, but I wanted to put a pointer here
 as well to a new feature we are pretty excited about for Spark 1.0.


 http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html

 Michael



Re: Announcing Spark SQL

2014-03-26 Thread Michael Armbrust

 Any plans to make the SQL typesafe using something like Slick (
 http://slick.typesafe.com/)


I would really like to do something like that, and maybe we will in a
couple of months. However, in the near term, I think the top priorities are
going to be performance and stability.

Michael