RE: Save an RDD to a SQL Database

2014-08-27 Thread bdev
I have a similar requirement to export data to MySQL. I just wanted to know
what the best approach is, given the research you have done so far. I'm
currently thinking of saving to HDFS and using Sqoop to handle the export. Is
that the best approach, or is there a better way to write to MySQL? Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p12921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Save an RDD to a SQL Database

2014-08-07 Thread 诺铁
I haven't seen people write directly to a SQL database, mainly because it's
difficult to deal with failure. What if the network breaks halfway through
the process? Should we drop all the data in the database and restart from
the beginning? And if the process is appending data to the database, things
become even more complex.

But if this can be made to work, it would be a very good thing.


On Wed, Aug 6, 2014 at 11:24 PM, Yana yana.kadiy...@gmail.com wrote:

 Hi Vida,

 I am writing to a DB -- or trying to :).

 I believe the best practice for this (you can search the mailing list
 archives) is to do a combination of mapPartitions and use a grouped
 iterator.
 Look at this thread, esp. the comment from A. Boisvert and Matei's comment
 above it:
 https://groups.google.com/forum/#!topic/spark-users/LUb7ZysYp2k

 Basically the short story is that you want to open as few connections as
 possible but write more than 1 insert at a time.









Re: Save an RDD to a SQL Database

2014-08-07 Thread Cheng Lian
Maybe a little off topic, but would you mind sharing your motivation for saving 
the RDD into a SQL DB?

If you’re just trying to do further transformations/queries with SQL for 
convenience, then you may just use Spark SQL directly within your Spark 
application, without saving the data into a DB:

  val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
  import sqlContext._

  // First create a case class to describe your schema
  case class Record(fieldA: T1, fieldB: T2, …)

  // Transform RDD elements to Records and register it as a SQL table
  rdd.map(…).registerAsTable("myTable")

  // Torture them until they tell you the truth :)
  sql("SELECT fieldA FROM myTable WHERE fieldB > 10")

On Aug 6, 2014, at 11:29 AM, Vida Ha vid...@gmail.com wrote:

 
 Hi,
 
 I would like to save an RDD to a SQL database.  It seems like this would be a 
 common enough use case.  Are there any built in libraries to do it?
 
 Otherwise, I'm just planning on mapping my RDD, and having that call a method 
 to write to the database.   Given that a lot of records are going to be 
 written, the code would need to be smart and do a batch insert after enough 
 records have collected.  Does that sound like a reasonable approach?
 
 
 -Vida
 





Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:08 AM, 诺铁 noty...@gmail.com wrote:

 what if the network breaks halfway through the process? Should we drop all the
 data in the database and restart from the beginning?


The best way to deal with this -- which, unfortunately, is not commonly
supported -- is with a two-phase commit that can span connections
http://stackoverflow.com/q/23354034/877069. PostgreSQL supports it, for
example.

This would guarantee that a multi-connection data load is atomic.

Nick


Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian lian.cs@gmail.com wrote:

 Maybe a little off topic, but would you mind sharing your motivation for
 saving the RDD into a SQL DB?


Many possible reasons (Vida, please chime in with yours!):

   - You have an existing database you want to load new data into so
   everything's together.
   - You want very low query latency, which you can probably get with Spark
   SQL but currently not with the ease you can get it from your average DBMS.
   - Tooling around traditional DBMSs is currently much more mature than
   tooling around Spark SQL, especially in the JDBC area.

Nick


Re: Save an RDD to a SQL Database

2014-08-07 Thread chutium
Right, Spark acts more like an OLAP system; I believe no one will use Spark
as an OLTP system, so there is always the question of how to share data
between these two platforms efficiently.

And more importantly, most enterprise BI tools rely on an RDBMS, or at least
a JDBC/ODBC interface.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p11672.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
The use case I was thinking of was outputting calculations made in Spark
into a SQL database for the presentation layer to access.  So in other
words, having a Spark backend in Java that writes to a SQL database and
then having a Rails front-end that can display the data nicely.


On Thu, Aug 7, 2014 at 8:42 AM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian lian.cs@gmail.com wrote:

 Maybe a little off topic, but would you mind sharing your motivation for
 saving the RDD into a SQL DB?


 Many possible reasons (Vida, please chime in with yours!):

- You have an existing database you want to load new data into so
everything's together.
- You want very low query latency, which you can probably get with
Spark SQL but currently not with the ease you can get it from your average
DBMS.
- Tooling around traditional DBMSs is currently much more mature than
tooling around Spark SQL, especially in the JDBC area.

 Nick



Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
Vida,

What kind of database are you trying to write to?

For example, I found that for loading into Redshift, by far the easiest
thing to do was to save my output from Spark as a CSV to S3, and then load
it from there into Redshift. This is not as slow as you think, because Spark
can write the output in parallel to S3, and Redshift, too, can load data
from multiple files in parallel
http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html
.
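Concretely, the save-to-S3-then-COPY flow might look like the sketch below. The bucket, table, and column names are hypothetical, and the CSV-quoting helper is my own illustration, not something from this thread:

```scala
// Format one record as an RFC-4180-style CSV line: every field is quoted
// and embedded quotes are doubled, so commas and quotes in the data are safe.
def toCsvLine(fields: Seq[Any]): String =
  fields.map(f => "\"" + f.toString.replace("\"", "\"\"") + "\"").mkString(",")

// In the Spark job, each partition becomes one part-file under the prefix:
//   rdd.map(r => toCsvLine(Seq(r.id, r.name))).saveAsTextFile("s3n://my-bucket/export/")

// Then in Redshift, COPY loads every part-file under that prefix in parallel:
//   COPY my_table FROM 's3://my-bucket/export/'
//   CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
//   CSV;

println(toCsvLine(Seq(1, "O'Brien, Bob")))  // "1","O'Brien, Bob"
```

One part-file per partition is exactly what makes the Redshift side parallel: a COPY pointed at the prefix splits the load across all the files.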

Nick


On Thu, Aug 7, 2014 at 1:52 PM, Vida Ha v...@databricks.com wrote:

 The use case I was thinking of was outputting calculations made in Spark
 into a SQL database for the presentation layer to access.  So in other
 words, having a Spark backend in Java that writes to a SQL database and
 then having a Rails front-end that can display the data nicely.


 On Thu, Aug 7, 2014 at 8:42 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian lian.cs@gmail.com
 wrote:

 Maybe a little off topic, but would you mind sharing your motivation for
 saving the RDD into a SQL DB?


 Many possible reasons (Vida, please chime in with yours!):

- You have an existing database you want to load new data into so
everything's together.
- You want very low query latency, which you can probably get with
Spark SQL but currently not with the ease you can get it from your average
DBMS.
- Tooling around traditional DBMSs is currently much more mature than
tooling around Spark SQL, especially in the JDBC area.

 Nick





Re: Save an RDD to a SQL Database

2014-08-07 Thread Flavio Pompermaier
Isn't sqoop export meant for that?

http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1
On Aug 7, 2014 7:59 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 Vida,

 What kind of database are you trying to write to?

 For example, I found that for loading into Redshift, by far the easiest
 thing to do was to save my output from Spark as a CSV to S3, and then load
 it from there into Redshift. This is not as slow as you think, because Spark
 can write the output in parallel to S3, and Redshift, too, can load data
 from multiple files in parallel
 http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html
 .

 Nick


 On Thu, Aug 7, 2014 at 1:52 PM, Vida Ha v...@databricks.com wrote:

 The use case I was thinking of was outputting calculations made in Spark
 into a SQL database for the presentation layer to access.  So in other
 words, having a Spark backend in Java that writes to a SQL database and
 then having a Rails front-end that can display the data nicely.


 On Thu, Aug 7, 2014 at 8:42 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian lian.cs@gmail.com
 wrote:

 Maybe a little off topic, but would you mind sharing your motivation
 for saving the RDD into a SQL DB?


 Many possible reasons (Vida, please chime in with yours!):

- You have an existing database you want to load new data into so
everything's together.
- You want very low query latency, which you can probably get with
Spark SQL but currently not with the ease you can get it from your 
 average
DBMS.
- Tooling around traditional DBMSs is currently much more mature
than tooling around Spark SQL, especially in the JDBC area.

 Nick






Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
That's a good idea - to write to files first and then load.   Thanks.


On Thu, Aug 7, 2014 at 11:26 AM, Flavio Pompermaier pomperma...@okkam.it
wrote:

 Isn't sqoop export meant for that?


 http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1
 On Aug 7, 2014 7:59 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Vida,

 What kind of database are you trying to write to?

 For example, I found that for loading into Redshift, by far the easiest
 thing to do was to save my output from Spark as a CSV to S3, and then load
 it from there into Redshift. This is not as slow as you think, because Spark
 can write the output in parallel to S3, and Redshift, too, can load data
 from multiple files in parallel
 http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html
 .

 Nick


 On Thu, Aug 7, 2014 at 1:52 PM, Vida Ha v...@databricks.com wrote:

 The use case I was thinking of was outputting calculations made in Spark
 into a SQL database for the presentation layer to access.  So in other
 words, having a Spark backend in Java that writes to a SQL database and
 then having a Rails front-end that can display the data nicely.


 On Thu, Aug 7, 2014 at 8:42 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian lian.cs@gmail.com
 wrote:

 Maybe a little off topic, but would you mind sharing your motivation
 for saving the RDD into a SQL DB?


 Many possible reasons (Vida, please chime in with yours!):

- You have an existing database you want to load new data into so
everything's together.
- You want very low query latency, which you can probably get with
Spark SQL but currently not with the ease you can get it from your 
 average
DBMS.
- Tooling around traditional DBMSs is currently much more mature
than tooling around Spark SQL, especially in the JDBC area.

 Nick






RE: Save an RDD to a SQL Database

2014-08-07 Thread Jim Donahue
Depending on what you mean by "save", you might be able to use the Twitter 
Storehaus package to do this.  There was a nice talk about this at a Spark 
meetup -- "Stores, Monoids and Dependency Injection - Abstractions for Spark 
Streaming Jobs".  Video here: 
https://www.youtube.com/watch?v=C7gWtxelYNM&feature=youtu.be.


Jim Donahue
Adobe

-Original Message-
From: Ron Gonzalez [mailto:zlgonza...@yahoo.com.INVALID] 
Sent: Wednesday, August 06, 2014 7:18 AM
To: Vida Ha
Cc: u...@spark.incubator.apache.org
Subject: Re: Save an RDD to a SQL Database

Hi Vida,
  It's possible to save an RDD as a Hadoop file using Hadoop output formats. It 
might be worthwhile to investigate using DBOutputFormat and see if this will 
work for you.
  I haven't personally written to a DB, but I'd imagine this would be one way 
to do it.

Thanks,
Ron

Sent from my iPhone

 On Aug 5, 2014, at 8:29 PM, Vida Ha vid...@gmail.com wrote:
 
 
 Hi,
 
 I would like to save an RDD to a SQL database.  It seems like this would be a 
 common enough use case.  Are there any built in libraries to do it?
 
 Otherwise, I'm just planning on mapping my RDD, and having that call a method 
 to write to the database.   Given that a lot of records are going to be 
 written, the code would need to be smart and do a batch insert after enough 
 records have collected.  Does that sound like a reasonable approach?
 
 
 -Vida
 






Re: Save an RDD to a SQL Database

2014-08-06 Thread Ron Gonzalez
Hi Vida,
  It's possible to save an RDD as a Hadoop file using Hadoop output formats. It 
might be worthwhile to investigate using DBOutputFormat and see if this will 
work for you.
  I haven't personally written to a DB, but I'd imagine this would be one way 
to do it.

Thanks,
Ron

Sent from my iPhone

 On Aug 5, 2014, at 8:29 PM, Vida Ha vid...@gmail.com wrote:
 
 
 Hi,
 
 I would like to save an RDD to a SQL database.  It seems like this would be a 
 common enough use case.  Are there any built in libraries to do it?
 
 Otherwise, I'm just planning on mapping my RDD, and having that call a method 
 to write to the database.   Given that a lot of records are going to be 
 written, the code would need to be smart and do a batch insert after enough 
 records have collected.  Does that sound like a reasonable approach?
 
 
 -Vida
 




Re: Save an RDD to a SQL Database

2014-08-06 Thread Yana
Hi Vida,

I am writing to a DB -- or trying to :).

I believe the best practice for this (you can search the mailing list
archives) is to do a combination of mapPartitions and use a grouped
iterator.
Look at this thread, esp. the comment from A. Boisvert and Matei's comment
above it:
https://groups.google.com/forum/#!topic/spark-users/LUb7ZysYp2k

Basically the short story is that you want to open as few connections as
possible but write more than 1 insert at a time.
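
A minimal sketch of that pattern, with the database call stubbed out as a
`flush` hook (the batch size and row handling are my own placeholders; a real
version would open one JDBC connection per partition inside
`rdd.foreachPartition` and have `flush` call `addBatch`/`executeBatch` on a
`PreparedStatement`):

```scala
// Group a partition's rows into fixed-size batches so each database round
// trip carries many inserts instead of one. Returns the number of batches.
def insertInBatches[T](rows: Iterator[T], batchSize: Int)(flush: Seq[T] => Unit): Int = {
  var batchCount = 0
  rows.grouped(batchSize).foreach { batch =>
    flush(batch)  // real version: addBatch() each row, then executeBatch()
    batchCount += 1
  }
  batchCount
}

// 10 rows with a batch size of 4 -> 3 flushes of sizes 4, 4, 2.
val sizes = scala.collection.mutable.ArrayBuffer.empty[Int]
val batches = insertInBatches((1 to 10).iterator, 4)(b => sizes += b.size)
println(batches)       // 3
println(sizes.toList)  // List(4, 4, 2)
```

This keeps connections to a minimum (one per partition) while amortizing the
per-insert network round trips.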





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p11549.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
