RE: Save an RDD to a SQL Database
I have a similar requirement to export data to MySQL. I just wanted to know what the best approach is so far, after the research you have done. I am currently thinking of saving to HDFS and using Sqoop to handle the export. Is that the best approach, or is there another way to write to MySQL? Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p12921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Save an RDD to a SQL Database
I haven't seen people write directly to a SQL database, mainly because it is difficult to deal with failure. What if the network breaks halfway through the process? Should we drop all the data in the database and restart from the beginning? If the process is appending data to the database, things become even more complex. But if this can be made to work, it would be a very good thing.

On Wed, Aug 6, 2014 at 11:24 PM, Yana yana.kadiy...@gmail.com wrote:
Basically the short story is that you want to open as few connections as possible but write more than one insert at a time.
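The mapPartitions-plus-grouped-iterator pattern Yana refers to can be sketched as follows. This is a minimal sketch of the batching step only: the table name `events` and the two-column schema are made up for illustration, and the JDBC connection handling (one connection per partition, inside `rdd.foreachPartition`) is described in a comment so the grouping logic stands on its own:

```scala
object BatchInsertSketch {
  // Turn an iterator of rows into fixed-size batches, rendering each batch
  // as one multi-row INSERT statement. In a real job this would run inside
  // rdd.foreachPartition { rows => ... }, with a single JDBC connection
  // opened per partition and one statement executed per batch.
  def toBatchedInserts(rows: Iterator[(Int, String)], batchSize: Int): Iterator[String] =
    rows.grouped(batchSize).map { batch =>
      val values = batch.map { case (id, name) => s"($id, '$name')" }.mkString(", ")
      s"INSERT INTO events (id, name) VALUES $values"
    }

  def main(args: Array[String]): Unit = {
    // Three rows with batchSize = 2 yield two statements instead of three
    // round trips -- which is the point of batching.
    val stmts = toBatchedInserts(Iterator((1, "a"), (2, "b"), (3, "c")), 2).toList
    stmts.foreach(println)
  }
}
```

In practice you would use a `PreparedStatement` with `addBatch`/`executeBatch` rather than building SQL strings, but the grouping of the partition's iterator is the same.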
Re: Save an RDD to a SQL Database
Maybe a little off topic, but would you mind sharing your motivation for saving the RDD into a SQL DB? If you're just trying to do further transformations/queries with SQL for convenience, then you may just use Spark SQL directly within your Spark application without saving anything into a DB:

val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
import sqlContext._

// First create a case class to describe your schema
case class Record(fieldA: T1, fieldB: T2, ...)

// Transform RDD elements to Records and register them as a SQL table
rdd.map(...).registerAsTable("myTable")

// Torture them until they tell you the truth :)
sql("SELECT fieldA FROM myTable WHERE fieldB > 10")

On Aug 6, 2014, at 11:29 AM, Vida Ha vid...@gmail.com wrote:
Hi, I would like to save an RDD to a SQL database. It seems like this would be a common enough use case. Are there any built-in libraries to do it? Otherwise, I'm just planning on mapping my RDD and having that call a method to write to the database. Given that a lot of records are going to be written, the code would need to be smart and do a batch insert after enough records have collected. Does that sound like a reasonable approach? -Vida
Re: Save an RDD to a SQL Database
On Thu, Aug 7, 2014 at 11:08 AM, 诺铁 noty...@gmail.com wrote:
What if the network breaks halfway through the process? Should we drop all the data in the database and restart from the beginning?

The best way to deal with this -- which, unfortunately, is not commonly supported -- is with a two-phase commit that can span connections (http://stackoverflow.com/q/23354034/877069). PostgreSQL supports it, for example. This would guarantee that a multi-connection data load is atomic.

Nick
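For reference, a sketch of what that looks like in PostgreSQL, which requires `max_prepared_transactions` to be set above zero on the server; the table name, row values, and transaction identifier below are made up for illustration:

```sql
-- On each connection: do that partition's work, then PREPARE instead of COMMIT.
BEGIN;
INSERT INTO results VALUES (1, 'a');
PREPARE TRANSACTION 'load_part_0';

-- Coordinator: once every connection has prepared successfully, finalize
-- each prepared transaction (prepared transactions survive a crash):
COMMIT PREPARED 'load_part_0';

-- If any connection failed to prepare, roll all of them back instead:
-- ROLLBACK PREPARED 'load_part_0';
```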
Re: Save an RDD to a SQL Database
On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian lian.cs@gmail.com wrote:
Maybe a little off topic, but would you mind sharing your motivation for saving the RDD into a SQL DB?

Many possible reasons (Vida, please chime in with yours!):

- You have an existing database you want to load new data into, so everything's together.
- You want very low query latency, which you can probably get with Spark SQL, but currently not with the ease you can get it from your average DBMS.
- Tooling around traditional DBMSs is currently much more mature than tooling around Spark SQL, especially in the JDBC area.

Nick
Re: Save an RDD to a SQL Database
Right, Spark acts more like an OLAP system; I believe no one will use Spark for OLTP. So there is always the question of how to share data between these two platforms efficiently. More importantly, most enterprise BI tools rely on an RDBMS, or at least a JDBC/ODBC interface.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p11672.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Save an RDD to a SQL Database
The use case I was thinking of was outputting calculations made in Spark into a SQL database for the presentation layer to access. In other words, having a Spark backend in Java that writes to a SQL database, and then having a Rails front-end that can display the data nicely.
Re: Save an RDD to a SQL Database
Vida,

What kind of database are you trying to write to? For example, I found that for loading into Redshift, by far the easiest thing to do was to save my output from Spark as a CSV to S3, and then load it from there into Redshift. This is not as slow as you might think, because Spark can write the output in parallel to S3, and Redshift, too, can load data from multiple files in parallel (http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html).

Nick
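The load side of that Redshift approach is a single COPY with a key prefix, which Redshift parallelizes across all the CSV part files Spark wrote; the table name, bucket, and credentials below are placeholders:

```sql
-- Loads every object whose key starts with the given prefix,
-- i.e. all of Spark's part-NNNNN output files.
COPY results
FROM 's3://my-bucket/spark-output/part-'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV;
```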
Re: Save an RDD to a SQL Database
Isn't sqoop export meant for that? http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1
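A sqoop export invocation for this would look roughly like the following; the connection string, paths, and table name are placeholders, and the field terminator must match whatever delimiter Spark used when writing the files:

```shell
sqoop export \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser \
  --password-file /user/me/db.password \
  --table results \
  --export-dir /user/me/spark-output \
  --input-fields-terminated-by ','
```

Note that the target table must already exist in MySQL; sqoop export does not create it.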
Re: Save an RDD to a SQL Database
That's a good idea -- to write to files first and then load. Thanks.

On Thu, Aug 7, 2014 at 11:26 AM, Flavio Pompermaier pomperma...@okkam.it wrote:
Isn't sqoop export meant for that? http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1
RE: Save an RDD to a SQL Database
Depending on what you mean by save, you might be able to use the Twitter Storehaus package to do this. There was a nice talk about this at a Spark meetup -- "Stores, Monoids and Dependency Injection - Abstractions for Spark Streaming Jobs". Video here: https://www.youtube.com/watch?v=C7gWtxelYNM&feature=youtu.be

Jim Donahue
Adobe

-----Original Message-----
From: Ron Gonzalez [mailto:zlgonza...@yahoo.com.INVALID]
Sent: Wednesday, August 06, 2014 7:18 AM
To: Vida Ha
Cc: u...@spark.incubator.apache.org
Subject: Re: Save an RDD to a SQL Database
Re: Save an RDD to a SQL Database
Hi Vida,

It's possible to save an RDD as a Hadoop file using Hadoop output formats. It might be worthwhile to investigate using DBOutputFormat and see if this will work for you. I haven't personally written to a DB, but I'd imagine this would be one way to do it.

Thanks,
Ron

Sent from my iPhone
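A rough sketch of the DBOutputFormat route Ron mentions, using the old mapred API via saveAsHadoopDataset. This is untested: the driver class, connection details, table, and column names are placeholders, and `rdd` is assumed to be an existing RDD[(Int, String)]:

```scala
import java.io.{DataInput, DataOutput}
import java.sql.{PreparedStatement, ResultSet}
import org.apache.hadoop.io.{NullWritable, Writable}
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapred.lib.db.{DBConfiguration, DBOutputFormat, DBWritable}

// DBOutputFormat writes the *key* of each pair, so the record class must
// implement DBWritable (for the JDBC side) and Writable (for Hadoop).
class EventRecord(var id: Int, var name: String) extends DBWritable with Writable {
  def this() = this(0, "")
  override def write(s: PreparedStatement): Unit = { s.setInt(1, id); s.setString(2, name) }
  override def readFields(r: ResultSet): Unit = { id = r.getInt(1); name = r.getString(2) }
  override def write(out: DataOutput): Unit = { out.writeInt(id); out.writeUTF(name) }
  override def readFields(in: DataInput): Unit = { id = in.readInt(); name = in.readUTF() }
}

// Placeholder driver, URL, credentials, table, and columns.
val conf = new JobConf()
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
  "jdbc:mysql://dbhost/mydb", "myuser", "mypassword")
DBOutputFormat.setOutput(conf, "events", "id", "name")
conf.setOutputKeyClass(classOf[EventRecord])
conf.setOutputValueClass(classOf[NullWritable])

// Assumed: rdd is an RDD[(Int, String)] already computed elsewhere.
rdd.map { case (id, name) => (new EventRecord(id, name), NullWritable.get()) }
   .saveAsHadoopDataset(conf)
```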
Re: Save an RDD to a SQL Database
Hi Vida,

I am writing to a DB -- or trying to :). I believe the best practice for this (you can search the mailing list archives) is to do a combination of mapPartitions and use a grouped iterator. Look at this thread, esp. the comment from A. Boisvert and Matei's comment above it: https://groups.google.com/forum/#!topic/spark-users/LUb7ZysYp2k

Basically the short story is that you want to open as few connections as possible but write more than one insert at a time.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Save-an-RDD-to-a-SQL-Database-tp11516p11549.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.