Which is the best way to get a connection to an external database per task in Spark Streaming?

Juan Rodríguez Hortalá Mon, 07 Jul 2014 11:41:37 -0700

Hi list,

I'm writing a Spark Streaming program that reads from a kafka topic,
performs some transformations on the data, and then inserts each record in
a database with foreachRDD. I was wondering which is the best way to handle
the connection to the database so each worker, or even each task, uses a
different connection to the database, and then database inserts/updates
would be performed in parallel.
- I understand that using a final variable in the driver code is not a good
idea because then the communication with the database would be performed in
the driver code, which leads to a bottleneck, according to
http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/
- I think creating a new connection in the call() method of the Function
passed to foreachRDD is also a bad idea, because then I wouldn't be reusing
the connection to the database for each batch RDD in the DStream
- I'm not sure that a broadcast variable with the connection handler is a
good idea in case the target database is distributed, because if the same
handler is used for all the nodes of the Spark cluster then than could have
a negative effect in the data locality of the connection to the database.
- From
http://apache-spark-user-list.1001560.n3.nabble.com/Database-connection-per-worker-td1280.html
I understand that by using an static variable and referencing it in the
call() method of the Function passed to foreachRDD we get a different
connection per Spark worker, I guess it's because there is a different JVM
per worker. But then all the tasks in the same worker would share the same
database handler object, am I right?
- Another idea is using updateStateByKey() using the database handler as
the state, but I guess that would only work for Serializable database
handlers, and for example not for an org.apache.hadoop.hbase.client.HTable
object.


So my question is, which is the best way to get a connection to an external
database per task in Spark Streaming? Or at least per worker. In
http://apache-spark-user-list.1001560.n3.nabble.com/Connecting-to-an-inmemory-database-from-Spark-td1343.html
there is a partial solution to this question, but there the database
handler object is missing. This other question
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Shared-hashmaps-td3247.html
is closer to mine, but there is no answer for it yet

Thanks in advance,

Greetings,

Juan

Which is the best way to get a connection to an external database per task in Spark Streaming?

Reply via email to