Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-09 Thread Juan Rodríguez Hortalá
Hi Jerry, it's all clear to me now. I will try something like Apache DBCP for the connection pool. Thanks a lot for your help! 2014-07-09 3:08 GMT+02:00 Shao, Saisai saisai.s...@intel.com: Yes, using a static class member would be the Java equivalent, but you should carefully

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Juan Rodríguez Hortalá
Hi Tobias, thanks for your help. I understand that with that code we obtain a database connection per partition, but I also suspect that a new database connection is created on every execution of the function passed to mapPartitions(). That would be very inefficient

RE: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Shao, Saisai
I think you can maintain a connection pool or keep the connection as a long-lived object on the executor side (for example, by lazily creating a singleton inside object { } in Scala), so each task reuses that connection when it runs instead of creating a new one. That would be good for your
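The connection pool idea above can be sketched in plain Java, the language the original poster is working in. This is only a minimal illustration under stated assumptions: `PooledConn` and `SimplePool` are hypothetical names, `PooledConn` stands in for a real connection type such as `java.sql.Connection`, and a production setup would more likely rely on an existing pooling library than on a hand-rolled queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a real database connection (e.g. java.sql.Connection).
class PooledConn {
    // Counts how many connections were ever created, for illustration only.
    static final AtomicInteger created = new AtomicInteger();
    final int id = created.incrementAndGet();

    void insert(String record) {
        // A real connection would execute an INSERT statement here.
    }
}

// Minimal fixed-size pool: connections are created once and reused by tasks.
class SimplePool {
    private final BlockingQueue<PooledConn> idle;

    SimplePool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(new PooledConn());
        }
    }

    // Blocks until a connection is free, then hands it to the caller.
    PooledConn borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
    }

    // Returns a borrowed connection so that later tasks can reuse it.
    void giveBack(PooledConn conn) {
        idle.add(conn);
    }

    int idleCount() {
        return idle.size();
    }
}
```

A task would call `borrow()`, use the connection, and call `giveBack()` in a `finally` block so the connection returns to the pool even if the write fails.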

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Juan Rodríguez Hortalá
Hi Jerry, thanks for your answer. I'm using Spark Streaming for Java, and I only have rudimentary knowledge of Scala. How could I recreate in Java the lazy creation of a singleton object that you propose for Scala? Maybe a static class member in Java for the connection would be the solution?

RE: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Shao, Saisai
Yes, using a static class member would be the Java equivalent, but you should program carefully to prevent resource leaks. A good choice is a third-party DB connection library that supports connection pooling, which will reduce your programming effort. Thanks Jerry From: Juan
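A lazily initialized static member in Java might look like the sketch below. It assumes a hypothetical `SharedConnection` class in place of a real connection type, and `ConnectionHolder` is likewise an illustrative name rather than anything from Spark or a database driver. Because static fields live once per JVM, each executor process would end up with its own instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a real connection class; not part of any library.
class SharedConnection implements AutoCloseable {
    // Counts how many connections were opened, for illustration only.
    static final AtomicInteger opened = new AtomicInteger();
    private volatile boolean closed = false;

    SharedConnection() {
        opened.incrementAndGet();
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Holds one connection per JVM, created on first use rather than at class load.
final class ConnectionHolder {
    private static SharedConnection instance; // guarded by the class lock

    private ConnectionHolder() {}

    static synchronized SharedConnection get() {
        if (instance == null || instance.isClosed()) {
            instance = new SharedConnection();
            final SharedConnection toClose = instance;
            // Close the connection at JVM exit to limit resource leakage.
            Runtime.getRuntime().addShutdownHook(new Thread(toClose::close));
        }
        return instance;
    }
}
```

Tasks would call `ConnectionHolder.get()` inside the function that runs on the executor, so the connection object itself is never serialized from the driver. The `synchronized` accessor is a simple choice for the case where several tasks share one JVM; whether a single shared connection is safe for concurrent use depends on the actual driver.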

Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Juan Rodríguez Hortalá
Hi list, I'm writing a Spark Streaming program that reads from a Kafka topic, performs some transformations on the data, and then inserts each record into a database with foreachRDD. I was wondering which is the best way to handle the connection to the database so each worker, or even each task,

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Tobias Pfeiffer
Juan, I am doing something similar, except that instead of inserting into a SQL database I issue an RPC call. I think mapPartitions() may be helpful to you. You could do something like dstream.mapPartitions(iter => { val db = new DbConnection() // maybe only do the above if !iter.isEmpty iter.map(item =
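The one-connection-per-partition idea can be illustrated in plain Java without any Spark dependency. In this sketch `DbConnection` mirrors the name used in the snippet above but is a hypothetical class, as are `PartitionWriter`, `processPartition`, and `send`. In a real job, the body of `processPartition` would be the logic inside the function passed to a per-partition operation such as `mapPartitions()`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical connection class; the counters exist only to make the behavior visible.
class DbConnection implements AutoCloseable {
    static final AtomicInteger opened = new AtomicInteger();
    static final AtomicInteger closedCount = new AtomicInteger();

    DbConnection() {
        opened.incrementAndGet();
    }

    // Stand-in for an insert or RPC call made over the connection.
    String send(String item) {
        return "sent:" + item;
    }

    @Override
    public void close() {
        closedCount.incrementAndGet();
    }
}

class PartitionWriter {
    // Handles every item of one partition with a single connection.
    static List<String> processPartition(Iterator<String> partition) {
        List<String> results = new ArrayList<>();
        if (!partition.hasNext()) {
            return results; // empty partition: do not open a connection at all
        }
        try (DbConnection db = new DbConnection()) {
            // Consume the iterator eagerly so every item is sent before close() runs.
            while (partition.hasNext()) {
                results.add(db.send(partition.next()));
            }
        }
        return results;
    }
}
```

The eager loop is a deliberate choice: if the items were mapped lazily and the connection closed right away, the writes would be attempted on a closed connection once the iterator was finally consumed. Note that this pattern still opens a connection each time the function runs for a partition, so it bounds the cost at one connection per partition per batch rather than one per record.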