Re: unserializable object in Spark Streaming context

2014-07-18 Thread Gino Bustelo
I get TD's recommendation of sharing a connection among tasks. Now, is there a good way to determine when to close connections? Gino B.

Re: unserializable object in Spark Streaming context

2014-07-18 Thread Tathagata Das
That's a good question. My first reaction is a timeout. Timing out after tens of seconds should be sufficient. So there should be a timer in the singleton that runs a check every second on when the singleton was last used, and closes the connection after a timeout. Any attempt to use the connection...
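A minimal sketch of that timer idea (the object name, JDBC URL, and 30-second timeout are illustrative assumptions, not from the thread):

import java.sql.{Connection, DriverManager}
import java.util.concurrent.{Executors, TimeUnit}

object ConnectionHolder {
  private val idleTimeoutMs = 30 * 1000L  // "tens of seconds"
  private var connection: Connection = null
  private var lastUsed = 0L

  // Timer that checks every second when the singleton was last used
  // and closes the connection once it has been idle past the timeout.
  private val reaper = Executors.newSingleThreadScheduledExecutor()
  reaper.scheduleAtFixedRate(new Runnable {
    def run(): Unit = ConnectionHolder.synchronized {
      if (connection != null && System.currentTimeMillis() - lastUsed > idleTimeoutMs) {
        connection.close()
        connection = null
      }
    }
  }, 1, 1, TimeUnit.SECONDS)

  // Any attempt to use the connection after it was closed simply re-opens it.
  def get(): Connection = synchronized {
    if (connection == null) {
      connection = DriverManager.getConnection("jdbc:...")  // placeholder URL
    }
    lastUsed = System.currentTimeMillis()
    connection
  }
}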

Re: unserializable object in Spark Streaming context

2014-07-18 Thread Tathagata Das
Actually, let me clarify further. There are a number of possibilities. 1. The easier, less efficient way is to create a connection object every time you do foreachPartition (as shown in the pseudocode earlier in the thread, and sketched below). For each partition, you create a connection, use it to push all the...
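Roughly, option 1 looks like this; a sketch assuming a DStream named dstream, with the table and URL as illustrative placeholders:

import java.sql.DriverManager

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { iterator =>
    // One connection per partition, created on the executor, so
    // nothing has to be serialized and shipped from the driver.
    val connection = DriverManager.getConnection("jdbc:...")  // placeholder URL
    try {
      iterator.foreach { record =>
        val stmt = connection.prepareStatement("INSERT INTO events VALUES (?)")
        stmt.setString(1, record.toString)
        stmt.executeUpdate()
        stmt.close()
      }
    } finally {
      connection.close()  // closed deterministically once the partition is done
    }
  }
}

The cost is one connection setup and teardown per partition per batch, which is what makes this the less efficient option.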

unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi guys, I need some help with this problem. In our use case, we need to continuously insert values into the database, so our approach is to create the JDBC object in the main method and then do the inserting operation in the DStream foreachRDD operation. Is this approach reasonable? Then the...

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Marcelo Vanzin
Could you share some code (or pseudo-code)? It sounds like you're instantiating the JDBC connection in the driver and using it inside a closure that would be run in a remote executor. That means the connection object would need to be serializable. If that sounds like what you're doing, it...
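In code, the failing pattern would look roughly like this (a sketch, assuming a DStream named dstream; the URL and SQL are illustrative):

import java.sql.DriverManager

// Created once, on the driver...
val connection = DriverManager.getConnection("jdbc:...")

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // ...but this closure runs on remote executors, so Spark has to
    // serialize everything it captures, including `connection`.
    // java.sql.Connection is not serializable, hence the exception.
    val stmt = connection.prepareStatement("INSERT INTO events VALUES (?)")
    stmt.setString(1, record.toString)
    stmt.executeUpdate()
    stmt.close()
  }
}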

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Tathagata Das
And if Marcelo's guess is correct, then the right way to do this would be to lazily / dynamically create the JDBC connection as a singleton in the workers/executors and use that. Something like this: dstream.foreachRDD(rdd => { rdd.foreachPartition((iterator: Iterator[...]) => { ...
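Filled out, that singleton pattern might look like the following; the object name and connection details are assumptions for illustration:

import java.sql.{Connection, DriverManager}

// Lives in each executor's JVM: created lazily on first use there,
// never serialized or shipped from the driver.
object ConnectionPool {
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:...")  // placeholder URL
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { iterator =>
    val connection = ConnectionPool.connection  // reused across partitions and batches
    iterator.foreach { record =>
      val stmt = connection.prepareStatement("INSERT INTO events VALUES (?)")
      stmt.setString(1, record.toString)
      stmt.executeUpdate()
      stmt.close()
    }
  }
}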

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi Marcelo and TD, thank you for the help. If I use TD's approach, it works and there is no exception. The only drawback is that it will create many connections to the DB, which I was trying to avoid. Here is a snapshot of my code, with the important parts marked in red. What I was thinking is that, if...

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi Sean, thank you. I see your point. What I was thinking was to do the computation in a distributed fashion and do the storing from a single place. But you are right, having multiple DB connections is actually fine. Thanks for answering my questions; that helps me understand the system. Cheers,...