Hi Sean,

Thank you. I see your point. What I was thinking was to do the computation in a distributed fashion and do the storing from a single place. But you are right, having multiple DB connections is actually fine.
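For the archive, here is a minimal sketch of the "one connection per partition" pattern being discussed. This is not actual Spark code: `foreach_partition` is a hypothetical stand-in for `RDD.foreachPartition`, and sqlite3 stands in for the real DB, just to show the shape of the fix — the connection is created inside the function each worker runs, so it is never captured in a serialized closure.

```python
import os
import sqlite3
import tempfile

def save_partition(records, db_path):
    # The connection is created *inside* the function that runs on the
    # worker, so no connection object is captured in the closure and
    # nothing unserializable needs to be shipped from the driver.
    conn = sqlite3.connect(db_path)
    with conn:  # commits the transaction on success
        conn.executemany("INSERT INTO events(value) VALUES (?)",
                         [(r,) for r in records])
    conn.close()

def foreach_partition(partitions, db_path):
    # Stand-in for RDD.foreachPartition: one call (and one connection)
    # per partition -- not one per record, and not one shared by all.
    for part in partitions:
        save_partition(part, db_path)
```

Limiting the number of partitions, as Sean suggests, bounds the number of simultaneous connections without sharing any single one across workers.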
Thanks for answering my questions. That helps me understand the system.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108

On Thu, Jul 17, 2014 at 2:53 PM, Sean Owen <so...@cloudera.com> wrote:
> On Thu, Jul 17, 2014 at 10:39 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > Thank you for the help. If I use TD's approach, it works and there is no
> > exception. The only drawback is that it will create many connections to
> > the DB, which I was trying to avoid.
>
> Connection-like objects aren't data that can be serialized. What would
> it mean to share one connection with N workers? That they all connect
> back to the driver, and go through one DB connection there? This defeats
> the purpose of distributed computing. You want multiple DB
> connections. You can limit the number of partitions if needed.
>
> > Here is a snapshot of my code; the important code is marked in red. What
> > I was thinking is that, if I call the collect() method, Spark Streaming
> > will bring the data to the driver and then the db object does not need
> > to be sent
>
> The Function you pass to foreachRDD() has a reference to db though.
> That's what is making it be serialized.
>
> > to executors. My observation is that, though exceptions are thrown, the
> > insert function still works. Any thought about that? I also pasted the
> > log in case it helps: http://pastebin.com/T1bYvLWB
>
> Any executors that run locally might skip the serialization and
> succeed (?) but I don't think the remote executors can be succeeding.
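Sean's point that connection-like objects aren't serializable can be seen even outside Spark. The sketch below uses Python's pickle as a stand-in for the Java/Kryo serialization Spark actually performs; a closure that references a connection (like a Function passed to foreachRDD referencing db) can't be shipped, because the connection itself refuses to serialize.

```python
import pickle
import sqlite3

db = sqlite3.connect(":memory:")

def insert_row(row):
    # This closure references db, just like a Function passed to
    # foreachRDD that references a driver-side connection object:
    # shipping it to executors would require serializing db too.
    db.execute("INSERT INTO t VALUES (?)", (row,))

def is_picklable(obj):
    # Report whether obj survives serialization.
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False
```

Plain data pickles fine; the live connection does not, which is why the remote executors throw while anything running locally in the driver JVM may still sneak through.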