Hi Sean,

Thank you. I see your point. What I was thinking was to do the computation
in a distributed fashion and do the storing from a single place. But you
are right, having multiple DB connections is actually fine.
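
To make that concrete: the single-writer version I had in mind would look
roughly like the sketch below. It is simplified and hypothetical, not my
real code: the stream variable, the h2 URL, the events table, and the
insert() helper are all placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

stream.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) throws Exception {
    // collect() pulls the whole batch to the driver; the connection is
    // created inside call() instead of being captured from the
    // enclosing class, so the closure references nothing
    // non-serializable
    Connection conn = DriverManager.getConnection("jdbc:h2:mem:test");
    try {
      for (String record : rdd.collect()) {
        insert(conn, record);
      }
    } finally {
      conn.close();
    }
    return null;
  }
});

// placeholder helper: a plain JDBC insert
static void insert(Connection conn, String record) throws Exception {
  PreparedStatement ps =
      conn.prepareStatement("INSERT INTO events (value) VALUES (?)");
  try {
    ps.setString(1, record);
    ps.executeUpdate();
  } finally {
    ps.close();
  }
}

Of course this only works when each batch is small enough to collect on
the driver, which I now see is the real limitation.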

Thanks for answering my questions. That helps me understand the system.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Thu, Jul 17, 2014 at 2:53 PM, Sean Owen <so...@cloudera.com> wrote:

> On Thu, Jul 17, 2014 at 10:39 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > Thank you for the help. If I use TD's approach, it works and there is
> > no exception. The only drawback is that it creates many connections to
> > the DB, which I was trying to avoid.
>
> Connection-like objects aren't data that can be serialized. What would
> it mean to share one connection with N workers? That they all connect
> back to the driver, and go through one DB connection there? This
> defeats the purpose of distributed computing. You want multiple DB
> connections. You can limit the number of partitions if needed.
>
>
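For anyone reading the archive later: as I understand it, the
multiple-connection pattern Sean means (and what I take TD's suggestion to
be) looks roughly like the sketch below. Again simplified: it reuses the
placeholder insert() helper from my sketch above, and coalesce(4) is just
an arbitrary cap on partitions, and therefore on simultaneous connections
per batch.

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

stream.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) throws Exception {
    rdd.coalesce(4).foreachPartition(new VoidFunction<Iterator<String>>() {
      @Override
      public void call(Iterator<String> records) throws Exception {
        // one connection per partition, opened on the executor itself,
        // so no connection object is ever serialized or shipped
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:test");
        try {
          while (records.hasNext()) {
            insert(conn, records.next());
          }
        } finally {
          conn.close();
        }
      }
    });
    return null;
  }
});
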
> > Here is a snapshot of my code; the important parts are marked in red.
> > What I was thinking is that, if I call the collect() method, Spark
> > Streaming will bring the data to the driver, and then the db object
> > does not need to be sent
>
> The Function you pass to foreachRDD() has a reference to db, though.
> That's what is causing it to be serialized.
>
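
If I follow, the offending shape in my code is something like the sketch
below (class and field names invented for illustration): reading a field
of the enclosing class makes the anonymous Function capture the whole
outer object, non-serializable connection and all.

import java.sql.Connection;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;

public class StreamingJob {
  private Connection db;  // java.sql.Connection is not serializable

  void wire(JavaDStream<String> stream) {
    stream.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) throws Exception {
        for (String record : rdd.collect()) {
          // reading the field captures `this`, so Spark tries to
          // serialize the whole StreamingJob, db included
          insert(db, record);  // placeholder helper from my first sketch
        }
        return null;
      }
    });
  }
}

So the fix is either to open the connection inside call(), as in my first
sketch, or to go per-partition as above.
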
> > to executors. My observation is that, though exceptions are thrown,
> > the insert function still works. Any thoughts about that? I also
> > pasted the log in case it helps: http://pastebin.com/T1bYvLWB
>
> Any executors that run locally might skip the serialization and
> succeed (?), but I don't think the remote executors can be succeeding.
>
