Instead of foreach, try using foreachPartition, which will initialize the
connector once per partition rather than once per record.
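
A minimal sketch of that pattern, assuming a hypothetical Connector class
with save() and close() methods:

    rdd.foreachPartition { records =>
      // The connector is constructed inside the closure, on the worker,
      // so it never needs to be serialized - one instance per partition.
      val connector = new Connector()
      records.foreach(record => connector.save(record))
      connector.close()
    }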

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:13 PM, Dawid Wysakowicz <
wysakowicz.da...@gmail.com> wrote:

> No, the connector does not need to be serializable because it is constructed
> on the worker. Only objects shuffled across partitions need to be
> serializable.
>
> 2015-08-14 9:40 GMT+02:00 mark <manwoodv...@googlemail.com>:
>
>> I guess I'm looking for a more general way to use complex graphs of
>> non-serializable objects in a task executing on a worker, not just
>> DB connectors. Something like shipping jars to the worker, maybe?
>>
>> I'm not sure I understand how your foreach example solves the issue - the
>> Connector there would still need to be serializable, surely?
>>
>> Thanks
>> On 14 Aug 2015 8:32 am, "Dawid Wysakowicz" <wysakowicz.da...@gmail.com>
>> wrote:
>>
>>> I am not an expert, but first check whether a ready-made connector already
>>> exists (you mentioned Cassandra - check: spark-cassandra-connector
>>> <https://github.com/datastax/spark-cassandra-connector> ).
>>>
>>> If you really want to do something on your own, all objects constructed in
>>> the passed function will be allocated on the worker.
>>> For example:
>>>
>>> sc.parallelize(1 to 100).foreach(x => new Connector().save(x))
>>>
>>> but this way you allocate resources frequently (once per record).
>>>
>>> 2015-08-14 9:05 GMT+02:00 mark <manwoodv...@googlemail.com>:
>>>
>>>> I have a Spark job that computes some values and needs to write those
>>>> values to a data store. The classes that write to the data store are not
>>>> serializable (e.g., Cassandra session objects).
>>>>
>>>> I don't want to collect all the results at the driver; I want each
>>>> worker to write its own data. What is the suggested approach for using
>>>> code that can't be serialized in a task?
>>>>
>>>
>>>
>
