Instead of foreach, try foreachPartition - that way you initialize the connector once per partition rather than once per record.
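A minimal sketch, reusing the Connector from Dawid's example below (the close() call assumes your connector exposes some way to release its resources):

sc.parallelize(1 to 100).foreachPartition { partition =>
  // Constructed on the worker, once per partition - no serialization needed
  val connector = new Connector()
  partition.foreach(x => connector.save(x))
  connector.close() // hypothetical cleanup method - release the resource when done
}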
Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:13 PM, Dawid Wysakowicz <wysakowicz.da...@gmail.com> wrote:

> No, the connector does not need to be serializable, because it is constructed
> on the worker. Only objects shuffled across partitions need to be
> serializable.
>
> 2015-08-14 9:40 GMT+02:00 mark <manwoodv...@googlemail.com>:
>
>> I guess I'm looking for a more general way to use complex graphs of
>> objects that cannot be serialized in a task executing on a worker, not
>> just DB connectors. Something like shipping jars to the worker, maybe?
>>
>> I'm not sure I understand how your foreach example solves the issue - the
>> Connector there would still need to be serializable, surely?
>>
>> Thanks
>> On 14 Aug 2015 8:32 am, "Dawid Wysakowicz" <wysakowicz.da...@gmail.com> wrote:
>>
>>> I am not an expert, but first of all check whether there is a ready-made
>>> connector (you mentioned Cassandra - check spark-cassandra-connector
>>> <https://github.com/datastax/spark-cassandra-connector>).
>>>
>>> If you really want to do something on your own, all objects constructed
>>> in the passed function will be allocated on the worker. For example:
>>>
>>> sc.parallelize(1 to 100).foreach(x => new Connector().save(x))
>>>
>>> but this way you allocate resources frequently - once per record.
>>>
>>> 2015-08-14 9:05 GMT+02:00 mark <manwoodv...@googlemail.com>:
>>>
>>>> I have a Spark job that computes some values and needs to write those
>>>> values to a data store. The classes that write to the data store are not
>>>> serializable (e.g. Cassandra session objects etc.).
>>>>
>>>> I don't want to collect all the results at the driver - I want each
>>>> worker to write the data. What is the suggested approach for using code
>>>> that can't be serialized in a task?