Re: Using unserializable classes in tasks
Instead of foreach, try foreachPartition. That will initialize the connector once per partition rather than once per record.

Thanks
Best Regards
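For illustration, a minimal sketch of that pattern, reusing the hypothetical Connector class from the example further down the thread (the close() call is an assumption; it presumes the connector holds resources worth releasing):

// Sketch only: Connector is the hypothetical class from the example below,
// not a real library API.
sc.parallelize(1 to 100).foreachPartition { records =>
  val connector = new Connector() // constructed once per partition, on the worker
  try {
    records.foreach(x => connector.save(x))
  } finally {
    connector.close() // assumed method; release per-partition resources
  }
}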
Fwd: Using unserializable classes in tasks
-- Forwarded message --
From: Dawid Wysakowicz <wysakowicz.da...@gmail.com>
Date: 2015-08-14 9:32 GMT+02:00
Subject: Re: Using unserializable classes in tasks
To: mark <manwoodv...@googlemail.com>

I am not an expert, but first of all check whether there is a ready-made connector (you mentioned Cassandra; see the spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector). If you really want to do something on your own: all objects constructed in the passed function will be allocated on the worker. For example:

sc.parallelize(1 to 100).foreach(x => new Connector().save(x))

but this way you allocate the resources frequently, once per record.
Using unserializable classes in tasks
I have a Spark job that computes some values and needs to write those values to a data store. The classes that write to the data store are not serializable (e.g., Cassandra session objects). I don't want to collect all the results at the driver; I want each worker to write its data. What is the suggested approach for using code that can't be serialized in a task?
Re: Using unserializable classes in tasks
No, the connector does not need to be serializable, because it is constructed on the worker. Only objects shuffled across partitions need to be serializable.

2015-08-14 9:40 GMT+02:00 mark <manwoodv...@googlemail.com>:

> I guess I'm looking for a more general way to use complex graphs of objects that cannot be serialized in a task executing on a worker, not just DB connectors. Something like shipping jars to the worker, maybe?
>
> I'm not sure I understand how your foreach example solves the issue - the Connector there would still need to be serializable, surely?
>
> Thanks
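To make the distinction concrete, here is a sketch (again using the hypothetical Connector). A connector created on the driver is captured by the closure and must be serialized to ship the task; one created inside the function is instantiated on the worker and never serialized:

// Fails: driverConnector is created on the driver and captured by the
// closure, so Spark raises a 'Task not serializable' error.
val driverConnector = new Connector()
sc.parallelize(1 to 100).foreach(x => driverConnector.save(x))

// Works: the Connector is constructed inside the function, so each task
// builds it locally on the worker and nothing non-serializable is shipped.
sc.parallelize(1 to 100).foreach(x => new Connector().save(x))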