Re: Using unserializable classes in tasks

2015-08-25 Thread Akhil Das
Instead of foreach, try foreachPartition, which lets you initialize the
connector once per partition rather than once per record.
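
For example, a rough sketch (Connector here is just the placeholder class
from the example quoted further down in this thread, not a real API):

sc.parallelize(1 to 100).foreachPartition { records =>
  // one Connector per partition, created on the worker
  val connector = new Connector()
  records.foreach(x => connector.save(x))
  // release the connector here if it holds resources, e.g. connector.close()
}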

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:13 PM, Dawid Wysakowicz 
wysakowicz.da...@gmail.com wrote:

 No, the connector does not need to be serializable because it is constructed
 on the worker. Only objects shuffled across partitions need to be
 serializable.

 2015-08-14 9:40 GMT+02:00 mark manwoodv...@googlemail.com:

 I guess I'm looking for a more general way to use complex graphs of
 objects that cannot be serialized in a task executing on a worker, not just
 DB connectors. Something like shipping jars to the worker maybe?

 I'm not sure I understand how your foreach example solves the issue - the
 Connector there would still need to be serializable surely?

 Thanks
 On 14 Aug 2015 8:32 am, Dawid Wysakowicz wysakowicz.da...@gmail.com
 wrote:

 I am not an expert, but first check whether a ready-made connector already
 exists (you mentioned Cassandra - see spark-cassandra-connector:
 https://github.com/datastax/spark-cassandra-connector ).

 If you really want to do something on your own, note that all objects
 constructed in the passed function will be allocated on the worker.
 For example:

 sc.parallelize(1 to 100).foreach(x => new Connector().save(x))
  but this way you allocate a new connector for every record

 2015-08-14 9:05 GMT+02:00 mark manwoodv...@googlemail.com:

 I have a Spark job that computes some values and needs to write those
 values to a data store. The classes that write to the data store are not
 serializable (e.g. Cassandra session objects, etc.).

 I don't want to collect all the results at the driver, I want each
 worker to write the data - what is the suggested approach for using code
 that can't be serialized in a task?






Fwd: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
-- Forwarded message --
From: Dawid Wysakowicz wysakowicz.da...@gmail.com
Date: 2015-08-14 9:32 GMT+02:00
Subject: Re: Using unserializable classes in tasks
To: mark manwoodv...@googlemail.com


I am not an expert, but first check whether a ready-made connector already
exists (you mentioned Cassandra - see spark-cassandra-connector:
https://github.com/datastax/spark-cassandra-connector ).

If you really want to do something on your own, note that all objects
constructed in the passed function will be allocated on the worker.
For example:

sc.parallelize(1 to 100).foreach(x => new Connector().save(x))
 but this way you allocate a new connector for every record

2015-08-14 9:05 GMT+02:00 mark manwoodv...@googlemail.com:

 I have a Spark job that computes some values and needs to write those
 values to a data store. The classes that write to the data store are not
 serializable (e.g. Cassandra session objects, etc.).

 I don't want to collect all the results at the driver, I want each worker
 to write the data - what is the suggested approach for using code that
 can't be serialized in a task?



Using unserializable classes in tasks

2015-08-14 Thread mark
I have a Spark job that computes some values and needs to write those
values to a data store. The classes that write to the data store are not
serializable (e.g. Cassandra session objects, etc.).

I don't want to collect all the results at the driver, I want each worker
to write the data - what is the suggested approach for using code that
can't be serialized in a task?


Re: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
No, the connector does not need to be serializable because it is constructed
on the worker. Only objects shuffled across partitions need to be
serializable.
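
Roughly, the difference looks like this (Connector is still the placeholder
class from the example below; the exact error message may vary):

// Fails: the connector is created on the driver and has to be serialized
// into the task closure, so Spark reports "Task not serializable".
val connector = new Connector()
sc.parallelize(1 to 100).foreach(x => connector.save(x))

// Works: the connector is created inside the closure, i.e. on the worker,
// so it never has to be serialized.
sc.parallelize(1 to 100).foreach(x => new Connector().save(x))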

2015-08-14 9:40 GMT+02:00 mark manwoodv...@googlemail.com:

 I guess I'm looking for a more general way to use complex graphs of
 objects that cannot be serialized in a task executing on a worker, not just
 DB connectors. Something like shipping jars to the worker maybe?

 I'm not sure I understand how your foreach example solves the issue - the
 Connector there would still need to be serializable surely?

 Thanks
 On 14 Aug 2015 8:32 am, Dawid Wysakowicz wysakowicz.da...@gmail.com
 wrote:

 I am not an expert, but first check whether a ready-made connector already
 exists (you mentioned Cassandra - see spark-cassandra-connector:
 https://github.com/datastax/spark-cassandra-connector ).

 If you really want to do something on your own, note that all objects
 constructed in the passed function will be allocated on the worker.
 For example:

 sc.parallelize(1 to 100).foreach(x => new Connector().save(x))
  but this way you allocate a new connector for every record

 2015-08-14 9:05 GMT+02:00 mark manwoodv...@googlemail.com:

 I have a Spark job that computes some values and needs to write those
 values to a data store. The classes that write to the data store are not
 serializable (e.g. Cassandra session objects, etc.).

 I don't want to collect all the results at the driver, I want each
 worker to write the data - what is the suggested approach for using code
 that can't be serialized in a task?