Re: [Beg for help] spark job with very low efficiency

2015-12-21 Thread Zhiliang Zhu
Dear Sab, thank you very much for your kind reply; it is most helpful. On Monday, December 21, 2015 8:49 PM, Sabarish Sasidharan wrote: collect() will bring everything to the driver and is costly. Instead of using collect() + parallelize, you

Re: [Beg for help] spark job with very low efficiency

2015-12-21 Thread Sabarish Sasidharan
collect() will bring everything to the driver and is costly. Instead of using collect() + parallelize, you could use rdd1.checkpoint() along with a cheaper action such as rdd1.count(). You can do this within the for loop. Hopefully you are already using the Kryo serializer. Regards Sab On Mon,
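
For reference, a minimal Scala sketch of the pattern Sab describes: inside an iterative loop, truncate the RDD lineage with checkpoint() and materialize it with a cheap action like count(), instead of round-tripping data through the driver with collect() + parallelize. The checkpoint directory, the initial data, and the per-iteration map are hypothetical stand-ins for the original job's logic; this is not the original poster's code.

    import org.apache.spark.{SparkConf, SparkContext}

    object CheckpointLoopSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("checkpoint-loop-sketch")
          // Kryo serializer, as recommended in the thread.
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        val sc = new SparkContext(conf)

        // Checkpointing needs a directory; on a real cluster this
        // should be reliable storage such as HDFS (path is hypothetical).
        sc.setCheckpointDir("/tmp/spark-checkpoints")

        // Hypothetical starting data standing in for the real input.
        var rdd1 = sc.parallelize(1 to 1000000)

        for (i <- 1 to 10) {
          // Hypothetical per-iteration transformation.
          rdd1 = rdd1.map(_ + 1)

          // Instead of rdd1.collect() + sc.parallelize(...),
          // mark the RDD for checkpointing to cut the lineage...
          rdd1.checkpoint()
          // ...and trigger it with a cheap action; count() keeps
          // the data on the executors rather than shipping it to the driver.
          rdd1.count()
        }

        sc.stop()
      }
    }

Note that checkpoint() must be called before the action that materializes the RDD, since checkpointing happens as part of the first job that computes it.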