Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-31 Thread Sean Owen
cache() won't speed up a single operation on an RDD, since it is computed the same way before it is persisted.
On Thu, Oct 30, 2014 at 7:15 PM, Sameer Farooqui same...@databricks.com wrote:
> By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it as
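Sean's point can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark code: `LazyDataset` is a hypothetical toy class standing in for an RDD, and its `computations` counter is an assumption added purely for illustration. It shows that the first count still has to compute every record even after cache() was called, and that only a later action benefits.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records are recomputed on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the records
        self._cache_requested = False
        self._cached = None
        self.computations = 0          # how many times the records were computed

    def cache(self):
        # Lazy, like persisting an RDD: only marks the data set, computes nothing.
        self._cache_requested = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached        # served from the cache, no recomputation
        self.computations += 1
        records = list(self._compute())
        if self._cache_requested:
            self._cached = records     # stored as a side effect of the first action
        return records

    def count(self):
        return len(self._records())


ds = LazyDataset(lambda: (x * x for x in range(10_000))).cache()
first = ds.count()                     # computes everything, then stores it
after_first = ds.computations
second = ds.count()                    # reuses the stored records
after_second = ds.computations
print(first, second, after_first, after_second)
```

In this toy model the computation counter is the same after the second count as after the first, so caching pays off only when the same data set is used by more than one action.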

Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread shahab
Hi, I noticed that count() on an RDD is the most time-consuming step in many of my queries, as it runs in the driver process rather than on the parallel worker nodes. Is there any way to perform the count in parallel, or at least parallelize it as much as possible? Best, /Shahab

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui
Hi Shahab, Are you running Spark in Local, Standalone, YARN or Mesos mode? If you're running in Standalone/YARN/Mesos, then the .count() action is indeed automatically parallelized across multiple Executors. When you run a .count() on an RDD, it is actually distributing tasks to different
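The idea behind a parallelized count can be sketched in plain Python. This is a rough illustration under stated assumptions, not Spark's implementation: threads stand in for executors, and `count_partition`, `parallel_count`, and the sample partitions are hypothetical names and data chosen for the example. Each task counts one partition, and only the small partial counts are summed at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def count_partition(partition):
    """Count the records in one partition (the work done by a single task)."""
    return sum(1 for _ in partition)


def parallel_count(partitions, workers=4):
    """Count each partition in its own task, then sum the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    # Only these integers come back to the caller, never the records themselves.
    return sum(partial_counts), partial_counts


# Hypothetical data set split into three partitions of unequal size.
partitions = [range(0, 1000), range(1000, 1500), range(1500, 1520)]
total, partial_counts = parallel_count(partitions)
print(partial_counts, total)
```

The final sum is cheap because it touches one number per partition, so in this model the time is dominated by producing and scanning the records inside each task.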

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui
By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it as that could make a big speed improvement.
On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com wrote:
> Hi Shahab, Are you running Spark in Local, Standalone, YARN or

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread Sonal Goyal
Hey Sameer, Wouldn't local[x] run the count in parallel across the x threads?
Best Regards, Sonal
Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal
On Thu, Oct 30, 2014 at 11:42 PM, Sameer Farooqui same...@databricks.com wrote:
> Hi Shahab, Are you running Spark