cache() won't speed up a single operation on an RDD, since the RDD is computed the same way before it is persisted; only subsequent actions on it benefit.
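To illustrate the point, here is a toy model in plain Python (a hypothetical `LazyRDD` class for illustration only, not Spark's API): a lazily evaluated dataset recomputes on every action unless it has been persisted, so `cache()` only pays off from the second action onward.

```python
class LazyRDD:
    """Toy model of a lazily evaluated, optionally cached dataset.
    NOT Spark's implementation -- just illustrates why cache()
    doesn't speed up the first action."""

    def __init__(self, compute_fn):
        self._compute = compute_fn
        self._should_cache = False
        self._cached = None
        self.compute_calls = 0  # how many times the expensive work ran

    def cache(self):
        # Like RDD.cache(): lazy -- it only marks the dataset for
        # persistence; nothing is computed yet.
        self._should_cache = True
        return self

    def count(self):
        if self._cached is not None:
            return len(self._cached)      # served from cache
        self.compute_calls += 1           # the expensive recomputation
        data = self._compute()
        if self._should_cache:
            self._cached = data           # persisted only after the first action
        return len(data)

rdd = LazyRDD(lambda: [x * x for x in range(1000)]).cache()
rdd.count()                   # first action: computes AND persists
rdd.count()                   # second action: served from cache
print(rdd.compute_calls)      # -> 1 (without cache() it would be 2)
```

The same sequence without `cache()` runs the computation twice, which is the behavior the first action always sees either way.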
On Thu, Oct 30, 2014 at 7:15 PM, Sameer Farooqui <same...@databricks.com> wrote:
> By the way, in case you haven't done so, do try to .cache() the RDD before
> running a .count() on it, as that could make a big speed improvement.
>
> On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui <same...@databricks.com> wrote:
>>
>> Hi Shahab,
>>
>> Are you running Spark in Local, Standalone, YARN or Mesos mode?
>>
>> If you're running in Standalone/YARN/Mesos mode, then the .count() action
>> is indeed automatically parallelized across multiple Executors.
>>
>> When you run a .count() on an RDD, Spark actually distributes tasks to
>> different executors; each task does a local count on a local partition,
>> and all the tasks then send their sub-counts back to the driver for the
>> final aggregation. This sounds like the kind of behavior you're looking
>> for.
>>
>> However, in Local mode, everything runs in a single JVM (the driver +
>> executor), so there's no parallelization across Executors.
>>
>> On Thu, Oct 30, 2014 at 10:25 AM, shahab <shahab.mok...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I noticed that the "count" (of RDD) in many of my queries is the most
>>> time-consuming one, as it runs in the "driver" process rather than being
>>> done by parallel worker nodes.
>>>
>>> Is there any way to perform "count" in parallel, or at least parallelize
>>> it as much as possible?
>>>
>>> best,
>>> /Shahab

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
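For what it's worth, the distributed count Sameer describes can be sketched in plain Python (a toy simulation using a thread pool, not Spark code): each "executor" counts only its local partition, and the "driver" does nothing but the cheap final sum of the sub-counts.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_count(data, num_partitions=4):
    """Toy simulation of RDD.count(): per-partition sub-counts,
    aggregated by the 'driver'. Not Spark's implementation."""
    # Split the data into partitions, like an RDD's partitions.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # Each 'executor' task counts only its own local partition.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        sub_counts = list(pool.map(len, partitions))
    # The driver's only job is the cheap final aggregation.
    return sum(sub_counts)

print(parallel_count(list(range(100_000))))  # -> 100000
```

In a real Standalone/YARN/Mesos deployment the per-partition work runs on separate executor JVMs across the cluster, which is why only the small sub-counts, not the data, travel back to the driver.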