cache() won't speed up a single operation on an RDD, since the RDD is
computed the same way before it is persisted. The benefit only shows up
when a later action reuses the cached data.
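
For example, here is a rough spark-shell sketch (the input path and the
timing helper are just placeholders for illustration; `sc` is the
SparkContext that spark-shell provides) showing that only the second
action benefits from the cache:

    import org.apache.spark.storage.StorageLevel

    // Placeholder input; any RDD that is expensive to compute behaves the same way.
    val words = sc.textFile("hdfs:///some/large/input").flatMap(_.split(" "))

    // Tiny helper, just to make the difference visible in the shell.
    def timed[T](body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"took ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    words.persist(StorageLevel.MEMORY_ONLY)  // equivalent to words.cache()

    timed { words.count() }  // 1st count: computes everything, then fills the cache
    timed { words.count() }  // 2nd count: served from the cached partitions, much faster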

On Thu, Oct 30, 2014 at 7:15 PM, Sameer Farooqui <same...@databricks.com> wrote:
> By the way, in case you haven't done so, do try to .cache() the RDD before
> running a .count() on it as that could make a big speed improvement.
>
>
>
> On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui <same...@databricks.com>
> wrote:
>>
>> Hi Shahab,
>>
>> Are you running Spark in Local, Standalone, YARN or Mesos mode?
>>
>> If you're running in Standalone/YARN/Mesos, then the .count() action is
>> indeed automatically parallelized across multiple Executors.
>>
>> When you run a .count() on an RDD, Spark actually distributes tasks to the
>> different Executors; each task does a local count on its own partition and
>> then sends its sub-count back to the driver for the final aggregation. This
>> sounds like the kind of behavior you're looking for.
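
Conceptually (this is just a sketch of the behavior described above, not
Spark's actual implementation), that distributed count is equivalent to
something like the following, reusing the `words` RDD from the earlier
sketch:

    // Each task counts the elements of its own partition on an Executor...
    val subCounts = words.mapPartitions(iter => Iterator(iter.size.toLong))
    // ...and the per-partition sub-counts are combined into the final total at the driver.
    val total = subCounts.reduce(_ + _)   // same result as words.count()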
>>
>> However, in Local mode, everything runs in a single JVM (the driver +
>> executor), so there's no parallelization across Executors.
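
In a standalone application (outside spark-shell) the mode is chosen when
the SparkContext is created; a sketch, where the app name and the
standalone master URL are only placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Local mode: driver and executor share one JVM; "local[4]" means up to 4 task threads.
    // For a standalone cluster you would instead use something like
    // .setMaster("spark://master-host:7077"), so tasks run on the cluster's Executors.
    val conf = new SparkConf().setAppName("count-demo").setMaster("local[4]")
    val sc = new SparkContext(conf)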
>>
>>
>>
>> On Thu, Oct 30, 2014 at 10:25 AM, shahab <shahab.mok...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I noticed that the "count" (of an RDD) in many of my queries is the most
>>> time-consuming part, as it runs in the "driver" process rather than being
>>> done in parallel by the worker nodes.
>>>
>>> Is there any way to perform "count" in parallel, or at least parallelize
>>> it as much as possible?
>>>
>>> best,
>>> /Shahab
>>
>>
>
