Unless you co-locate Spark executor processes on the machines where the
data is stored, and use an RDD that knows which node each task should
preferably be scheduled on, then yes, the data will be pulled over the
network.

Of the options you listed, S3 and DynamoDB are hosted services, so you
cannot run Spark on the machines that hold the data. Cassandra can run on
the same nodes as Spark, and recent versions of the Spark Cassandra
connector implement preferred locations. You can run an RDBMS on the same
nodes as Spark, but JdbcRDD doesn't implement preferred locations, so the
scheduler has no locality information to act on.
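
To make the distinction concrete, here is a toy model in plain Python. It
is not Spark code and not how Spark's scheduler is implemented: Partition,
assign_tasks, non_local_tasks and the host names are all made up for
illustration. It only shows the idea that a task can be placed next to its
data when the source reports where each partition lives, and cannot when
the source reports nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for one partition of an RDD."""
    index: int
    preferred_hosts: tuple  # hosts holding this partition's data; () if unknown


def assign_tasks(partitions, executor_hosts):
    """Map each partition index to (host, placed_with_locality).

    Locality-first rule: use the first preferred host that also runs an
    executor; otherwise fall back to round-robin over the executors.
    """
    assignments = {}
    for position, part in enumerate(partitions):
        local = [h for h in part.preferred_hosts if h in executor_hosts]
        if local:
            assignments[part.index] = (local[0], True)
        else:
            host = executor_hosts[position % len(executor_hosts)]
            assignments[part.index] = (host, False)
    return assignments


def non_local_tasks(assignments):
    """Count tasks placed without locality info (reads likely cross the network)."""
    return sum(1 for _, with_locality in assignments.values() if not with_locality)


executors = ["node1", "node2", "node3"]

# Source that reports where each partition lives and shares nodes with Spark
# (the co-located Cassandra case described above).
cassandra_like = [
    Partition(0, ("node1",)),
    Partition(1, ("node2",)),
    Partition(2, ("node3",)),
    Partition(3, ("node1",)),
]

# Source that reports no preferred locations (the S3, DynamoDB and JdbcRDD
# cases described above).
s3_like = [Partition(i, ()) for i in range(4)]

print(non_local_tasks(assign_tasks(cassandra_like, executors)))
print(non_local_tasks(assign_tasks(s3_like, executors)))
```

In this sketch the first source ends up with every task placed on a node
that holds its data, while the second has every task placed without any
locality information. In both cases each task reads only its own
partition.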

On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter <franc.car...@rozettatech.com>
wrote:

>
> Hi,
>
> I'm trying to understand how a Spark Cluster behaves when the data it is
> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
> RDBMS etc).
>
> Does every node in the cluster retrieve all the data from the central
> store ?
>
> thanks
>
> --
>
> *Franc Carter* | Systems Architect | Rozetta Technology
>
> franc.car...@rozettatech.com  <franc.car...@rozettatech.com>|
> www.rozettatechnology.com
>
> Tel: +61 2 8355 2515
>
> Level 4, 55 Harrington St, The Rocks NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
> AUSTRALIA
>
>
