Thanks, that's what I suspected.

cheers

On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Unless you are both co-locating Spark executor processes on the machines
> where the data is stored and using an RDD that knows which node to prefer
> when scheduling each task, yes, the data will be pulled over the network.
>
> Of the options you listed, S3 and DynamoDB cannot have Spark running on
> the same machines. Cassandra can be run on the same nodes as Spark, and
> recent versions of the Spark Cassandra Connector implement preferred
> locations. You can run an RDBMS on the same nodes as Spark, but JdbcRDD
> doesn't implement preferred locations.
>
> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter <franc.car...@rozettatech.com> wrote:
>
>>
>> Hi,
>>
>> I'm trying to understand how a Spark cluster behaves when the data it is
>> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
>> RDBMS, etc.).
>>
>> Does every node in the cluster retrieve all the data from the central
>> store?
>>
>> thanks
>>
>> --
>>
>> *Franc Carter* | Systems Architect | Rozetta Technology
>>
>> franc.car...@rozettatech.com  <franc.car...@rozettatech.com>|
>> www.rozettatechnology.com
>>
>> Tel: +61 2 8355 2515
>>
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>> AUSTRALIA
>>
>>
>
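
For anyone finding this thread later, the locality point quoted above can be sketched as a toy model in plain Python. This is not Spark's scheduler and uses no Spark API: Partition, schedule, remote_reads and the node names are hypothetical, made up only for illustration. It assumes each partition is handled by one task, and that a task counts as a local read only when the partition reports a preferred host and an executor runs on that host.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Partition:
    index: int
    # Hosts that hold this partition's data. An empty tuple models a source
    # with no locality information (a remote store, or an RDD that does not
    # report preferred locations).
    preferred_hosts: tuple = ()


def schedule(partitions, executor_hosts):
    """Assign each partition to exactly one executor host.

    Toy policy (an assumption, not Spark's real algorithm): pick a preferred
    host if an executor runs there, otherwise fall back to round-robin.
    Returns a list of (partition index, host, is_local) tuples.
    """
    fallback = cycle(executor_hosts)
    assignments = []
    for p in partitions:
        local = [h for h in p.preferred_hosts if h in executor_hosts]
        if local:
            assignments.append((p.index, local[0], True))
        else:
            assignments.append((p.index, next(fallback), False))
    return assignments


def remote_reads(partitions, executor_hosts):
    """Count the tasks that would have to pull their partition over the network."""
    return sum(1 for _, _, is_local in schedule(partitions, executor_hosts)
               if not is_local)


executors = ["node1", "node2", "node3"]

# Store co-located with the executors, and partitions report where they live.
colocated = [Partition(i, (executors[i % 3],)) for i in range(6)]

# Remote store, or an RDD without preferred locations.
remote = [Partition(i) for i in range(6)]

print(remote_reads(colocated, executors))  # 0
print(remote_reads(remote, executors))     # 6
```

In both cases every partition is assigned to a single task, so in this model no node fetches the whole dataset. The only difference is whether each task reads its partition locally or over the network.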


