One more question, to clarify: will every node pull in all the data?

Thanks
On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger <c...@koeninger.org> wrote:

> If you are not co-locating spark executor processes on the same machines
> where the data is stored, and using an rdd that knows about which node to
> prefer scheduling a task on, yes, the data will be pulled over the network.
>
> Of the options you listed, S3 and DynamoDB cannot have spark running on
> the same machines. Cassandra can be run on the same nodes as spark, and
> recent versions of the spark cassandra connector implement preferred
> locations. You can run an rdbms on the same nodes as spark, but JdbcRDD
> doesn't implement preferred locations.
>
> On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter <franc.car...@rozettatech.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to understand how a Spark Cluster behaves when the data it is
>> processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
>> RDBMS etc).
>>
>> Does every node in the cluster retrieve all the data from the central
>> store ?
>>
>> thanks
>>
>> --
>>
>> *Franc Carter* | Systems Architect | Rozetta Technology
>>
>> franc.car...@rozettatech.com <franc.car...@rozettatech.com> |
>> www.rozettatechnology.com
>>
>> Tel: +61 2 8355 2515
>>
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>> AUSTRALIA

--
*Franc Carter* | Systems Architect | Rozetta Technology
franc.car...@rozettatech.com <franc.car...@rozettatech.com> | www.rozettatechnology.com
Tel: +61 2 8355 2515
Level 4, 55 Harrington St, The Rocks NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215
AUSTRALIA
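
To make the "preferred locations" point in the quoted reply concrete, here is a toy model in plain Python. It is not Spark code and not how Spark's scheduler is actually implemented; the names `Partition`, `assign`, and the `node-*` hosts are made up for illustration. It only sketches the idea that an RDD is split into partitions, each partition is read by one task, and a task can read locally only when its partition reports a preferred host that also runs an executor.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    index: int
    # Hosts that hold this partition's data; empty means "no locality info",
    # which is the situation described for S3, DynamoDB and JdbcRDD.
    preferred_hosts: list = field(default_factory=list)


def assign(partitions, executor_hosts):
    """Toy placement: run the task on a preferred host if an executor lives
    there, otherwise fall back to round-robin and read over the network."""
    plan = []
    for p in partitions:
        local = [h for h in p.preferred_hosts if h in executor_hosts]
        if local:
            plan.append((p.index, local[0], "local"))
        else:
            host = executor_hosts[p.index % len(executor_hosts)]
            plan.append((p.index, host, "network"))
    return plan


executors = ["node-a", "node-b", "node-c"]

# Store co-located with the executors and reporting where each partition lives.
colocated = [Partition(i, [executors[i % 3]]) for i in range(6)]
# Remote store: partitions carry no preferred hosts.
remote = [Partition(i) for i in range(6)]

colocated_plan = assign(colocated, executors)
remote_plan = assign(remote, executors)

for name, plan in (("co-located", colocated_plan), ("remote", remote_plan)):
    print(name)
    for index, host, kind in plan:
        print(f"  partition {index} -> {host} ({kind} read)")
```

In both plans each partition is assigned to exactly one host, so in this model no node reads the whole data set; the difference is only whether a read is local or goes over the network.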