No, most RDDs partition the input data appropriately.
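
As a rough illustration of what "partitioning the input" can look like for a store that supports range queries: the sketch below splits an inclusive key range into contiguous sub-ranges, one per task, so each task fetches only its own slice. It is loosely modelled on the bounds-based splitting that range-partitioned RDDs such as JdbcRDD use, as far as I understand it; the function name and the example bounds are made up for illustration and are not Spark API.

```python
# Sketch only: split an inclusive key range [lower, upper] into contiguous
# sub-ranges, one per partition. Names here are hypothetical, not Spark API.

def split_range(lower, upper, num_partitions):
    """Return a list of inclusive (start, end) bounds, one per partition."""
    length = upper - lower + 1
    bounds = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds


# Each task would fetch only its own sub-range, e.g. with a query like
# "... WHERE id >= start AND id <= end", rather than the whole table.
partitions = split_range(1, 100, 4)
for index, (start, end) in enumerate(partitions):
    print(f"partition {index}: ids {start}..{end}")
```

With 4 partitions over ids 1..100, each task reads 25 ids and no node pulls the full data set. A store that cannot serve such range queries efficiently would need a different splitting scheme.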
On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter franc.car...@rozettatech.com
wrote:
One more question, to clarify: will every node pull in all the data?
thanks
On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org
wrote:
If
Ah, so it's RDD-specific - that would make sense. For those systems where
it is possible to extract sensible subsets, the RDDs do so. My use case,
which is probably biasing my thinking, is DynamoDB, which I don't think can
efficiently extract records from M-to-N
cheers
On Wed, Jan 7, 2015 at 6:59
Unless you are co-locating Spark executor processes on the same machines
where the data is stored, and using an RDD that knows which node to
prefer when scheduling a task, then yes, the data will be pulled over the network.
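
To make the locality point concrete, here is a toy model (not Spark's scheduler, which is considerably more involved) of what happens when an RDD reports preferred hosts for each partition. The host names and the place_tasks function are hypothetical; in Spark itself the corresponding hook is the RDD's preferred-locations method.

```python
# Sketch only: a toy model of locality-aware task placement.
# The host names and the function below are hypothetical, not Spark API.

def place_tasks(preferred_hosts, executor_hosts):
    """Map each partition to (host, is_local).

    preferred_hosts: dict of partition id -> list of hosts holding that data
    executor_hosts: list of hosts that run an executor
    """
    placement = {}
    for i, (partition, hosts) in enumerate(sorted(preferred_hosts.items())):
        local = [h for h in hosts if h in executor_hosts]
        if local:
            placement[partition] = (local[0], True)
        else:
            # No executor sits next to the data: read it over the network.
            fallback = executor_hosts[i % len(executor_hosts)]
            placement[partition] = (fallback, False)
    return placement


# Co-located storage (executors run on the storage nodes): reads are local.
colocated = place_tasks({0: ["node-a"], 1: ["node-b"]}, ["node-a", "node-b"])

# Remote store: no preferred host runs an executor, so every read is remote.
remote = place_tasks({0: [], 1: []}, ["node-a", "node-b"])

print(colocated)
print(remote)
```

In the remote case each partition is still read by only one task; the data crosses the network once per partition, not once per node.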
Of the options you listed, S3 and DynamoDB cannot have spark running on
Thanks, that's what I suspected.
cheers
On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org wrote:
Unless you are co-locating Spark executor processes on the same machines
where the data is stored, and using an RDD that knows which node to
prefer when scheduling a task,
Hi,
I'm trying to understand how a Spark cluster behaves when the data it is
processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
RDBMS, etc.).
Does every node in the cluster retrieve all the data from the central store?
thanks
--
*Franc Carter* | Systems Architect |