Re: Reading from a centralized stored

2015-01-06 Thread Cody Koeninger
No, most RDDs partition the input data appropriately, so each node reads
only the partitions assigned to its tasks.
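
As an illustration of what partitioning the input can mean for a source that is addressable by key range, here is a minimal sketch in plain Python (not Spark code). The function name `split_range` is hypothetical. The arithmetic follows the style of bounds-based splitting suggested by JdbcRDD's lowerBound, upperBound and numPartitions parameters, but treat it as an assumption rather than Spark's exact implementation.

```python
def split_range(lower, upper, num_partitions):
    """Split the inclusive key range [lower, upper] into contiguous sub-ranges.

    Hypothetical helper: each (start, end) pair stands for the slice of
    keys that a single task would read from the remote store. When there
    are more partitions than keys, some pairs have end < start, which
    denotes an empty slice.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper < lower:
        raise ValueError("upper must not be smaller than lower")
    length = upper - lower + 1  # number of keys in the inclusive range
    bounds = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds


if __name__ == "__main__":
    for index, (start, end) in enumerate(split_range(1, 100, 4)):
        print(f"task {index} reads keys {start}..{end}")
```

Each task would then query only its own sub-range, so no single node has to pull in the whole data set.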

On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter franc.car...@rozettatech.com
wrote:


 One more question, to clarify: will every node pull in all the data?

 thanks

 On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Unless you are co-locating Spark executor processes on the same machines
 where the data is stored, and using an RDD that knows which node to prefer
 when scheduling a task, yes, the data will be pulled over the network.

 Of the options you listed, S3 and DynamoDB cannot have Spark running on
 the same machines. Cassandra can be run on the same nodes as Spark, and
 recent versions of the Spark Cassandra connector implement preferred
 locations. You can run an RDBMS on the same nodes as Spark, but JdbcRDD
 doesn't implement preferred locations.

 On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter 
 franc.car...@rozettatech.com wrote:


 Hi,

 I'm trying to understand how a Spark cluster behaves when the data it is
 processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
 RDBMS, etc.).

 Does every node in the cluster retrieve all the data from the central
 store?

 thanks

 --

 *Franc Carter* | Systems Architect | Rozetta Technology

 franc.car...@rozettatech.com  franc.car...@rozettatech.com|
 www.rozettatechnology.com

 Tel: +61 2 8355 2515

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215

 AUSTRALIA









Re: Reading from a centralized stored

2015-01-06 Thread Franc Carter
Ah, so it's RDD-specific; that makes sense. For those systems where it is
possible to extract sensible subsets, the RDDs do so. My use case, which is
probably biasing my thinking, is DynamoDB, which I don't think can
efficiently extract records M to N.

cheers








Re: Reading from a centralized stored

2015-01-05 Thread Cody Koeninger
Unless you are co-locating Spark executor processes on the same machines
where the data is stored, and using an RDD that knows which node to prefer
when scheduling a task, yes, the data will be pulled over the network.

Of the options you listed, S3 and DynamoDB cannot have Spark running on the
same machines. Cassandra can be run on the same nodes as Spark, and recent
versions of the Spark Cassandra connector implement preferred locations.
You can run an RDBMS on the same nodes as Spark, but JdbcRDD doesn't
implement preferred locations.
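
To make the role of preferred locations concrete, here is a toy model in plain Python. It is not Spark's scheduler, which also deals with locality levels and wait timeouts; `Partition`, `assign_tasks` and the host names are hypothetical. In Spark itself, an RDD reports this information through its `getPreferredLocations` method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Partition:
    index: int
    # Hosts that hold this partition's data; empty when the source cannot
    # say (for example a remote service with no co-located executors).
    preferred_hosts: List[str] = field(default_factory=list)


def assign_tasks(partitions: List[Partition],
                 executor_hosts: List[str]) -> Dict[int, Tuple[str, bool]]:
    """Map each partition index to (host, is_local)."""
    if not executor_hosts:
        raise ValueError("need at least one executor host")
    assignments = {}
    for position, part in enumerate(partitions):
        local = [h for h in part.preferred_hosts if h in executor_hosts]
        if local:
            # Data and executor share a machine: no network transfer needed.
            assignments[part.index] = (local[0], True)
        else:
            # No usable preference: pick a host round-robin, read remotely.
            host = executor_hosts[position % len(executor_hosts)]
            assignments[part.index] = (host, False)
    return assignments


if __name__ == "__main__":
    executors = ["node-a", "node-b"]
    co_located = [Partition(0, ["node-a"]), Partition(1, ["node-b"])]
    remote_only = [Partition(0), Partition(1)]
    print(assign_tasks(co_located, executors))
    print(assign_tasks(remote_only, executors))
```

In the second call every partition is still read by exactly one task, but each read crosses the network, which corresponds to the S3 and DynamoDB cases described above.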





Re: Reading from a centralized stored

2015-01-05 Thread Franc Carter
Thanks, that's what I suspected.

cheers








Reading from a centralized stored

2015-01-05 Thread Franc Carter
Hi,

I'm trying to understand how a Spark cluster behaves when the data it is
processing resides on a centralized/remote store (S3, Cassandra, DynamoDB,
RDBMS, etc.).

Does every node in the cluster retrieve all the data from the central
store?

thanks
