Re: Getting number of physical machines in Spark

2015-08-28 Thread Alexey Grishchenko
There's no canonical way to do this, as far as I understand. For instance, when
running under YARN, you have no idea in advance where your containers will
be started. Moreover, if one of the containers fails, it might be
restarted on another machine, so the number of machines might change at runtime.

To check the current number of machines, you can do something like this in
Python:

import socket

# One host name per partition, deduplicated across the cluster
machines = sc.parallelize(xrange(1000)) \
    .mapPartitions(lambda x: [socket.gethostname()]) \
    .distinct() \
    .collect()
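For readers without a cluster handy, here is a minimal, Spark-free mock of what
that pipeline computes (the host names are made up for the illustration): each
partition emits the name of the host it ran on, and distinct() collapses the
repeats to one entry per machine. Note the result can undercount if the job is
scheduled with fewer partitions than there are executors, which is one reason
to parallelize plenty of elements.

```python
# Hypothetical placement of 8 partitions across 3 worker hosts.
placement = ["node-a", "node-b", "node-c", "node-a",
             "node-b", "node-c", "node-a", "node-b"]

# mapPartitions(lambda x: [socket.gethostname()]) -> one name per partition
# distinct().collect()                            -> unique machine names
machines = sorted(set(placement))
print(machines)       # ['node-a', 'node-b', 'node-c']
print(len(machines))  # 3
```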


On Fri, Aug 28, 2015 at 9:09 PM, Jason ja...@jasonknight.us wrote:

 I've wanted similar functionality too: when network-IO bound (for me it was
 pulling things from S3 to HDFS) I wish there were a `.mapMachines` API so I
 wouldn't have to guess at the proper partitioning of a 'driver' RDD for
 `sc.parallelize(1 to N, N).map( i => pull the i'th chunk from S3 )`.

 On Thu, Aug 27, 2015 at 10:01 AM Young, Matthew T 
 matthew.t.yo...@intel.com wrote:

 What’s the canonical way to find out the number of physical machines in a
 cluster at runtime in Spark? I believe SparkContext.defaultParallelism will
 give me the number of cores, but I’m interested in the number of NICs.



 I’m writing a Spark streaming application to ingest from Kafka with the
 Receiver API and want to create one DStream per physical machine for read
 parallelism’s sake. How can I figure out at run time how many machines
 there are so I know how many DStreams to create?




-- 
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: programme...@gmail.com
web:   http://0x0fff.com


Re: Getting number of physical machines in Spark

2015-08-28 Thread Jason
I've wanted similar functionality too: when network-IO bound (for me it was
pulling things from S3 to HDFS) I wish there were a `.mapMachines` API so I
wouldn't have to guess at the proper partitioning of a 'driver' RDD for
`sc.parallelize(1 to N, N).map( i => pull the i'th chunk from S3 )`.
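Absent a `.mapMachines` API, that workaround boils down to chunking the work by
index so each of the N tasks pulls one contiguous slice. A small, Spark-free
sketch of the chunk arithmetic (the S3 key names are made up; `pull` is a
hypothetical download function):

```python
def chunk(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most 1."""
    k, r = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        out.append(items[start:end])
        start = end
    return out

keys = ["s3://bucket/part-%04d" % i for i in range(10)]
chunks = chunk(keys, 4)
print(chunks)
# Each element of `chunks` would then be handled by one task, e.g.
# sc.parallelize(range(4), 4).map(lambda i: pull(chunks[i]))
```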

On Thu, Aug 27, 2015 at 10:01 AM Young, Matthew T matthew.t.yo...@intel.com
wrote:

 What’s the canonical way to find out the number of physical machines in a
 cluster at runtime in Spark? I believe SparkContext.defaultParallelism will
 give me the number of cores, but I’m interested in the number of NICs.



 I’m writing a Spark streaming application to ingest from Kafka with the
 Receiver API and want to create one DStream per physical machine for read
 parallelism’s sake. How can I figure out at run time how many machines
 there are so I know how many DStreams to create?



Getting number of physical machines in Spark

2015-08-27 Thread Young, Matthew T
What's the canonical way to find out the number of physical machines in a 
cluster at runtime in Spark? I believe SparkContext.defaultParallelism will 
give me the number of cores, but I'm interested in the number of NICs.

I'm writing a Spark streaming application to ingest from Kafka with the 
Receiver API and want to create one DStream per physical machine for read 
parallelism's sake. How can I figure out at run time how many machines there 
are so I know how many DStreams to create?
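One way to act on the machine count from the replies in this thread: given M
machines, create M receiver DStreams and divide the topic's Kafka partitions
among them. A minimal, Spark-free sketch of the round-robin split (the
partition and machine counts are made up for the example):

```python
def split_partitions(num_kafka_partitions, num_machines):
    """Round-robin Kafka partitions across one receiver per machine."""
    assignment = [[] for _ in range(num_machines)]
    for p in range(num_kafka_partitions):
        assignment[p % num_machines].append(p)
    return assignment

# e.g. 8 Kafka partitions spread over 3 machines
print(split_partitions(8, 3))  # [[0, 3, 6], [1, 4, 7], [2, 5]]
```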