There's no canonical way to do this, as far as I know. For instance, when
running under YARN, you have no idea in advance where your containers will
be started. Moreover, if one of the containers fails, it might be
restarted on another machine, so the number of machines might change at
runtime. To check the current number of machines you can do something like
this (Python):
import socket
machines = (sc.parallelize(xrange(1000))
              .mapPartitions(lambda x: [socket.gethostname()])
              .distinct()
              .collect())
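The idea behind the snippet is just "have every partition report its hostname, then deduplicate." Here is a minimal sketch of that logic without a cluster, with plain Python lists standing in for RDD partitions (running locally, every "partition" reports the same host, so the distinct count is 1):

```python
import socket

# Each simulated partition reports the hostname it runs on,
# mirroring mapPartitions(lambda x: [socket.gethostname()]).
def report_hostname(partition):
    return [socket.gethostname()]

# 100 fake partitions; on a real cluster these would be spread
# across executors on different machines.
partitions = [range(i * 10, (i + 1) * 10) for i in range(100)]

# distinct().collect() becomes a set comprehension here.
hostnames = {h for p in partitions for h in report_hostname(p)}
num_machines = len(hostnames)  # 1 when everything runs on one host
```

On a real cluster `len(machines)` from the Spark snippet gives the number of distinct hosts that currently hold executors, which you could then use to size the number of receiver DStreams.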
On Fri, Aug 28, 2015 at 9:09 PM, Jason ja...@jasonknight.us wrote:
I've wanted similar functionality too: when network-IO bound (for me it was
pulling things from S3 to HDFS) I wish there were a `.mapMachines` API so
I wouldn't have to guess at the proper partitioning of a 'driver' RDD for
`sc.parallelize(1 to N, N).map( i => pull the i'th chunk from S3 )`.
On Thu, Aug 27, 2015 at 10:01 AM Young, Matthew T
matthew.t.yo...@intel.com wrote:
What’s the canonical way to find out the number of physical machines in a
cluster at runtime in Spark? I believe SparkContext.defaultParallelism will
give me the number of cores, but I’m interested in the number of NICs.
I’m writing a Spark streaming application to ingest from Kafka with the
Receiver API and want to create one DStream per physical machine for read
parallelism’s sake. How can I figure out at run time how many machines
there are so I know how many DStreams to create?
--
Best regards, Alexey Grishchenko
phone: +353 (87) 262-2154
email: programme...@gmail.com
web: http://0x0fff.com