I've wanted similar functionality too: when network-IO bound (in my case, pulling things from S3 to HDFS) I wished there were a `.mapMachines` API, so I wouldn't have to guess at the proper partitioning of a 'driver' RDD for `sc.parallelize(1 to N, N).map( i => pull the i'th chunk from S3 )`.
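The one-task-per-chunk trick above relies on `parallelize` putting exactly one element in each partition when the number of slices equals the number of elements. A minimal pure-Scala sketch of the slicing rule (mirroring how Spark splits a collection into partitions; the Spark calls themselves need a live `SparkContext`, so only the arithmetic is shown runnable):

```scala
// Sketch of how a collection of `len` elements is split into `numSlices`
// partitions: partition p covers indices [p*len/n, (p+1)*len/n).
// With len == numSlices each partition holds exactly one element,
// which is why sc.parallelize(1 to N, N) gives one chunk index per task.
def slice(len: Int, numSlices: Int): Seq[Range] =
  (0 until numSlices).map { p =>
    val start = (p.toLong * len / numSlices).toInt
    val end   = ((p + 1).toLong * len / numSlices).toInt
    start until end
  }

val slices = slice(100, 100)
assert(slices.length == 100)
assert(slices.forall(_.size == 1)) // one element -> one task per chunk
```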
On Thu, Aug 27, 2015 at 10:01 AM Young, Matthew T <matthew.t.yo...@intel.com> wrote:

> What’s the canonical way to find out the number of physical machines in a cluster at runtime in Spark? I believe SparkContext.defaultParallelism will give me the number of cores, but I’m interested in the number of NICs.
>
> I’m writing a Spark Streaming application to ingest from Kafka with the Receiver API and want to create one DStream per physical machine for read parallelism’s sake. How can I figure out at runtime how many machines there are so I know how many DStreams to create?
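For the question above, one approach I've seen is to derive distinct worker hosts from `sc.getExecutorMemoryStatus` (keyed by "host:port" strings) and create one receiver stream per host, then union them. A hedged sketch, with only the pure host-counting core runnable here; the cluster-side calls (which assume a live `SparkContext` `sc`, a `StreamingContext` `ssc`, and the receiver-based `KafkaUtils.createStream`) are shown in comments:

```scala
// Pure core: reduce executor "host:port" keys to distinct hosts.
def distinctHosts(executorKeys: Iterable[String]): Set[String] =
  executorKeys.map(_.split(":")(0)).toSet

// Hypothetical executor keys for illustration: two hosts, three executors.
val hosts = distinctHosts(Seq("10.0.0.1:45123", "10.0.0.1:45999", "10.0.0.2:43210"))
assert(hosts.size == 2)

// On a live cluster (sketch; the driver also appears in the map, hence -1):
// val numHosts = distinctHosts(sc.getExecutorMemoryStatus.keys).size - 1
// val streams  = (1 to numHosts).map(_ =>
//   KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap))
// val unified  = ssc.union(streams)   // one receiver per machine, merged
```

Note this counts hosts with running executors, not NICs, so it's an approximation of "physical machines".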