I'm not sure the expectation that workers will process multiple partitions in parallel is really valid, or that having more partitions than workers will necessarily help (unless you are memory bound, in which case more partitions mainly shrink the work size per task rather than increase execution parallelism).
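As I understand it (and I may be wrong here), the ceiling on concurrent tasks is the total number of executor cores granted to the application, not the partition count. A rough sketch of what I would check and set first, using standard config keys with placeholder values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: with 4 workers x 4 CPUs the most you can get is 16
    // concurrent tasks, and only if each executor is allowed all 4 cores.
    val conf = new SparkConf()
      .setAppName("solr-enrichment")        // placeholder app name
      .set("spark.executor.cores", "4")     // cores each executor may use
      .set("spark.cores.max", "16")         // total cores the app may claim (standalone mode)

    val sc = new SparkContext(conf)
    // How many tasks Spark will run concurrently by default
    println(s"defaultParallelism = ${sc.defaultParallelism}")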
[Disclaimer: I am no authority on Spark, but wanted to throw in my spin based on my own understanding.] Nothing official about it :)

-abhishek-

> On Jul 31, 2015, at 1:03 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:
>
> Hello,
>
> I am trying to run a Spark job that hits an external webservice to get back
> some information. The cluster is 1 master + 4 workers, each worker has 60GB
> RAM and 4 CPUs. The external webservice is a standalone Solr server, and is
> accessed using code similar to that shown below.
>
>> def getResults(keyValues: Iterator[(String, Array[String])]):
>>     Iterator[(String, String)] = {
>>   val solr = new HttpSolrClient()
>>   initializeSolrParameters(solr)
>>   keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>> }
>> myRDD.repartition(10)
>>   .mapPartitions(keyValues => getResults(keyValues))
>
> The mapPartitions does some initialization of the SolrJ client per partition
> and then hits it for each record in the partition via the getResults() call.
>
> I repartitioned in the hope that this would result in 10 clients hitting Solr
> simultaneously (I would like to go up to maybe 30-40 simultaneous clients if
> I can). However, I counted the number of open connections using
> netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box and
> observed that Solr has a constant 4 clients (i.e., equal to the number of
> workers) over the lifetime of the run.
>
> My observation leads me to believe that each worker processes a single
> stream of work sequentially. However, from what I understand about how Spark
> works, each worker should be able to process a number of tasks in parallel,
> and repartition() is a hint for it to do so.
>
> Is there some SparkConf environment variable I should set to increase
> parallelism in these workers, or should I just configure a cluster with
> multiple workers per machine? Or is there something I am doing wrong?
>
> Thank you in advance for any pointers you can provide.
>
> -sujit
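P.S. If the goal is 30-40 simultaneous Solr connections and you only have ~16 cores in total, one direction to explore (untested sketch, reusing your initializeSolrParameters/process helpers; the pool size and timeout are just placeholders) is to issue the requests concurrently within each partition:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._

    def getResultsConcurrent(keyValues: Iterator[(String, Array[String])])
        : Iterator[(String, String)] = {
      // Small thread pool per task; 8 is a placeholder, tune to what Solr can take
      implicit val ec = ExecutionContext.fromExecutorService(
        Executors.newFixedThreadPool(8))
      val solr = new HttpSolrClient()       // as in your snippet
      initializeSolrParameters(solr)
      val futures = keyValues.map(kv => Future((kv._1, process(solr, kv)))).toList
      // Block until this partition's requests finish; the timeout is arbitrary
      val results = Await.result(Future.sequence(futures), 30.minutes)
      ec.shutdown()
      results.iterator
    }

Whether that's safe depends on the SolrJ client being happy shared across threads, which I haven't verified, so treat it as a direction rather than a recipe.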