I don't know if your expectation that the workers will process multiple partitions 
in parallel is really valid, or whether having more partitions than workers will 
necessarily help (unless you are memory bound, in which case the extra partitions 
mostly shrink the work size per task rather than adding execution parallelism).

[Disclaimer: I am no authority on Spark, but wanted to throw in my spin based on my 
own understanding.]

Nothing official about it :)

-abhishek-

> On Jul 31, 2015, at 1:03 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:
> 
> Hello,
> 
> I am trying to run a Spark job that hits an external webservice to get back 
> some information. The cluster is 1 master + 4 workers, each worker has 60GB 
> RAM and 4 CPUs. The external webservice is a standalone Solr server, and is 
> accessed using code similar to that shown below.
> 
>> // Called once per partition: create one SolrJ client, then reuse it
>> // for every record in the partition.
>> def getResults(keyValues: Iterator[(String, Array[String])]):
>>         Iterator[(String, String)] = {
>>     val solr = new HttpSolrClient()
>>     initializeSolrParameters(solr)
>>     keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>> }
>>
>> myRDD.repartition(10)
>>      .mapPartitions(keyValues => getResults(keyValues))
>  
> The mapPartitions call initializes the SolrJ client once per partition and then 
> hits it for each record in the partition via the getResults() call.
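
That pattern looks right to me. One small sketch of a variant, in case it is useful: 
since Iterator.map is lazy, the client has to stay open until the partition's results 
are actually materialized, so if you ever want to close it per partition you would 
need something like the below (assuming SolrJ 5.x, where the client is Closeable; 
process() and initializeSolrParameters() are your helpers from the snippet above):

    def getResults(keyValues: Iterator[(String, Array[String])]):
            Iterator[(String, String)] = {
      val solr = new HttpSolrClient()   // same constructor as in your snippet
      initializeSolrParameters(solr)
      // Force the lazy map before closing, otherwise the client would be
      // closed while results are still being computed.
      val results = keyValues.map(kv => (kv._1, process(solr, kv))).toList
      solr.close()
      results.iterator
    }
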
> 
> I repartitioned in the hope that this would result in 10 clients hitting Solr 
> simultaneously (I would like to go up to maybe 30-40 simultaneous clients if I 
> can). However, I counted the number of open connections using netstat -anp | 
> grep ":8983.*ESTABLISHED" in a loop on the Solr box and observed that Solr 
> has a constant 4 clients (i.e., equal to the number of workers) over the 
> lifetime of the run.
> 
> My observation leads me to believe that each worker processes a single stream 
> of work sequentially. However, from what I understand about how Spark works, 
> each worker should be able to process a number of tasks in parallel, and 
> repartition() is a hint for it to do so.
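
If you want to see what the scheduler actually has to work with, here is a quick 
check (just a sketch, with sc being your SparkContext and myRDD the RDD from your 
snippet):

    // Partition count says how many tasks will exist; defaultParallelism
    // roughly reflects the total cores available to run them at once.
    val repartitioned = myRDD.repartition(10)
    println("partitions: " + repartitioned.partitions.length)   // should print 10
    println("default parallelism: " + sc.defaultParallelism)

If that second number comes back smaller than you expect, that would at least line 
up with the connection count you are seeing.
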
> 
> Is there some SparkConf setting or environment variable I should set to increase 
> parallelism in these workers, or should I just configure a cluster with 
> multiple workers per machine? Or is there something I am doing wrong?
> 
> Thank you in advance for any pointers you can provide.
> 
> -sujit
> 
