Hello,

I am trying to run a Spark job that hits an external webservice to get back
some information. The cluster is 1 master + 4 workers, each worker has 60GB
RAM and 4 CPUs. The external webservice is a standalone Solr server, and is
accessed using code similar to that shown below.

    def getResults(keyValues: Iterator[(String, Array[String])]): Iterator[(String, String)] = {
      val solr = new HttpSolrClient()
      initializeSolrParameters(solr)
      keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
    }

    myRDD.repartition(10)
         .mapPartitions(keyValues => getResults(keyValues))

The mapPartitions() call initializes the SolrJ client once per partition and
then queries Solr for each record in the partition via getResults().

I repartitioned in the hope that this would result in 10 clients hitting
Solr simultaneously (I would like to go up to maybe 30-40 simultaneous
clients if I can). However, I counted the number of open connections by
running netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box,
and observed that Solr sees a constant 4 connections (i.e., equal to the
number of workers) over the lifetime of the run.

This observation leads me to believe that each worker processes a single
stream of work sequentially. However, from what I understand about how
Spark works, each worker should be able to process multiple tasks in
parallel, and repartition() is a hint for it to do so.
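
In case it helps, here is the kind of sanity check I can run from the driver
to see what Spark itself reports (assuming the SparkContext is available as
sc, and myRDD is the same RDD as above):

    // what Spark derives from the cores granted to the application
    println("default parallelism = " + sc.defaultParallelism)
    // confirm that the repartition itself produces 10 partitions
    println("partitions = " + myRDD.repartition(10).partitions.size)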

Is there a SparkConf setting or environment variable I should set to
increase parallelism on these workers, or should I just configure the
cluster with multiple workers per machine? Or is there something I am doing
wrong?
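
For example, is something along these lines the right direction? The
property names below (spark.executor.cores, spark.cores.max) are just my
reading of the standalone-mode docs, and the app name is made up, so please
correct me if they do not apply here:

    import org.apache.spark.{SparkConf, SparkContext}

    // my guess at the relevant knobs -- not something I have verified
    val conf = new SparkConf()
      .setAppName("solr-lookup-job")        // placeholder name
      .set("spark.executor.cores", "4")     // cores each executor may use for concurrent tasks
      .set("spark.cores.max", "16")         // total cores the application may claim on the cluster
    val sc = new SparkContext(conf)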

Thank you in advance for any pointers you can provide.

-sujit
