Hello, I am trying to run a Spark job that hits an external web service to get back some information. The cluster is 1 master + 4 workers; each worker has 60 GB RAM and 4 CPUs. The external web service is a standalone Solr server, and is accessed using code similar to that shown below.
    def getResults(keyValues: Iterator[(String, Array[String])]):
        Iterator[(String, String)] = {
      val solr = new HttpSolrClient()
      initializeSolrParameters(solr)
      keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
    }

    myRDD.repartition(10)
      .mapPartitions(keyValues => getResults(keyValues))

The mapPartitions call does some one-time initialization of the SolrJ client per partition and then queries Solr for each record in the partition via getResults(). I repartitioned in the hope that this would result in 10 clients hitting Solr simultaneously (I would like to go up to maybe 30-40 simultaneous clients if I can). However, I counted the number of open connections by running netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box, and observed that Solr has a constant 4 clients (i.e., equal to the number of workers) over the lifetime of the run.

This leads me to believe that each worker processes a single stream of work sequentially. However, from what I understand about how Spark works, each worker should be able to process a number of tasks in parallel, and repartition() is a hint for it to do so.

Is there some SparkConf setting or environment variable I should set to increase parallelism on these workers, or should I just configure a cluster with multiple workers per machine? Or is there something I am doing wrong?

Thank you in advance for any pointers you can provide.

-sujit
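
P.S. In case it helps to see what I am planning to try next: below is a rough sketch of my driver setup with the properties I found in the standalone-mode docs (spark.cores.max, spark.executor.cores, spark.default.parallelism). The app name is just a placeholder and I am not sure these are the right knobs, so please correct me if not.

    import org.apache.spark.{SparkConf, SparkContext}

    // My current guess at the parallelism knobs for standalone mode;
    // values reflect the cluster above (4 workers x 4 CPUs = 16 cores).
    val conf = new SparkConf()
      .setAppName("solr-enrichment")            // placeholder name
      .set("spark.cores.max", "16")             // total cores the app may use
      .set("spark.executor.cores", "4")         // concurrent tasks per executor
      .set("spark.default.parallelism", "16")   // default number of partitions
    val sc = new SparkContext(conf)

My hope is that with 4 cores per executor, each worker would run 4 tasks (and therefore 4 Solr clients) at a time, but I have not verified this yet.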