Re: How to increase parallelism of a Spark cluster?

2015-08-04 Thread Richard Marscher
I think you did a good job of summarizing the terminology and describing Spark's operation. However, #7 is inaccurate if I am interpreting it correctly. The scheduler schedules X tasks from the current stage across all executors, where X is the number of cores assigned to the application (assuming
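(To illustrate with a toy sketch of my own, assuming a standalone cluster where the application was granted 16 cores; not code from this thread:)

    // a stage with 40 partitions => 40 tasks, but at most 16 run concurrently
    val rdd = sc.parallelize(1 to 40, 40)
    rdd.foreach(_ => Thread.sleep(1000))  // watch the tasks complete in ~3 waves in the UI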

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Sujit Pal
@Silvio: the mapPartitions block instantiates an HttpSolrServer, then for each query string in the partition it sends the query to Solr using SolrJ and gets back the top N results. It then reformats the result data into one long string and returns the key-value pair as (query string, result string). In outline it looks like the sketch below.
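(A sketch only, assuming the SolrJ 4.x API; queriesRdd, the Solr URL, the row count, and the field used for formatting are placeholders based on the description above, not the actual job code:)

    import org.apache.solr.client.solrj.SolrQuery
    import org.apache.solr.client.solrj.impl.HttpSolrServer
    import scala.collection.JavaConverters._

    val results = queriesRdd.mapPartitions { queries =>
      // one SolrJ client per partition, reused for every query in it
      val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1")
      queries.map { q =>
        val sq = new SolrQuery(q).setRows(10)             // top N results
        val docs = solr.query(sq).getResults.asScala
        // reformat the result data into one long string
        val resultStr = docs.map(_.getFieldValue("id").toString).mkString("|")
        (q, resultStr)                                    // (query string, result string)
      }
    }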

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Ajay Singal
Hi Sujit, From experimenting with Spark (and other documentation), my understanding is as follows:
1. Each application consists of one or more Jobs
2. Each Job has one or more Stages
3. Each Stage creates one or more Tasks (normally, one Task per Partition)
4. Master
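(To make the hierarchy concrete, a toy example of my own, not from the mail above: one action triggers one Job, the shuffle in reduceByKey splits it into two Stages, and each Stage runs one Task per partition:)

    val rdd = sc.parallelize(1 to 1000, 8)   // 8 partitions
    val counts = rdd.map(x => (x % 10, 1))
                    .reduceByKey(_ + _)      // shuffle boundary => a second Stage
    counts.count()                           // one action => one Job
    // Job: Stage 0 (map side) = 8 tasks; Stage 1 (reduce side) = 8 tasks by default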

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread shahid ashraf
Hi Sujit, you have 4 (servers) * 4 (cores) = 16 cores, i.e. there should be 16 cores in your cluster; try using the same number of partitions, as in the sketch below. Also look at http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-td23824.html On Tue, Aug 4, 2015 at 1:46 AM, Ajay
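(A one-line sketch; queriesRdd stands in for whatever RDD feeds the Solr calls:)

    // match the partition count to the total core count: 4 workers * 4 cores = 16
    val repartitioned = queriesRdd.repartition(16)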

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
What kind of cluster? How many cores on each worker? Is there any config for the HTTP Solr client? I remember the standard HttpClient has a connection limit per route/host. On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote: No one has any ideas? Is there some more information I should provide? I am
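(For reference, a sketch of lifting that limit with Apache HttpClient 4.3+; exact defaults vary by version, but the per-route cap is small, on the order of 2 connections:)

    import org.apache.http.impl.client.HttpClients

    // allow more concurrent connections to the single Solr host
    val httpClient = HttpClients.custom()
      .setMaxConnPerRoute(32)
      .setMaxConnTotal(64)
      .build()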

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
No one has any ideas? Is there some more information I should provide? I am looking for ways to increase the parallelism among workers. Currently I see a number of simultaneous connections to Solr equal to the number of workers. My number of partitions is (2.5x) larger than the number of workers,

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
So how many cores do you configure per node? Do you have something like --total-executor-cores or maybe a --num-executors config (I'm not sure what kind of cluster the Databricks platform provides; if it's standalone, then the first option should be used)? If you have 4 cores in total, then even though you have
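(In standalone mode the same cap can also be set from code; a sketch, assuming 4 workers * 4 cores:)

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cores.max", "16")  // equivalent of --total-executor-cores 16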

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
Hi Igor, The cluster is a Databricks Spark cluster. It consists of 1 master + 4 workers; each worker has 60GB RAM and 4 CPUs. The original mail has some more details (also, the reference to HttpSolrClient in there should be HttpSolrServer; sorry about that, a mistake while writing the email).

RE: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Silvio Fiorito
Can you share the transformations up to the foreachPartition? From: Sujit Pal <sujitatgt...@gmail.com> Sent: 8/2/2015 4:42 PM To: Igor Berman <igor.ber...@gmail.com> Cc: user <user@spark.apache.org> Subject: Re: How to increase parallelism of a Spark

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Abhishek R. Singh
I don't know if (your assertion/expectation that) workers will process things (multiple partitions) in parallel is really valid, or if having more partitions than workers will necessarily help (unless you are memory bound, so that partitions are essentially helping your work size rather than

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Steve Loughran
On 2 Aug 2015, at 13:42, Sujit Pal <sujitatgt...@gmail.com> wrote: There is no additional configuration on the external Solr host from my code; I am using the default HttpClient provided by HttpSolrServer. According to the Javadocs, you can pass in an HttpClient
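(A sketch of that constructor in SolrJ 4.x, wiring in a client tuned as in Igor's earlier reply; the URL and limits are placeholders:)

    import org.apache.http.impl.client.HttpClients
    import org.apache.solr.client.solrj.impl.HttpSolrServer

    val httpClient = HttpClients.custom()
      .setMaxConnPerRoute(16)   // lift the per-route ceiling mentioned upthread
      .setMaxConnTotal(64)
      .build()
    // HttpSolrServer(baseURL, httpClient) uses the supplied client instead of the default
    val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1", httpClient)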

How to increase parallelism of a Spark cluster?

2015-07-31 Thread Sujit Pal
Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1 master + 4 workers; each worker has 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server, and it is accessed using code similar to that shown below. def