I think you did a good job of summarizing terminology and describing
Spark's operation. However, #7 is inaccurate if I am interpreting it correctly.
The scheduler schedules X tasks from the current stage across all
executors, where X is the number of cores assigned to the application
(assuming
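The scheduling arithmetic described above can be sketched numerically. This is a minimal illustration, not Spark's API; the executor count, cores, and partition count below are hypothetical numbers chosen for the example:

```python
# Illustrative sketch of Spark's task-wave arithmetic (not Spark's API).
import math

executors = 4             # hypothetical number of executors
cores_per_executor = 4    # hypothetical cores per executor
partitions = 40           # one task per partition in a stage

# X in the text above: tasks that can run simultaneously.
concurrent_tasks = executors * cores_per_executor

# How many scheduling "waves" the stage needs to finish all tasks.
waves = math.ceil(partitions / concurrent_tasks)

print(concurrent_tasks)  # 16
print(waves)             # 3
```

With more partitions than total cores, tasks simply queue up and run in successive waves.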
@Silvio: the mapPartitions call instantiates an HttpSolrServer, then for each
query string in the partition, sends the query to Solr using SolrJ and
gets back the top N results. It then reformats the result data into one
long string and returns the key-value pair (query string, result string).
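The one-client-per-partition pattern described above can be sketched without a live Spark cluster or Solr instance. This is a plain-Python simulation, assuming hypothetical names throughout: `fake_solr_query` stands in for the SolrJ call, and the partition is modeled as an ordinary iterator, as it would appear inside `mapPartitions`:

```python
# Sketch of the per-partition query pattern described above.
# fake_solr_query is a stand-in for a real SolrJ/HTTP call to Solr.

def fake_solr_query(client, query, top_n=3):
    # Pretend to query Solr and join the top N result ids into one string.
    return " ".join(f"{query}-doc{i}" for i in range(1, top_n + 1))

def query_partition(queries):
    # In Spark this function would be passed to rdd.mapPartitions:
    # one client is created per partition and reused for every query in it.
    client = object()  # stand-in for instantiating HttpSolrServer once
    for q in queries:
        yield (q, fake_solr_query(client, q))

partition = iter(["spark", "solr"])
print(list(query_partition(partition)))
# [('spark', 'spark-doc1 spark-doc2 spark-doc3'),
#  ('solr', 'solr-doc1 solr-doc2 solr-doc3')]
```

The point of the pattern is that the (expensive) client construction happens once per partition, not once per query.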
Hi Sujit,
From experimenting with Spark (and other documentation), my understanding
is as follows:
1. Each application consists of one or more Jobs
2. Each Job has one or more Stages
3. Each Stage creates one or more Tasks (normally, one Task per
Partition)
4. Master
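The hierarchy in points 1-3 can be modeled as a toy data structure. This is purely illustrative; the class names are hypothetical and are not Spark's internals:

```python
# Toy model of the Application -> Job -> Stage -> Task hierarchy above.
from dataclasses import dataclass

@dataclass
class Stage:
    partitions: int
    def tasks(self):
        # Normally, one task per partition.
        return [f"task-{i}" for i in range(self.partitions)]

@dataclass
class Job:
    stages: list

@dataclass
class Application:
    jobs: list

app = Application(jobs=[Job(stages=[Stage(partitions=8), Stage(partitions=4)])])
total_tasks = sum(len(s.tasks()) for j in app.jobs for s in j.stages)
print(total_tasks)  # 12
```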
Hi Sujit,
Can you spin it up with 4 (servers) * 4 (cores) = 16 cores, i.e. there should be 16
cores in your cluster, and try to use the same number of partitions? Also look at
http://apache-spark-user-list.1001560.n3.nabble.com/No-of-Task-vs-No-of-Executors-td23824.html
On Tue, Aug 4, 2015 at 1:46 AM, Ajay
What kind of cluster? How many cores on each worker? Is there any config for
the HTTP Solr client? I remember the standard HttpClient has a connection limit per route/host.
On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote:
No one has any ideas?
Is there some more information I should provide?
I am looking for ways to increase the parallelism among workers. Currently
I just see number of simultaneous connections to Solr equal to the number
of workers. My number of partitions is 2.5x larger than the number of
workers,
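The behavior observed above is consistent with parallelism being bounded by the total core count available to the application rather than by the partition count. A minimal sketch of that bound, using hypothetical numbers:

```python
# Illustrative: simultaneous tasks (and hence simultaneous Solr connections)
# are bounded by the total cores available to the app, not by partitions.
workers = 4
cores_per_worker = 1   # hypothetical: only 1 core effectively used per worker
partitions = 10        # 2.5x the number of workers

total_cores = workers * cores_per_worker
simultaneous_connections = min(total_cores, partitions)
print(simultaneous_connections)  # 4 -> matches "connections == workers"
```

If only one core per worker is being used, adding partitions cannot raise the number of concurrent connections; only raising the usable core count can.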
So how many cores did you configure per node?
Do you have something like --total-executor-cores or maybe
--num-executors configured? (I'm
not sure what kind of cluster the Databricks platform provides; if it's
standalone then the first option should be used.) If you have 4 cores in total,
then even though you have
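For reference, the two flags mentioned above are passed to spark-submit. This is a hedged sketch with placeholder values; the application class, jar name, and core counts are hypothetical:

```shell
# Standalone / Mesos: cap the total cores the application may use.
spark-submit --total-executor-cores 16 --class com.example.App app.jar

# YARN: request a fixed number of executors (and cores per executor) instead.
spark-submit --num-executors 4 --executor-cores 4 --class com.example.App app.jar
```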
Hi Igor,
The cluster is a Databricks Spark cluster. It consists of 1 master + 4
workers, each worker has 60GB RAM and 4 CPUs. The original mail has some
more details (also the reference to the HttpSolrClient in there should be
HttpSolrServer, sorry about that, mistake while writing the email).
Can you share the transformations up to the foreachPartition?
From: Sujit Pal sujitatgt...@gmail.com
Sent: 8/2/2015 4:42 PM
To: Igor Berman igor.ber...@gmail.com
Cc: user user@spark.apache.org
Subject: Re: How to increase parallelism of a Spark
I don't know if (your assertion/expectation that) workers will process things
(multiple partitions) in parallel is really valid, or if having more partitions
than workers will necessarily help (unless you are memory bound, so partitioning
is essentially helping your work size rather than
On 2 Aug 2015, at 13:42, Sujit Pal
sujitatgt...@gmail.com wrote:
There is no additional configuration on the external Solr host from my code; I
am using the default HttpClient provided by HttpSolrServer. According to the
Javadocs, you can pass in an HttpClient