The DataFrames parallelism is currently controlled through the configuration
option spark.sql.shuffle.partitions. The default value is 200.
I have raised an improvement JIRA to make it possible to specify the number
of partitions: https://issues.apache.org/jira/browse/SPARK-9872
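In the meantime, a minimal sketch of changing it by hand (the value 400 is
arbitrary, not a recommendation):

  // Scala, Spark 1.x: works on a SQLContext or HiveContext
  sqlContext.setConf("spark.sql.shuffle.partitions", "400")
  // equivalently, via SQL:
  sqlContext.sql("SET spark.sql.shuffle.partitions=400")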
--
Thanks for the suggestion. Repartition didn't help us, unfortunately. It
still puts everything into the same partition.
We did manage to improve the situation by making a new partitioner that
extends HashPartitioner. It treats certain exceptional keys differently.
These keys are known to
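The message is cut off here, but a minimal sketch of the idea, assuming the
skewed keys are known ahead of time (names and structure are illustrative,
not the poster's actual code):

  import org.apache.spark.HashPartitioner

  class SkewAwarePartitioner(partitions: Int, hotKeys: Seq[Any])
      extends HashPartitioner(partitions) {
    require(partitions > hotKeys.size)
    // Each known-skewed key gets its own reserved partition.
    private val reserved: Map[Any, Int] = hotKeys.zipWithIndex.toMap

    override def getPartition(key: Any): Int =
      reserved.getOrElse(key,
        // Everything else hashes into the remaining partitions.
        hotKeys.size + super.getPartition(key) % (partitions - hotKeys.size))
  }

A pair RDD can then use it via
rdd.partitionBy(new SkewAwarePartitioner(200, hotKeys)).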
Thanks Himanshu and RahulKumar!
The Databricks forum post was extremely useful. It is great to see an
article that clearly details how and when shuffles are cleaned up.
--
I have 2 RDDs I want to join. We will call them RDD A and RDD B. RDD A has
1 billion rows; RDD B has 100k rows. I want to join them on a single key.
95% of the rows in RDD A have the same key to join with RDD B. Before I can
join the two RDDs, I must map them to tuples where the first element
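With only 100k rows, RDD B will often fit in memory, so one standard
alternative (not necessarily what this thread settled on) is a map-side
broadcast join, which avoids shuffling the billion-row RDD at all and
sidesteps the skewed key. A sketch, with hypothetical record types and a
stand-in key() accessor:

  // Collect the small side and broadcast it to every executor.
  val smallMap: Map[String, RowB] = rddB.map(b => (key(b), b)).collect().toMap
  val smallMapBc = sc.broadcast(smallMap)

  // flatMap keeps inner-join semantics: rows of A with no match are dropped.
  val joined = rddA.flatMap { a =>
    smallMapBc.value.get(key(a)).map(b => (key(a), (a, b)))
  }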
I am using Spark on a machine with limited disk space. I am using it to
analyze very large (100 GB to 1 TB per file) datasets stored in HDFS. When I
analyze these datasets, I run groups, joins, and cogroups. All of these
operations mean lots of shuffle files written to disk.
Unfortunately
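The message is truncated here, but for context, two Spark 1.x-era settings
that influence where shuffle files land and how many get created (a sketch,
not a claimed fix for this poster's problem):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Put shuffle and spill files on the volume that has room (path is hypothetical).
    .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")
    // Consolidate map outputs into fewer shuffle files (a Spark 1.x option).
    .set("spark.shuffle.consolidateFiles", "true")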
Yarn-cluster. When I run in yarn-client, the driver just runs on the
machine that runs spark-submit.
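For reference, the distinction at submit time (Spark 1.x syntax):

  spark-submit --master yarn-cluster ...  # driver runs in a YARN container on the cluster
  spark-submit --master yarn-client ...   # driver runs on the machine doing the submit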
--
Open up 'yarn-site.xml' in your Hadoop configuration. You want to add
configuration entries for yarn.nodemanager.resource.memory-mb and
yarn.scheduler.maximum-allocation-mb. Have a look here for details on how
they work:
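For illustration, the entries look like this in yarn-site.xml (the memory
values are placeholders; size them to your nodes):

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>  <!-- total memory YARN may allocate on each node -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>   <!-- upper bound for any single container request -->
  </property>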
I'm running Spark 1.2 with Yarn. My logs show that my executors are failing
to connect to my driver. This is because they are using the wrong hostname.
Since I'm running with Yarn, I can't set spark.driver.host as explained in
SPARK-4253. So it should come from my HDFS configuration. Do you
You can also check rdd.partitions.size. This will be 0 for empty RDDs and
greater than 0 for RDDs with data.
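Note that an RDD can also have partitions that hold no data, so a more
robust sketch combines both checks (Spark 1.3+ also ships rdd.isEmpty()):

  def rddIsEmpty[T](rdd: org.apache.spark.rdd.RDD[T]): Boolean =
    rdd.partitions.size == 0 || rdd.take(1).isEmpty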
--
Is there a planned release date for Spark 1.2? I saw on the Spark Wiki
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage that we
are already in the latter part of the release window.
--
Awesome. Thanks!
--
I am using Spark 1.1.1. I am seeing an issue that only appears when I run in
standalone cluster mode with at least 2 workers. The workers are on
separate physical machines.
I am performing a simple join on 2 RDDs. After the join I run first() on
the joined RDD (in Scala) to get the first
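The message is cut off here, but a minimal reproduction of that shape of job
(made-up data, not the poster's) would be:

  import org.apache.spark.SparkContext._  // pair-RDD implicits (needed pre-1.3)

  val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
  val b = sc.parallelize(Seq((1, "b1"), (2, "b2")))
  val joined = a.join(b)   // RDD[(Int, (String, String))]
  println(joined.first())  // the first() call that triggers the described failure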