Re: Spark DataFrames use too many partitions

2015-08-12 Thread Al M
DataFrame parallelism is currently controlled through the configuration option spark.sql.shuffle.partitions; the default value is 200. I have raised an improvement JIRA to make it possible to specify the number of partitions: https://issues.apache.org/jira/browse/SPARK-9872
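A minimal sketch of changing this option in Spark 1.x, assuming an existing SparkContext named sc; the value 64 is purely illustrative:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Lower the number of post-shuffle partitions from the default of 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")

Anything that triggers a shuffle on a DataFrame (joins, aggregations) will then produce 64 partitions instead of 200.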

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Al M
Thanks for the suggestion. Repartition didn't help us, unfortunately; it still puts everything into the same partition. We did manage to improve the situation by writing a new partitioner that extends HashPartitioner and treats certain exception keys differently. These are keys that are known to
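For readers wondering what such a partitioner could look like, here is a sketch of the idea (not the author's exact code); the hotKeys set and the spreading scheme are assumptions:

    import org.apache.spark.HashPartitioner

    // Pin known skewed ("exception") keys to their own evenly spaced
    // partition ids; everything else falls through to the normal hash.
    class SkewAwarePartitioner(partitions: Int, hotKeys: Seq[Any])
        extends HashPartitioner(partitions) {
      private val hotIndex: Map[Any, Int] =
        hotKeys.zipWithIndex.map { case (k, i) =>
          k -> (i * partitions / math.max(hotKeys.size, 1))
        }.toMap

      override def getPartition(key: Any): Int =
        hotIndex.getOrElse(key, super.getPartition(key))
    }

Used as rdd.partitionBy(new SkewAwarePartitioner(200, Seq("KNOWN_HOT_KEY"))), this keeps every key's placement deterministic (so joins still work) while steering the heavy keys away from one another.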

Re: Limit Spark Shuffle Disk Usage

2015-06-17 Thread Al M
Thanks Himanshu and RahulKumar! The Databricks forum post was extremely useful. It is great to see an article that clearly details how and when shuffles are cleaned up.

Shuffle produces one huge partition and many tiny partitions

2015-06-17 Thread Al M
I have 2 RDDs I want to join. We will call them RDD A and RDD B. RDD A has 1 billion rows; RDD B has 100k rows. I want to join them on a single key. 95% of the rows in RDD A have the same key to join with RDD B. Before I can join the two RDDs, I must map them to tuples where the first element
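A small reconstruction of that setup, with illustrative stand-in data (the real records and key function are not shown in the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("skewed-join-demo").setMaster("local[*]"))

    // ~95% of RDD A's rows share the key "hot".
    val rddA = sc.parallelize(1 to 1000000).map { i =>
      val key = if (i % 100 < 95) "hot" else s"key-${i % 100}"
      (key, i)
    }

    // RDD B is a small lookup table, one row per key.
    val rddB = sc.parallelize("hot" +: (0 until 100).map(i => s"key-$i"))
      .map(k => (k, k.length))

    // Hash partitioning sends every "hot" row to one partition, so the
    // join produces one huge partition and many tiny ones.
    val joined = rddA.join(rddB)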

Limit Spark Shuffle Disk Usage

2015-06-11 Thread Al M
I am using Spark on a machine with limited disk space. I am using it to analyze very large (100GB to 1TB per file) data sets stored in HDFS. When I analyze these datasets, I will run groups, joins and cogroups. All of these operations mean lots of shuffle files written to disk. Unfortunately
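For context, the knobs most often involved here are sketched below with illustrative values (the path is hypothetical); they redirect and shrink shuffle output rather than hard-capping it:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-disk-demo")
      // Write shuffle files to a volume with more headroom.
      .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")
      // Spark 1.x: consolidate map outputs into fewer, larger files.
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)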

Re: Spark Driver Host under Yarn

2015-02-09 Thread Al M
Yarn-cluster. When I run in yarn-client, the driver is just run on the machine that runs spark-submit.

Re: getting error when submit spark with master as yarn

2015-02-09 Thread Al M
Open up 'yarn-site.xml' in your Hadoop configuration. You want to add entries for yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb. Have a look here for details on how they work:
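A sketch of the corresponding yarn-site.xml entries; the memory values are illustrative and should be sized to your machines:

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <!-- Total memory YARN may hand out on each node (here 16 GB). -->
      <value>16384</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <!-- Largest single container YARN will grant (here 8 GB). -->
      <value>8192</value>
    </property>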

Spark Driver Host under Yarn

2015-02-06 Thread Al M
I'm running Spark 1.2 with Yarn. My logs show that my executors are failing to connect to my driver. This is because they are using the wrong hostname. Since I'm running with Yarn, I can't set spark.driver.host as explained in SPARK-4253. So it should come from my HDFS configuration. Do you

Re: Testing if an RDD is empty?

2015-01-15 Thread Al M
You can also check rdd.partitions.size. This will be 0 for empty RDDs and greater than 0 for RDDs with data.
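A minimal sketch of the check, assuming an existing SparkContext named sc:

    val empty    = sc.emptyRDD[Int]
    val nonEmpty = sc.parallelize(Seq(1, 2, 3))

    empty.partitions.size      // 0
    nonEmpty.partitions.size   // > 0

Note that this is a structural test: an RDD that merely filtered away all of its data still keeps its parent's partitions, so it would not report 0 here.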

Spark 1.2 Release Date

2014-12-18 Thread Al M
Is there a planned release date for Spark 1.2? I saw on the Spark Wiki (https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage) that we are already in the latter part of the release window.

Re: Spark 1.2 Release Date

2014-12-18 Thread Al M
Awesome. Thanks!

Failed fetch: Could not get block(s)

2014-12-03 Thread Al M
I am using Spark 1.1.1. I am seeing an issue that only appears when I run in standalone clustered mode with at least 2 workers. The workers are on separate physical machines. I am performing a simple join on 2 RDDs. After the join I run first() on the joined RDD (in Scala) to get the first
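A minimal reconstruction of the failing sequence, with illustrative data and a hypothetical standalone master URL:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf()
      .setAppName("join-first-repro")
      .setMaster("spark://master-host:7077"))  // hypothetical master

    val left  = sc.parallelize(1 to 1000).map(i => (i, s"left-$i"))
    val right = sc.parallelize(1 to 1000).map(i => (i, s"right-$i"))

    // The join shuffles blocks between the two workers; first() then
    // fetches a single row, which is where the failed block fetch appeared.
    println(left.join(right).first())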