Re: trouble with join on large RDDs

2014-04-14 Thread Harry Brundage
http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-join-on-large-RDDs-tp3864p4243.html

Re: trouble with join on large RDDs

2014-04-09 Thread Andrew Ash
A JVM can easily be limited in how much memory it uses with the -Xmx parameter, but Python has no comparably first-class, built-in memory limit. Maybe the memory limits aren't making it to the Python executors. What was your SPARK_MEM setting? The JVM below seems to be using 603201
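To illustrate the gap described above: a JVM gets a hard heap cap via -Xmx, whereas a Python worker would have to cap itself at the OS level, e.g. with the Unix-only `resource` module. This is a hypothetical sketch of such a cap, not anything Spark itself does:

```python
import resource

def cap_worker_memory(limit_bytes):
    """Cap this process's virtual address space to limit_bytes, a rough
    analogue of the JVM's -Xmx: allocations past the cap raise
    MemoryError instead of growing without bound (Unix only)."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Soft limit may be lowered freely; the hard limit stays in place.
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    return resource.getrlimit(resource.RLIMIT_AS)[0]
```

The soft limit can be raised again later (up to the hard limit), so a worker could apply this per task.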

Re: trouble with join on large RDDs

2014-04-09 Thread Brad Miller
I set SPARK_MEM in the driver process by setting spark.executor.memory to 10G. Each machine had 32G of RAM and a dedicated 32G spill volume. I believe all of the units are in pages, and the page size is the standard 4K. There are 15 slave nodes in the cluster and the sizes of the datasets I'm
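Taking the units above at face value (counts of standard 4 KiB pages), the raw figures convert to bytes with simple arithmetic. A small helper for sanity-checking such numbers, assuming they really are 4 KiB pages:

```python
PAGE_SIZE = 4096  # standard 4 KiB page, as assumed in the thread

def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30

# The 603201 figure quoted earlier, read as 4 KiB pages:
# 603201 * 4096 bytes ≈ 2.30 GiB
```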

Re: trouble with join on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: I am running the latest version of PySpark branch-0.9 and having some trouble with join. One RDD is about 100G (25GB compressed and serialized in memory) with 130K records, the other RDD is about 10G (2.5G
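For reference, an RDD join in PySpark pairs up values by key, producing (key, (left_value, right_value)) for every matching combination. A minimal pure-Python sketch of that inner-join semantics (a hypothetical helper, not Spark code) makes clear why skewed keys can blow up the output of a join on large RDDs:

```python
from collections import defaultdict

def hash_join(left, right):
    """Inner join of two (key, value) sequences, mirroring RDD.join
    semantics: yield (key, (lv, rv)) for every matching pair of values.
    A key with m left values and n right values emits m * n records."""
    table = defaultdict(list)
    for k, v in left:
        table[k].append(v)
    for k, w in right:
        for v in table.get(k, ()):
            yield (k, (v, w))

# sorted(hash_join([("a", 1), ("b", 2)], [("a", 10), ("c", 30)]))
# → [("a", (1, 10))]
```

In Spark the same multiplication happens per key inside a shuffle partition, which is why a handful of hot keys can dominate memory during a large join.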