http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-join-on-large-RDDs-tp3864p4243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
A JVM can easily be limited in how much memory it uses with the -Xmx
parameter, but Python has no comparably first-class, built-in memory
limit. Maybe the memory limits aren't making it to the Python
executors.
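To illustrate the point above: on Unix, the closest thing to a -Xmx-style cap for a Python worker process is an OS-level address-space limit via the standard-library `resource` module. This is a minimal sketch, not anything Spark does itself; the 8 GiB figure is an arbitrary example value.

```python
import resource

GIB = 1024 ** 3

def cap_address_space(limit_bytes):
    """Cap this process's virtual address space (Unix only).

    Allocations beyond the soft limit raise MemoryError, loosely
    mimicking what -Xmx gives a JVM.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Never request more than the existing hard limit.
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

# Example: cap this process at (up to) 8 GiB of address space.
cap_address_space(8 * GIB)
```

A limit like this would have to be applied inside each Python worker process; setting it in the driver does nothing for the executors.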
What was your SPARK_MEM setting? The JVM below seems to be using 603201
I set SPARK_MEM in the driver process by setting
spark.executor.memory to 10G. Each machine had 32G of RAM and a
dedicated 32G spill volume. I believe all of the units are in pages,
and the page size is the standard 4K. There are 15 slave nodes in the
cluster and the sizes of the datasets I'm
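If those figures really are counts of standard 4 KiB pages, converting them to bytes is simple arithmetic. A quick sketch, using the 603201 figure quoted above purely as an example (its true unit is unclear from the truncated message):

```python
PAGE_SIZE = 4096  # standard 4 KiB page, as assumed above

def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a count of fixed-size pages to GiB."""
    return pages * page_size / 1024 ** 3

# If 603201 is a page count, it corresponds to roughly 2.3 GiB:
print(f"{pages_to_gib(603201):.2f} GiB")
```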
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:
I am running the latest version of PySpark branch-0.9 and having some
trouble with join.
One RDD is about 100G (25GB compressed and serialized in memory) with
130K records, the other RDD is about 10G (2.5G
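For context on what that join is doing: RDD.join pairs records by key, emitting one output record per matching (left, right) pair, much like the plain-Python hash-join sketch below (no Spark required; the data is made up). With only ~130K records spread over ~100G, individual records average close to a megabyte, so keys with many matches can blow up a single task's output.

```python
from collections import defaultdict

def inner_join(left, right):
    """Inner-join two sequences of (key, value) pairs, mirroring
    RDD.join semantics: emit (key, (lvalue, rvalue)) for every
    matching pair of left and right values."""
    table = defaultdict(list)
    for k, v in right:          # build a hash table on the right side
        table[k].append(v)
    for k, lv in left:          # probe it with the left side
        for rv in table[k]:
            yield k, (lv, rv)

pairs = list(inner_join([("a", 1), ("b", 2)],
                        [("a", "x"), ("a", "y")]))
# -> [("a", (1, "x")), ("a", (1, "y"))]
```

In Spark both sides are first shuffled so that equal keys land in the same partition, which is where large or skewed datasets tend to run into memory trouble.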