Brad: did you ever manage to figure this out? We're experiencing similar problems, and have also found that the memory limits supplied to the JVM side of PySpark don't constrain how much memory the Python workers can consume (which makes sense).
Have you profiled the datasets you are trying to join? Is there any "skew" in the data, where a handful of join key values occur far more often than the rest? Note that the join key in PySpark is hashed by default using Python's built-in `hash` function, which can behave unexpectedly for non-builtin values.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-join-on-large-RDDs-tp3864p4243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
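To illustrate the `hash` caveat above, here is a minimal sketch (class names are hypothetical, not from this thread): for a user-defined class that overrides neither `__eq__` nor `__hash__`, Python falls back to identity-based hashing, so two logically equal keys hash differently and would not be routed to the same partition by a hash-partitioned join. Defining value-based `__eq__` and `__hash__` avoids this.

```python
# Default hashing for user-defined classes is identity-based, so two
# logically equal instances usually get different hashes; a hash-
# partitioned join would treat them as distinct keys.

class RawKey:
    """No __hash__/__eq__ defined: falls back to object identity."""
    def __init__(self, value):
        self.value = value

class JoinKey:
    """Value-based __eq__ and __hash__: safe to use as a join key."""
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return isinstance(other, JoinKey) and self.value == other.value
    def __hash__(self):
        return hash(self.value)

a, b = RawKey(1), RawKey(1)
print(a == b)                                 # False: identity equality
print(JoinKey(1) == JoinKey(1))               # True: value equality
print(hash(JoinKey(1)) == hash(JoinKey(1)))   # True: consistent hashes
```

The same identity-hashing pitfall also hides skew: keys that look equal when printed may be spread across partitions, while a genuinely hot key value (the skew case above) piles all its rows onto a single reducer.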