Brad: did you ever manage to figure this out? We're experiencing similar
problems, and have also found that the memory limits supplied to the Java
side of PySpark don't constrain how much memory Python can consume (which
makes sense, since the Python workers run as separate processes outside the
JVM heap).
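
For what it's worth, here's a minimal sketch of what we mean (the app name
and the "4g" value are just placeholders): spark.executor.memory only caps
the executor JVM heap, not the Python workers.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("join-memory-test")        # placeholder name
            .set("spark.executor.memory", "4g"))   # caps the JVM heap only
    sc = SparkContext(conf=conf)
    # The Python worker processes that actually run your lambdas are
    # separate OS processes, so this limit does not bound their memory use.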

Have you profiled the datasets you are trying to join? Is there any "skew"
in the data, where a handful of join key values occur far more often than
the rest? Note that, by default, PySpark partitions join keys using the
Python `hash` function, which can behave unexpectedly for non-builtin
values.
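
Two quick things you could try (sketches only; `left_rdd` and `MyKey` are
made-up names, not anything from your code). First, count the heaviest join
keys to see whether a few of them dominate:

    key_counts = left_rdd.map(lambda kv: (kv[0], 1)) \
                         .reduceByKey(lambda a, b: a + b)
    print(key_counts.takeOrdered(20, key=lambda kc: -kc[1]))

Second, if your keys are instances of your own class, check what `hash`
does with them; the default is identity-based, so equal-looking keys from
different records can hash differently and land in different partitions:

    class MyKey(object):
        def __init__(self, v):
            self.v = v

    # No __hash__/__eq__ defined, so hashing falls back to object identity.
    print(hash(MyKey(1)) == hash(MyKey(1)))   # almost always False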



