Hi All,
I have experienced some crashing behavior with join in PySpark. When I
attempt a join with 2000 partitions in the result, the join succeeds, but
when I use only 200 partitions in the result, the join fails with the
message "Job aborted due to stage failure: Master removed our application: ..."
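A made-up sketch of the pattern, for reference (rdd_a, rdd_b, and the sizes are placeholders, not my actual job; the real data was large enough that 200 partitions put too much into each reduce task):

    # Placeholder reproduction of the failing pattern, not the actual job.
    from pyspark import SparkContext

    sc = SparkContext(appName="join-repro")
    rdd_a = sc.parallelize([(i % 1000, "a" * 100) for i in range(10**5)])
    rdd_b = sc.parallelize([(i % 1000, "b" * 100) for i in range(10**5)])

    rdd_a.join(rdd_b, numPartitions=2000).count()  # succeeds
    rdd_a.join(rdd_b, numPartitions=200).count()   # on the real data: fails,
                                                   # each task's share exceeds
                                                   # Python worker memory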
In PySpark, the data processed by each reduce task needs to fit in memory
within the Python process, so you should use more tasks to process this
dataset. Data is not spilled to disk across tasks.
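To put rough numbers on it (the figures below are invented, purely to show the per-task arithmetic):

    # Invented figures, only to illustrate why more tasks help: each reduce
    # task must hold its whole share of the shuffled data in Python memory.
    shuffled_bytes = 100 * 1024**3   # suppose the shuffled data is ~100 GB
    for n_tasks in (200, 2000):
        per_task_mb = shuffled_bytes / n_tasks / 1024**2
        print("%d tasks -> ~%.0f MB per task" % (n_tasks, per_task_mb))
    # 200 tasks  -> ~512 MB per task (can exceed the Python worker's memory)
    # 2000 tasks -> ~51 MB per task  (fits comfortably)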
I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this —
it’s something we’ve been meaning to look at.
Hi Matei,
Thanks for the reply and for creating the JIRA. I hear what you're saying,
but to be clear, I still want to point out that each reduce task seems to be
loading significantly more data than just the records needed for that task.
The workers appear to load all of the data from each block.
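One way to check this (a hypothetical diagnostic; 'joined' stands in for the result of the join above):

    # Hypothetical diagnostic: count the records in each reduce partition to
    # see how much data a single task actually handles.
    sizes = joined.mapPartitions(lambda part: [sum(1 for _ in part)]).collect()
    print(max(sizes), sum(sizes) / float(len(sizes)))  # largest vs. average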
I think the problem is that once unpacked in Python, the objects take
considerably more space, as they are stored as Python objects in a Python
dictionary. Take a look at python/pyspark/join.py and combineByKey in
python/pyspark/rdd.py. We should probably try to store these in serialized form.
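A minimal sketch of that idea (this is not the actual join.py or combineByKey code, just the shape of the change):

    # Sketch only, not the real PySpark code: keep values as pickled bytes
    # while grouping, and unpickle them only when a group is emitted, so the
    # grouping dict holds compact byte strings instead of live objects.
    import pickle
    from collections import defaultdict

    def group_by_key_serialized(pairs):
        groups = defaultdict(list)
        for k, v in pairs:
            groups[k].append(pickle.dumps(v, 2))       # store serialized
        for k, blobs in groups.items():
            yield k, [pickle.loads(b) for b in blobs]  # unpack on demand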