Hi,

This code hangs indefinitely at the last line (the .map()).
Interestingly, if I run the same code at the beginning of my application
(removing the .write step), it executes as expected. When the code appears
further along in my application, however, it hangs. The debugging message
"I saw a row" never appears in the executor's standard output.

Note that this error occurs when running on a YARN cluster, but not on a
standalone cluster or in local mode. I have tried running with num-cores=1
and a single executor.

I have been working on this for a long time, any clues would be appreciated.

Regards,


def map_to_keys(row):
    # Debug: should print on the executor once per row processed.
    print("I saw a row", row["id"])
    return (hash(row["id"]), row)

# Round-trip the DataFrame through ORC on disk, then re-read it
# and map each row to a (key, row) pair.
df.write.mode("overwrite").format("orc").save("/tmp/df_full")
df = spark.read.format("orc").load("/tmp/df_full")
rdd = df.rdd.map(map_to_keys)
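
For context, my understanding is that RDD.map is a lazy transformation, so map_to_keys should only execute once some action consumes the RDD — which is why I'd expect no output from the .map() line alone. A minimal pure-Python analogue (plain map/list, not Spark; the names here are just illustrative) of that laziness:

```python
# Sketch (not Spark): like RDD.map, Python's map() does nothing until
# something consumes it -- the rough analogue of a Spark action.

seen = []

def tag_row(row_id):
    seen.append(row_id)          # stands in for the print() in map_to_keys
    return (hash(row_id), row_id)

rows = [1, 2, 3]
mapped = map(tag_row, rows)      # lazy: tag_row has not run yet
assert seen == []                # no "I saw a row" equivalent so far

result = list(mapped)            # consuming it forces evaluation
assert seen == [1, 2, 3]
```

Given that, I'm surprised the hang surfaces at the .map() call itself rather than at a later action.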
