Out of memory HDFS Multiple Cluster Write

Ruijing Li Fri, 20 Dec 2019 00:35:24 -0800

Hi all,

I have encountered a strange executor OOM error. I have a data pipeline
using Spark 2.3 and Scala 2.11.12. The pipeline writes its output to one
HDFS location as Parquet, then reads the files back in and writes them to
multiple Hadoop clusters (all co-located in the same datacenter). It should
be a very simple task, but executors are being killed for exceeding their
container memory thresholds. According to the logs, they exceed the memory
they were given (we use Mesos as the cluster manager).


The ETL process, which does joins and adds columns, works perfectly fine
with the given resources, and the output is written successfully the first
time. *Only when the pipeline at the end reads the output from HDFS and
writes it to different HDFS cluster paths does it fail.* (It does a
spark.read.parquet(source).write.parquet(dest).)
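
For reference, the failing step looks roughly like the sketch below. The
object name, app name, and all paths are placeholders made up for
illustration (the real ones are not shown here), and looping over the
destinations one at a time is an assumption about how the copy is driven.

```scala
import org.apache.spark.sql.SparkSession

object CopyToClusters {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("copy-to-clusters") // placeholder app name
      .getOrCreate()

    // Placeholder paths: the actual source and destinations are not given.
    val source = "hdfs://cluster-a/path/to/output"
    val dests = Seq(
      "hdfs://cluster-b/path/to/output",
      "hdfs://cluster-c/path/to/output"
    )

    // Read the Parquet output back in and write it to each destination cluster.
    dests.foreach { dest =>
      spark.read.parquet(source).write.parquet(dest)
    }

    spark.stop()
  }
}
```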

This doesn't really make sense to me, and I'm wondering which
configurations I should start looking at.

-- 
Cheers,
Ruijing Li
