Parquet reads in Spark need a lot of temporary heap memory due to
ColumnVectors and the write block size. See a similar issue:
https://jira.snappydata.io/browse/SNAP-3111
In addition, writes also consume a significant amount of heap due to
parquet.block.size. One solution is to reduce spark.executor.cores.
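For example, this is the kind of tuning I mean, as a sketch (the values
are illustrative, and spark.executor.cores is usually set at submit time
rather than in code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Fewer concurrent tasks per executor means fewer ColumnVector
      // buffers alive at once, so less heap pressure per executor.
      .config("spark.executor.cores", "2")
      // Any spark.hadoop.* entry is copied into the Hadoop configuration.
      // A smaller Parquet row group (default 128 MB) shrinks the buffer
      // each write task holds in memory before flushing a block.
      .config("spark.hadoop.parquet.block.size", (64 * 1024 * 1024).toString)
      .getOrCreate()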
@Chris destPaths is just a Seq[String] that holds the paths we wish to copy
the output to. It does not work even if the collection holds only one path.
However, the job runs fine if we don’t copy the output: the pipeline
succeeds as read input -> perform logic as dataframe -> write output. As
for
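For reference, the copy step is roughly the sketch below (copyOutput and
outputPath are illustrative names, and the FileUtil-based copy is a
simplification, not our exact code):

    import org.apache.hadoop.fs.{FileUtil, Path}
    import org.apache.spark.sql.SparkSession

    def copyOutput(spark: SparkSession, outputPath: String,
                   destPaths: Seq[String]): Unit = {
      val conf  = spark.sparkContext.hadoopConfiguration
      val src   = new Path(outputPath)
      val srcFs = src.getFileSystem(conf)
      destPaths.foreach { dest =>
        val dst   = new Path(dest)
        val dstFs = dst.getFileSystem(conf)
        // Streams the already-written output to each destination; this
        // runs on the driver and does not involve the executors at all.
        FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource = */ false, conf)
      }
    }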
Does it work for just a single path input and single output?
Is the destPath a collection that is sitting on the driver?
On Sun, 22 Dec 2019, 7:59 pm Ruijing Li, wrote:
> I was experimenting and found something interesting. I have executor OOM
> even if I don’t write to remote clusters. So it is purely a dataframe read
> and write issue.
I was experimenting and found something interesting. I have executor OOM
even if I don’t write to remote clusters. So it is purely a dataframe read
and write issue
—
To recap, I have an ETL data pipeline that does some logic, repartitions to
reduce the number of files written, writes the output, and then copies it to
the destination paths.
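In code, the shape of the pipeline is roughly the following sketch (paths,
the transform, and the partition count are placeholders, not the real job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("etl-pipeline").getOrCreate()

    // read input -> perform logic as dataframe -> repartition -> write output
    val input  = spark.read.parquet("hdfs:///path/to/input")
    val result = input
      .filter(col("value").isNotNull)   // stand-in for the actual logic
      .repartition(50)                  // reduce the number of files written

    // the copy to destPaths happens after this write completes
    result.write.mode("overwrite").parquet("hdfs:///path/to/output")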