Parquet reads in Spark need a lot of temporary heap memory due to
ColumnVectors and the write block size. See a similar issue:
https://jira.snappydata.io/browse/SNAP-3111
In addition, writes also consume a significant amount of heap due to
parquet.block.size. One solution is to reduce spark.executor.cores.
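For example, this is the kind of tuning I mean, as a sketch (the values
are illustrative, and spark.executor.cores is usually set at submit time
rather than in code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Fewer concurrent tasks per executor means fewer ColumnVector
      // buffers alive at once, so less heap pressure per executor.
      .config("spark.executor.cores", "2")
      // Any spark.hadoop.* entry is copied into the Hadoop configuration.
      // A smaller Parquet row group (default 128 MB) shrinks the buffer
      // each write task holds in memory before flushing a block.
      .config("spark.hadoop.parquet.block.size", (64 * 1024 * 1024).toString)
      .getOrCreate()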
@Chris destPaths is just a Seq[String] that holds the paths we wish to copy
the output to. It does not work even if the collection holds only one path.
However, the job runs fine if we don’t copy the output: the pipeline
succeeds as read input -> perform logic as dataframe -> write output. As
for
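For reference, the copy step is roughly the sketch below (copyOutput and
outputPath are illustrative names, and the FileUtil-based copy is a
simplification, not our exact code):

    import org.apache.hadoop.fs.{FileUtil, Path}
    import org.apache.spark.sql.SparkSession

    def copyOutput(spark: SparkSession, outputPath: String,
                   destPaths: Seq[String]): Unit = {
      val conf  = spark.sparkContext.hadoopConfiguration
      val src   = new Path(outputPath)
      val srcFs = src.getFileSystem(conf)
      destPaths.foreach { dest =>
        val dst   = new Path(dest)
        val dstFs = dst.getFileSystem(conf)
        // Streams the already-written output to each destination; this
        // runs on the driver and does not involve the executors at all.
        FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource = */ false, conf)
      }
    }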
Does it work for just a single path input and single output?
Is the destPath a collection that is sitting on the driver?
On Sun, 22 Dec 2019, 7:59 pm Ruijing Li, wrote:
> I was experimenting and found something interesting. I have executor OOM
> even if I don’t write to remote clusters. So it is purely a dataframe read
> and write issue.
I was experimenting and found something interesting. I have executor OOM
even if I don’t write to remote clusters. So it is purely a dataframe read
and write issue
—
To recap, I have an ETL data pipeline that does some logic, repartitions to
reduce the number of files written, writes the output, and then copies it to
the destination paths.
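In code, the shape of the pipeline is roughly the following sketch (paths,
the transform, and the partition count are placeholders, not the real job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("etl-pipeline").getOrCreate()

    // read input -> perform logic as dataframe -> repartition -> write output
    val input  = spark.read.parquet("hdfs:///path/to/input")
    val result = input
      .filter(col("value").isNotNull)   // stand-in for the actual logic
      .repartition(50)                  // reduce the number of files written

    // the copy to destPaths happens after this write completes
    result.write.mode("overwrite").parquet("hdfs:///path/to/output")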