I'm using Spark to read in data from many files and write it back out in
Parquet format for ease of use later on. Currently, I'm using this code:

 // one partition per group of `numfilesperpartition` input files
 val fnamesRDD = sc.parallelize(fnames,
   math.ceil(fnames.length.toFloat / numfilesperpartition).toInt)
 val results = fnamesRDD.mapPartitionsWithIndex((index, fnames) =>
   extractData(fnames, variablenames, index))
 results.toDF.saveAsParquetFile(outputdir)

extractData returns (filename, array of floats) tuples, one per file in the
partition. Each partition produces about 0.6 GB of data, corresponding to
just 3 files per partition.
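
Roughly, extractData has the following shape (simplified sketch; the real
file-parsing logic is omitted and the placeholder values are just
illustrative; it returns the partition's tuples as an iterator so it plugs
straight into mapPartitionsWithIndex):

 // Sketch of extractData's shape -- the real implementation parses the
 // input files; the dummy values here are placeholders.
 def extractData(fnames: Iterator[String],
                 variablenames: Seq[String],
                 partitionIndex: Int): Iterator[(String, Array[Float])] = {
   fnames.map { fname =>
     // placeholder: the real code extracts `variablenames` from the file
     val values: Array[Float] = Array.fill(variablenames.length)(0.0f)
     (fname, values)
   }
 }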

The problem is that I have hundreds of files to convert, and
saveAsParquetFile apparently tries to pull the data from too many of the
conversion tasks onto the driver at once, which causes an OOM. For example,
I got an error about it trying to pull down more than 4 GB of data,
corresponding to 9 tasks, onto the driver. I could increase the driver
memory, but that wouldn't help if saveAsParquetFile then decided to pull in
100 tasks at a time.
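
(For concreteness, by increasing the driver memory I mean something like

 spark-submit --driver-memory 8g ...

or the equivalent spark.driver.memory setting in conf/spark-defaults.conf;
the 8g is just an example figure.)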

Is there a way to avoid this OOM error?



