Re: out of memory error with Parquet
Tip: jump straight to 1.5.2; it has some key bug fixes.

Sent from my phone

> On Nov 13, 2015, at 10:02 PM, AlexG wrote:
> Never mind; when I switched to Spark 1.5.0, my code works as written and is
> pretty fast!

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: out of memory error with Parquet
Never mind; when I switched to Spark 1.5.0, my code works as written and is pretty fast! Looking at some Parquet-related Spark JIRAs, it seems that Parquet is known to have some memory issues with buffering and writing, and that at least some of them were resolved in Spark 1.5.0.

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/out-of-memory-error-with-Parquet-tp25381p25382.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
out of memory error with Parquet
I'm using Spark to read in data from many files and write it back out in Parquet format for ease of use later on. Currently, I'm using this code:

    val fnamesRDD = sc.parallelize(fnames,
      ceil(fnames.length.toFloat / numfilesperpartition).toInt)
    val results = fnamesRDD.mapPartitionsWithIndex((index, fnames) =>
      extractData(fnames, variablenames, index))
    results.toDF.saveAsParquetFile(outputdir)

extractData returns an array of (filename, array of floats) tuples corresponding to all the files in the partition. Each partition produces about 0.6 GB of data, corresponding to just 3 files per partition.

The problem is, I have hundreds of files to convert, and saveAsParquetFile apparently tries to pull the data from too many of the conversion tasks down onto the driver at a time, which causes an OOM. E.g., I got an error about it trying to pull down >4 GB of data, corresponding to 9 tasks, onto the driver. I could increase the driver memory, but that wouldn't help if saveAsParquetFile then decided to pull in 100 tasks at a time. Is there a way to avoid this OOM error?

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/out-of-memory-error-with-Parquet-tp25381.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
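For what it's worth, here is a sketch of how the same pipeline could look on the non-deprecated write path, assuming Spark 1.4+ (saveAsParquetFile was deprecated in 1.4 in favor of DataFrameWriter). The identifiers sc, fnames, numfilesperpartition, variablenames, extractData, and outputdir are carried over from the post; the explicit column names on toDF are my own guess at the schema, and this is an untested sketch, not a confirmed fix for the OOM.

```scala
// Sketch only: assumes a live SparkContext `sc` and the helpers defined in
// the original post (extractData, fnames, variablenames, etc.).
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Same partitioning as before: ceil(total files / files per partition),
// i.e. roughly 3 files (~0.6 GB of extracted data) per partition.
val numPartitions =
  math.ceil(fnames.length.toFloat / numfilesperpartition).toInt
val fnamesRDD = sc.parallelize(fnames, numPartitions)

// One extractData call per partition, as in the original code.
val results = fnamesRDD.mapPartitionsWithIndex((index, names) =>
  extractData(names, variablenames, index))

// DataFrameWriter API (Spark 1.4+); each task writes its own Parquet
// part-file into outputdir. Column names are assumed, not from the post.
results.toDF("filename", "values").write.parquet(outputdir)
```

Both APIs write part-files from the executors, so this alone may not explain the driver-side pull; the 1.5.x Parquet buffering fixes mentioned upthread are the more likely reason the rewrite "just worked".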