Re: out of memory error with Parquet

2015-11-13 Thread Josh Rosen
Tip: jump straight to 1.5.2; it has some key bug fixes.
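If you're building with sbt, that should just be a dependency version bump; a
rough sketch (adjust the artifacts and scopes to your own build):

  // build.sbt sketch -- assumes the usual provided-scope Spark dependencies
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
    "org.apache.spark" %% "spark-sql"  % "1.5.2" % "provided"
  )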

Sent from my phone

> On Nov 13, 2015, at 10:02 PM, AlexG  wrote:
> 
> Never mind; when I switched to Spark 1.5.0, my code works as written and is
> pretty fast! Looking at some Parquet-related Spark JIRAs, it seems that
> Parquet is known to have some memory issues with buffering and writing, and
> that at least some of those were resolved in Spark 1.5.0.

Re: out of memory error with Parquet

2015-11-13 Thread AlexG
Never mind; when I switched to Spark 1.5.0, my code works as written and is
pretty fast! Looking at some Parquet-related Spark JIRAs, it seems that
Parquet is known to have some memory issues with buffering and writing, and
that at least some of those were resolved in Spark 1.5.0.

out of memory error with Parquet

2015-11-13 Thread AlexG
I'm using Spark to read in data from many files and write it back out in
Parquet format for ease of use later on. Currently, I'm using this code:

  // assumes `import sqlContext.implicits._` is in scope for .toDF
  val fnamesRDD = sc.parallelize(fnames,
    math.ceil(fnames.length.toFloat / numfilesperpartition).toInt)
  val results = fnamesRDD.mapPartitionsWithIndex((index, fnames) =>
    extractData(fnames, variablenames, index))
  results.toDF.saveAsParquetFile(outputdir)

extractData returns an array of (filename, array of floats) tuples, one per
file in the partition. Each partition produces about 0.6 GB of data,
corresponding to just 3 files per partition.
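For context, its shape is roughly the following (a simplified sketch; the
actual file parsing is omitted and the exact types are a guess from the
description above):

  // Simplified sketch of extractData; the real file parsing is omitted and
  // the parameter types are only illustrative.
  // mapPartitionsWithIndex hands it an Iterator over the partition's
  // filenames, and it yields one (filename, values) tuple per file.
  // (Returning an Iterator so it plugs straight into mapPartitionsWithIndex;
  // an Array works too via .iterator.)
  def extractData(fnames: Iterator[String],
                  variablenames: Seq[String],
                  partitionIndex: Int): Iterator[(String, Array[Float])] =
    fnames.map { fname =>
      val values: Array[Float] = ??? // read fname, pull out `variablenames`
      (fname, values)
    }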

The problem is that I have hundreds of files to convert, and saveAsParquetFile
apparently pulls data from too many of the conversion tasks onto the driver at
a time, which causes an OOM. For example, I got an error about it trying to
pull more than 4 GB of data, corresponding to 9 tasks, onto the driver. I could
increase the driver memory, but that wouldn't help if saveAsParquetFile then
decided to pull in 100 tasks at a time.
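One workaround might be to do the conversion in smaller batches, along the
lines of this untested sketch (the batch size and per-batch output directories
are arbitrary):

  // Untested sketch: convert and save in smaller batches so fewer task
  // results are in flight at once; each batch gets its own output directory.
  fnames.grouped(30).zipWithIndex.foreach { case (batch, i) =>
    val rdd = sc.parallelize(batch,
      math.ceil(batch.length.toFloat / numfilesperpartition).toInt)
    rdd.mapPartitionsWithIndex((index, fs) => extractData(fs, variablenames, index))
      .toDF
      .saveAsParquetFile(s"$outputdir/batch-$i")
  }

That splits the output across many Parquet directories, though, which is not
ideal.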

Is there a way to avoid this OOM error?



