I'm using Spark to read data from many files and write it back out in Parquet format for ease of use later on. Currently, I'm using this code:
val fnamesRDD = sc.parallelize(fnames,
  ceil(fnames.length.toFloat / numfilesperpartition).toInt)
val results = fnamesRDD.mapPartitionsWithIndex(
  (index, fnames) => extractData(fnames, variablenames, index))
results.toDF.saveAsParquetFile(outputdir)

extractData returns an array of tuples of (filename, array of floats), one for each file in the partition. (A simplified sketch of extractData is at the bottom of this message, in case the details matter.) Each partition produces about 0.6 GB of data, corresponding to just 3 files per partition.

The problem is that I have hundreds of files to convert, and saveAsParquetFile apparently tries to pull the data from too many of the conversion tasks onto the driver at once, which causes an OOM. For example, I got an error about it trying to pull down more than 4 GB of data, corresponding to 9 tasks, onto the driver. I could increase the driver memory, but that wouldn't help if saveAsParquetFile then decided to pull in 100 tasks at a time.

Is there a way to avoid this OOM error?
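
For reference, extractData has roughly this shape. This is heavily simplified: the real file parsing is replaced by a placeholder array, and the exact types are my guess at what mapPartitionsWithIndex expects (an iterator of filenames in, an iterator of tuples out). The SQLContext setup is only included because .toDF needs it.

// Assumed setup so that .toDF works on an RDD of tuples
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Simplified stand-in for extractData: emits one (filename, floats) tuple
// per file in the partition. The real version opens each file and reads
// out the requested variables; here that's just a placeholder array.
def extractData(fnames: Iterator[String],
                variablenames: Seq[String],
                index: Int): Iterator[(String, Array[Float])] =
  fnames.map { fname =>
    val values = Array.fill[Float](variablenames.length)(0.0f) // placeholder
    (fname, values)
  }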
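
(And to be clear, by "increase the driver memory" I just mean the usual spark-submit route, e.g. passing --driver-memory 8g, where 8g is an arbitrary example.)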