[ https://issues.apache.org/jira/browse/SPARK-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368578#comment-14368578 ]
Chaozhong Yang commented on SPARK-5387:
---------------------------------------

We can locate the bug in parquet-hadoop from the exception stack trace. That is to say, it may be more appropriate to file this issue against the Parquet project.

> parquet writer runs into OOM during writing when number of rows is large
> ------------------------------------------------------------------------
>
>                 Key: SPARK-5387
>                 URL: https://issues.apache.org/jira/browse/SPARK-5387
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.1.1
>            Reporter: Shirley Wu
>
> When the number of records in the RDD is large, saveAsParquetFile runs into an OOM.
> Here is the stack trace:
>
> 15/01/23 10:00:02 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, hdc2-s3.niara.com): java.lang.OutOfMemoryError: Java heap space
>         parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
>         parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
>         parquet.column.values.rle.RunLengthBitPackingHybridEncoder.<init>(RunLengthBitPackingHybridEncoder.java:125)
>         parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.<init>(RunLengthBitPackingHybridValuesWriter.java:36)
>         parquet.column.ParquetProperties.getColumnDescriptorValuesWriter(ParquetProperties.java:61)
>         parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:73)
>         parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
>         parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
>         parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:124)
>         parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:315)
>         parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:106)
>         parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:126)
>         parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:117)
>         parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>         parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>         org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:303)
>         org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>         org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:745)
>
> It seems the writeShard() API needs to flush to disk periodically.
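The trace bottoms out in CapacityByteArrayOutputStream.initSlabs, which is the writer pre-allocating in-memory buffers: parquet-hadoop holds up to one row group (parquet.block.size bytes) in heap per open writer before checkBlockSizeReached flushes it. A rough back-of-envelope sketch of that pressure, where the writer count is a hypothetical assumption (not a value from this report), looks like:

```java
// Illustrative heap estimate for Parquet write buffering. The 128 MB figure
// is the parquet-hadoop default row-group size (parquet.block.size); the
// number of concurrent writers is a hypothetical assumption for this sketch.
public class ParquetHeapEstimate {
    public static void main(String[] args) {
        long blockSizeBytes = 128L * 1024 * 1024; // parquet.block.size default
        int openWriters = 8;                      // hypothetical concurrent task writers
        long estimateMb = blockSizeBytes * openWriters / (1024 * 1024);
        System.out.println("~" + estimateMb + " MB of heap held in write buffers");
        // Hedged mitigation, not a confirmed fix: lowering parquet.block.size
        // (e.g. to 32 MB) in the job's Hadoop Configuration makes row groups
        // flush to disk sooner, shrinking each writer's in-memory buffer.
    }
}
```

Under these assumed numbers the buffers alone approach ~1 GB per executor, before any other task memory, which is consistent with the reported heap-space error on wide or many-writer workloads.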