[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369698#comment-14369698 ]

Ryan Blue commented on PARQUET-222:
-----------------------------------

The problem is usually that Parquet buffers records in memory so it can write 
them as large row groups; by default, 128MB of heap is required for each open 
file. I'm not sure how Spark SQL is writing data, but multiple open files can 
easily cause an OOM exception. What is the heap space allocated to your task?
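
A sketch of one workaround, assuming the Spark 1.3-era API (SQLContext, 
jsonFile, saveAsParquetFile) and the parquet-mr config key 
"parquet.block.size": lower the row group size so each open file buffers less 
data before flushing. The 64MB value and the paths are illustrative, not 
recommendations.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-write"))
    // Halve the default 128MB row group size; each open Parquet file then
    // buffers roughly half as much data in memory before flushing.
    sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.jsonFile("hdfs:///path/to/input.json") // hypothetical path
    df.saveAsParquetFile("hdfs:///path/to/output.parquet")     // hypothetical path

Smaller row groups trade some scan efficiency for lower write-side memory; the 
alternative is simply raising the heap allocated to each task.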

> parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
> -------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Chaozhong Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In Spark SQL, there is a function `saveAsParquetFile` in DataFrame or 
> SchemaRDD. That function calls into parquet-mr, and sometimes it fails due 
> to an OOM error thrown by parquet-mr. The exception stack trace is as 
> follows:
> [WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space
>         at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
>         at parquet.column.values.dictionary.IntList.<init>(IntList.java:83)
>         at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85)
>         at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549)
>         at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
>         at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
>         at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
>         at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
>         at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
>         at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
>         at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
>         at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
>         at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
>         at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
>         at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> By the way, there is another similar issue: 
> https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has 
> closed it and marked it as resolved.
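
The trace above runs out of memory while allocating dictionary page buffers 
(IntList.initSlab via DictionaryValuesWriter), so a complementary mitigation, 
continuing the configuration sketch in the comment above, is to turn 
dictionary encoding off for the write. "parquet.enable.dictionary" is the 
parquet-mr config key; disabling it trades file size for write-side memory.

    // Continuing the sketch above: skip dictionary encoding entirely so the
    // per-column IntList slabs seen in this trace are never allocated.
    sc.hadoopConfiguration.setBoolean("parquet.enable.dictionary", false)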



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
