[
https://issues.apache.org/jira/browse/PARQUET-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559333#comment-16559333
]
Yuming Wang commented on PARQUET-1359:
--------------------------------------
Is this a duplicate of
[PARQUET-980|https://issues.apache.org/jira/browse/PARQUET-980]?
> Out of Memory when reading large parquet file
> ---------------------------------------------
>
> Key: PARQUET-1359
> URL: https://issues.apache.org/jira/browse/PARQUET-1359
> Project: Parquet
> Issue Type: Bug
> Reporter: Ryan Sachs
> Priority: Major
>
> Hi,
> We are successfully reading parquet files block by block, and are running
> into a JVM out of memory issue in a certain edge case. Consider the following
> scenario:
> The Parquet file has a single column and a single 10 GB row group (block)
> Our JVM heap is 5 GB
> Is there any way to read such a file? Below is our implementation/stack trace
> {code:java}
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
>     at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
>
> try {
>     ParquetMetadata readFooter = ParquetFileReader.readFooter(hfsConfig, path,
>             ParquetMetadataConverter.NO_FILTER);
>     MessageType schema = readFooter.getFileMetaData().getSchema();
>     // largest row-group size in bytes across all blocks in the file
>     long a = readFooter.getBlocks().stream().reduce(0L,
>             (left, right) -> left > right.getTotalByteSize() ? left : right.getTotalByteSize(),
>             (leftl, rightl) -> leftl > rightl ? leftl : rightl);
>     for (BlockMetaData block : readFooter.getBlocks()) {
>         try {
>             fileReader = new ParquetFileReader(hfsConfig, readFooter.getFileMetaData(),
>                     path, Collections.singletonList(block), schema.getColumns());
>             PageReadStore pages;
>             // the exception is thrown here for blocks larger than the JVM heap
>             while (null != (pages = fileReader.readNextRowGroup())) {
>                 final long rows = pages.getRowCount();
>                 final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
>                 final RecordReader<Group> recordReader =
>                         columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
>                 for (int i = 0; i < rows; i++) {
>                     final Group group = recordReader.read();
>                     int fieldCount = group.getType().getFieldCount();
>                     for (int field = 0; field < fieldCount; field++) {
>                         int valueCount = group.getFieldRepetitionCount(field);
>                         Type fieldType = group.getType().getType(field);
>                         String fieldName = fieldType.getName();
>                         for (int index = 0; index < valueCount; index++) {
>                             // Process data
>                         }
>                     }
>                 }
>             }
>         } catch (IOException e) {
>             ...
>         } finally {
>             ...
>         }
>     }
> }{code}
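As an aside, the `long a` reduce in the snippet above computes the largest row-group size reported by the footer, which could be used to fail fast before `readNextRowGroup()` materializes a row group larger than the heap. A minimal sketch of that guard, using stand-in sizes rather than a real footer (the block sizes and class name here are hypothetical, not from the issue):

```java
import java.util.Arrays;
import java.util.List;

public class RowGroupGuard {
    // Largest row-group size in bytes; mirrors the reduce over
    // readFooter.getBlocks().getTotalByteSize() in the snippet above,
    // but over plain longs so it runs without the Parquet jars.
    static long maxBlockBytes(List<Long> blockSizes) {
        return blockSizes.stream().reduce(0L, Math::max);
    }

    public static void main(String[] args) {
        // Stand-in sizes: 128 MB, 10 GB, 256 MB row groups.
        List<Long> sizes = Arrays.asList(128L << 20, 10L << 30, 256L << 20);
        long largest = maxBlockBytes(sizes);
        long heap = Runtime.getRuntime().maxMemory();
        System.out.println(largest);
        if (largest > heap) {
            // Skip or report rather than letting readNextRowGroup() OOM.
            System.out.println("largest row group exceeds the JVM heap");
        }
    }
}
```

This only avoids the crash; it does not make a 10 GB single-column row group readable in a 5 GB heap, which is the limitation the issue is about.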
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)