[
https://issues.apache.org/jira/browse/PARQUET-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559333#comment-16559333
]
Yuming Wang commented on PARQUET-1359:
--------------------------------------
Is this a duplicate of
[PARQUET-980|https://issues.apache.org/jira/browse/PARQUET-980]?
> Out of Memory when reading large parquet file
> ---------------------------------------------
>
> Key: PARQUET-1359
> URL: https://issues.apache.org/jira/browse/PARQUET-1359
> Project: Parquet
> Issue Type: Bug
> Reporter: Ryan Sachs
> Priority: Major
>
> Hi,
> We are successfully reading parquet files block by block, and are running
> into a JVM out of memory issue in a certain edge case. Consider the following
> scenario:
> The Parquet file has a single column and a single 10 GB row group (block)
> Our JVM heap is 5 GB
> Is there any way to read such a file? Below is our implementation/stack trace
> {code:java}
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
>     at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
>
> try {
>     ParquetMetadata readFooter = ParquetFileReader.readFooter(hfsConfig, path,
>             ParquetMetadataConverter.NO_FILTER);
>     MessageType schema = readFooter.getFileMetaData().getSchema();
>     // largest row-group size in bytes across all blocks in the file
>     long a = readFooter.getBlocks().stream().reduce(0L,
>             (left, right) -> left > right.getTotalByteSize() ? left : right.getTotalByteSize(),
>             (leftl, rightl) -> leftl > rightl ? leftl : rightl);
>     for (BlockMetaData block : readFooter.getBlocks()) {
>         try {
>             fileReader = new ParquetFileReader(hfsConfig, readFooter.getFileMetaData(),
>                     path, Collections.singletonList(block), schema.getColumns());
>             PageReadStore pages;
>             // the exception is thrown here for blocks larger than the JVM heap
>             while (null != (pages = fileReader.readNextRowGroup())) {
>                 final long rows = pages.getRowCount();
>                 final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
>                 final RecordReader<Group> recordReader =
>                         columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
>                 for (int i = 0; i < rows; i++) {
>                     final Group group = recordReader.read();
>                     int fieldCount = group.getType().getFieldCount();
>                     for (int field = 0; field < fieldCount; field++) {
>                         int valueCount = group.getFieldRepetitionCount(field);
>                         Type fieldType = group.getType().getType(field);
>                         String fieldName = fieldType.getName();
>                         for (int index = 0; index < valueCount; index++) {
>                             // Process data
>                         }
>                     }
>                 }
>             }
>         } catch (IOException e) {
>             ...
>         } finally {
>             ...
>         }
>     }
> }{code}
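As an aside, the `long a` reduce in the snippet above computes the largest row-group size reported by the footer, which could be used to fail fast before `readNextRowGroup()` materializes a row group larger than the heap. A minimal sketch of that guard, using stand-in sizes rather than a real footer (the block sizes and class name here are hypothetical, not from the issue):

```java
import java.util.Arrays;
import java.util.List;

public class RowGroupGuard {
    // Largest row-group size in bytes; mirrors the reduce over
    // readFooter.getBlocks().getTotalByteSize() in the snippet above,
    // but over plain longs so it runs without the Parquet jars.
    static long maxBlockBytes(List<Long> blockSizes) {
        return blockSizes.stream().reduce(0L, Math::max);
    }

    public static void main(String[] args) {
        // Stand-in sizes: 128 MB, 10 GB, 256 MB row groups.
        List<Long> sizes = Arrays.asList(128L << 20, 10L << 30, 256L << 20);
        long largest = maxBlockBytes(sizes);
        long heap = Runtime.getRuntime().maxMemory();
        System.out.println(largest);
        if (largest > heap) {
            // Skip or report rather than letting readNextRowGroup() OOM.
            System.out.println("largest row group exceeds the JVM heap");
        }
    }
}
```

This only avoids the crash; it does not make a 10 GB single-column row group readable in a 5 GB heap, which is the limitation the issue is about.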
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)