[ https://issues.apache.org/jira/browse/PARQUET-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483098#comment-14483098 ]
Alosh Bennett commented on PARQUET-244:
---------------------------------------
Code to reproduce the bug. It writes a Parquet file (writer version
PARQUET_2_0) with a single UTF-8 column of 100,000 random UUID strings, which
should be split across multiple pages. The bug can then be reproduced by
reading this file back, either from a sample program or via spark-shell.
{code:title=Bug.java|borderStyle=solid}
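// Imports assume parquet-mr 1.6.0, where classes live under parquet.*
// (later releases moved them to org.apache.parquet.*).
import java.io.IOException;
import java.util.HashMap;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.column.ParquetProperties;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.api.WriteSupport;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.io.api.Binary;
import parquet.io.api.RecordConsumer;
import parquet.schema.MessageTypeParser;

public class Bug {
    public static void main(String[] args) throws IOException {
        String parquetFile = "file:///home/abennett/parquet/bug/sample.par";
        // final so the anonymous WriteSupport below can capture it
        final String schema = "message Document { required binary message (UTF8); }";

        WriteSupport<Document> writeSupport = new WriteSupport<Document>() {
            RecordConsumer rec;

            @Override
            public WriteContext init(Configuration configuration) {
                return new WriteContext(MessageTypeParser.parseMessageType(schema),
                        new HashMap<String, String>());
            }

            @Override
            public void prepareForWrite(RecordConsumer recordConsumer) {
                rec = recordConsumer;
            }

            @Override
            public void write(Document document) {
                rec.startMessage();
                rec.startField("message", 0);
                rec.addBinary(Binary.fromString(document.message));
                rec.endField("message", 0);
                rec.endMessage();
            }
        };

        // Arguments after the codec: block size, page size, dictionary page size,
        // enableDictionary = true, validating = false, writer version.
        ParquetWriter<Document> writer = new ParquetWriter<Document>(
                new Path(parquetFile), writeSupport, CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE, true, false,
                ParquetProperties.WriterVersion.PARQUET_2_0);

        // 100,000 random UUID strings, enough data to span several pages
        Document doc = new Document();
        for (int i = 0; i < 100000; i++) {
            doc.message = UUID.randomUUID().toString();
            writer.write(doc);
        }
        writer.close();
    }

    private static class Document {
        String message;
    }
}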
{code}
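As a rough sanity check on why this data should span more than one page, here is a back-of-the-envelope estimate. It is only a sketch: `PageEstimate` and its methods are hypothetical helpers, not part of parquet-mr, and the 1 MiB page size is my assumption for the value of ParquetWriter.DEFAULT_PAGE_SIZE. It counts raw string bytes only, so the real page boundaries will differ depending on the encoding and on when the writer checks the page size.

```java
import java.util.UUID;

// Hypothetical helper for a rough estimate; not part of parquet-mr.
class PageEstimate {
    // Assumption: ParquetWriter.DEFAULT_PAGE_SIZE is 1 MiB.
    static final int ASSUMED_PAGE_SIZE = 1024 * 1024;

    // Raw payload size: every UUID string is 36 ASCII characters (36 bytes in UTF-8).
    static long rawBytes(int records) {
        int bytesPerValue = UUID.randomUUID().toString().length();
        return (long) records * bytesPerValue;
    }

    // Number of pages needed if pages held exactly pageSize raw bytes (ceiling division).
    static long approxPages(long rawBytes, int pageSize) {
        return (rawBytes + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        long raw = rawBytes(100000);
        System.out.println(raw + " raw bytes, roughly "
                + approxPages(raw, ASSUMED_PAGE_SIZE) + " pages");
    }
}
```

Under these assumptions the column carries about 3.6 MB of raw string data, several times the assumed page size, so the reader has to cross page boundaries while scanning the column.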
> DeltaByteArrayReader fails with ArrayIndexOutOfBoundsException when moving
> across pages
> ---------------------------------------------------------------------------------------
>
> Key: PARQUET-244
> URL: https://issues.apache.org/jira/browse/PARQUET-244
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: parquet-mr_1.6.0
> Reporter: Alosh Bennett
>
> DeltaByteArrayReader.readBytes() fails with ArrayIndexOutOfBoundsException
> soon after it has processed a new page via initFromPage(). This is happening
> because