[
https://issues.apache.org/jira/browse/PARQUET-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484997#comment-14484997
]
Konstantin Shaposhnikov commented on PARQUET-246:
-------------------------------------------------
The following pull request seems to fix the problem:
https://github.com/apache/incubator-parquet-mr/pull/171
The following code can be used to generate a problematic Parquet file:
{code}
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroSchemaConverter;
import parquet.avro.AvroWriteSupport;
import parquet.column.ParquetProperties.WriterVersion;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.api.WriteSupport;
import parquet.hadoop.metadata.CompressionCodecName;

public class Test {

    @SuppressWarnings("unchecked")
    private static <T> WriteSupport<T> writeSupport(Schema avroSchema) {
        return (WriteSupport<T>) new AvroWriteSupport(
                new AvroSchemaConverter().convert(avroSchema), avroSchema);
    }

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"A\",\"namespace\":\"B\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

        new File("C:/Temp/test.parquet").delete();
        Path outputPath = new Path("C:/Temp/test.parquet");

        // Write with the v2 format, Snappy compression and dictionary encoding enabled.
        ParquetWriter<GenericRecord> parquetOut = new ParquetWriter<GenericRecord>(
                outputPath,
                Test.<GenericRecord>writeSupport(schema),
                CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE, // dictionary page size
                true,  // enable dictionary
                false, // validating
                WriterVersion.PARQUET_2_0,
                new Configuration());

        // Write enough records to span several pages so that the problem shows up.
        for (int i = 0; i < 100000; i++) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", "a" + i);
            parquetOut.write(record);
        }
        parquetOut.close();
    }
}
{code}
When the created file is dumped using the dump command from parquet-tools, the
exception is shown.
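The same failure can also be reproduced programmatically by reading the file back
through the Avro read path. Below is a minimal sketch, assuming the parquet-avro
1.6.0 AvroParquetReader API (the ReadBack class name is just an example); the read
aborts with the ParquetDecodingException quoted below once a later page is reached.
{code}
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroParquetReader;

public class ReadBack {
    public static void main(String[] args) throws Exception {
        // Read back the v2 file written by the generator above.
        AvroParquetReader<GenericRecord> reader =
                new AvroParquetReader<GenericRecord>(new Path("C:/Temp/test.parquet"));
        try {
            int count = 0;
            // The loop fails with ParquetDecodingException (caused by
            // ArrayIndexOutOfBoundsException) once the reader crosses into a later page.
            while (reader.read() != null) {
                count++;
            }
            System.out.println("Read " + count + " records");
        } finally {
            reader.close();
        }
    }
}
{code}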
> ArrayIndexOutOfBoundsException with Parquet write version v2
> ------------------------------------------------------------
>
> Key: PARQUET-246
> URL: https://issues.apache.org/jira/browse/PARQUET-246
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Konstantin Shaposhnikov
>
> I am getting the following exception when reading a parquet file that was
> created using Avro WriteSupport and Parquet write version v2.0:
> {noformat}
> Caused by: parquet.io.ParquetDecodingException: Can't read value in column [colName, rows, array, name] BINARY at value 313601 out of 428260, 1 out of 39200 in currentPage. repetition level: 0, definition level: 2
>     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
>     at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
>     at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>     at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
>     ... 27 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
>     at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
>     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
>     ... 30 more
> {noformat}
> The file is quite big (500 MB) so I cannot upload it here, but possibly there
> is enough information in the exception message to understand the cause of the
> error.
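The stack trace points at DeltaByteArrayReader.readBytes, which rebuilds each
DELTA_BYTE_ARRAY value from a prefix of the previously decoded value plus a stored
suffix. The sketch below is illustrative only (hypothetical names, not the
parquet-mr code); it shows how a recorded prefix length that exceeds the length of
the reader's previous value, for example if the writer's state is not reset at a
page boundary, leads to exactly this kind of out-of-bounds copy:
{code}
// Illustrative sketch of DELTA_BYTE_ARRAY (incremental) decoding.
// Hypothetical names; this is not the actual parquet-mr DeltaByteArrayReader.
public final class DeltaByteArrayDecodeSketch {

    // Value i is stored as (prefixLengths[i], suffixes[i]): the first
    // prefixLengths[i] bytes are shared with the previously decoded value,
    // only the suffix is stored explicitly.
    public static byte[][] decode(int[] prefixLengths, byte[][] suffixes) {
        byte[][] out = new byte[suffixes.length][];
        byte[] previous = new byte[0];
        for (int i = 0; i < suffixes.length; i++) {
            int prefixLength = prefixLengths[i];
            byte[] value = new byte[prefixLength + suffixes[i].length];
            // If prefixLength is larger than previous.length (e.g. a stale prefix
            // carried over from the previous page), this copy fails with an
            // ArrayIndexOutOfBoundsException, matching the symptom in this issue.
            System.arraycopy(previous, 0, value, 0, prefixLength);
            System.arraycopy(suffixes[i], 0, value, prefixLength, suffixes[i].length);
            out[i] = value;
            previous = value;
        }
        return out;
    }
}
{code}
If that is indeed the failure mode, the fix belongs on the writer side, which would
be consistent with the pull request linked in the comment above.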