The second bug is on https://issues.apache.org/jira/browse/PARQUET-152
The problem is that the dictionary page size is smaller than the length of the
fixed-length byte array. Make the two equal and you will be able to read that file.

- Sergio

On Thu, Jun 18, 2015 at 3:36 PM, Nezih Yigitbasi <nyigitb...@netflix.com.invalid> wrote:

> Yep I will, seemed like a bug to me too.
>
> Thanks,
> Nezih
>
> On Thu, Jun 18, 2015 at 1:33 PM, Ryan Blue <b...@cloudera.com> wrote:
>
> > The first issue looks like the delta byte array problem:
> >
> > https://issues.apache.org/jira/browse/PARQUET-246
> >
> > The second one looks like the write side uses delta_byte_array for fixed,
> > but the read side doesn't expect it. File a bug?
> >
> > rb
> >
> > On 06/18/2015 12:50 PM, Nezih Yigitbasi wrote:
> >
> >> Hi all,
> >>
> >> I have generated some test data using the method here
> >> <https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L68>.
> >>
> >> What I notice is if I use WriterVersion.PARQUET_2_0, the default block and
> >> page sizes, and GZIP compression (test case 1 below) I cannot read the file
> >> with parquet-tools dump (see stack trace below). When I switch to
> >> PARQUET_1_0 (test case 2 below) I can use dump tool to read the data.
> >> Weird enough when I reduce the number of rows I create to 1K and use
> >> PARQUET_2_0 writer again (test case 3) dump still fails but with a
> >> different exception.
> >>
> >> Are these known issues?
> >>
> >> Nezih
> >>
> >> Test Case 1 [FAILS]
> >>
> >> WriterVersion.PARQUET_2_0
> >> default block and page size
> >> GZIP compression
> >> 1M rows
> >>
> >> Schema:
> >>
> >> file schema: test
> >> --------------------------------------------------------------------------------
> >> binary_field: REQUIRED BINARY R:0 D:0
> >> int32_field: REQUIRED INT32 R:0 D:0
> >> int64_field: REQUIRED INT64 R:0 D:0
> >> boolean_field: REQUIRED BOOLEAN R:0 D:0
> >> float_field: REQUIRED FLOAT R:0 D:0
> >> double_field: REQUIRED DOUBLE R:0 D:0
> >> flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
> >> int96_field: REQUIRED INT96 R:0 D:0
> >>
> >> row group 1: RC:1000000 TS:38744008 OFFSET:4
> >> --------------------------------------------------------------------------------
> >> binary_field: BINARY GZIP DO:0 FPO:4 SZ:20683253/36526089/1.77 VC:1000000 ENC:DELTA_BYTE_ARRAY
> >> int32_field: INT32 GZIP DO:0 FPO:20683257 SZ:524/39330/75.06 VC:1000000 ENC:DELTA_BINARY_PACKED
> >> int64_field: INT64 GZIP DO:0 FPO:20683781 SZ:693/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
> >> boolean_field: BOOLEAN GZIP DO:0 FPO:20684474 SZ:63/43/0.68 VC:1000000 ENC:RLE
> >> float_field: FLOAT GZIP DO:0 FPO:20684537 SZ:362/242/0.67 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
> >> double_field: DOUBLE GZIP DO:0 FPO:20684899 SZ:694/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
> >> flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:20685593 SZ:2118034/2176246/1.03 VC:1000000 ENC:DELTA_BYTE_ARRAY
> >> int96_field: INT96 GZIP DO:0 FPO:22803627 SZ:1413/1062/0.75 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
> >>
> >> parquet-tools dump fails with:
> >>
> >> value 377601: R:0 D:0 V:parquet.io.ParquetDecodingException: Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
> >>     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
> >>     at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:410)
> >>     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:288)
> >>     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
> >>     at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
> >>     at parquet.tools.Main.main(Main.java:219)
> >> Caused by: java.lang.ArrayIndexOutOfBoundsException
> >>     at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
> >>     at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
> >>     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
> >>     ... 5 more
> >> Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
> >>
> >> Test Case 2 [SUCCEEDS]
> >>
> >> WriterVersion.PARQUET_1_0
> >> default block and page size
> >> GZIP compression
> >> 1M rows
> >>
> >> Schema:
> >>
> >> file schema: test
> >> --------------------------------------------------------------------------------
> >> binary_field: REQUIRED BINARY R:0 D:0
> >> int32_field: REQUIRED INT32 R:0 D:0
> >> int64_field: REQUIRED INT64 R:0 D:0
> >> boolean_field: REQUIRED BOOLEAN R:0 D:0
> >> float_field: REQUIRED FLOAT R:0 D:0
> >> double_field: REQUIRED DOUBLE R:0 D:0
> >> flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
> >> int96_field: REQUIRED INT96 R:0 D:0
> >>
> >> row group 1: RC:1000000 TS:1070161196 OFFSET:4
> >> --------------------------------------------------------------------------------
> >> binary_field: BINARY GZIP DO:0 FPO:4 SZ:21862183/40004054/1.83 VC:1000000 ENC:PLAIN,BIT_PACKED
> >> int32_field: INT32 GZIP DO:0 FPO:21862187 SZ:1383313/4000159/2.89 VC:1000000 ENC:PLAIN,BIT_PACKED
> >> int64_field: INT64 GZIP DO:0 FPO:23245500 SZ:572/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
> >> boolean_field: BOOLEAN GZIP DO:0 FPO:23246072 SZ:188/125032/665.06 VC:1000000 ENC:PLAIN,BIT_PACKED
> >> float_field: FLOAT GZIP DO:0 FPO:23246260 SZ:273/173/0.63 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
> >> double_field: DOUBLE GZIP DO:0 FPO:23246533 SZ:573/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
> >> flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:23247106 SZ:3057410/1026030079/335.59 VC:1000000 ENC:PLAIN,BIT_PACKED
> >> int96_field: INT96 GZIP DO:0 FPO:26304516 SZ:1236/905/0.73 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
> >>
> >> Test Case 3 [FAILS]
> >>
> >> WriterVersion.PARQUET_2_0
> >> default block and page size
> >> GZIP compression
> >> 1K rows
> >>
> >> Schema:
> >>
> >> file schema: test
> >> --------------------------------------------------------------------------------
> >> binary_field: REQUIRED BINARY R:0 D:0
> >> int32_field: REQUIRED INT32 R:0 D:0
> >> int64_field: REQUIRED INT64 R:0 D:0
> >> boolean_field: REQUIRED BOOLEAN R:0 D:0
> >> float_field: REQUIRED FLOAT R:0 D:0
> >> double_field: REQUIRED DOUBLE R:0 D:0
> >> flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
> >> int96_field: REQUIRED INT96 R:0 D:0
> >>
> >> row group 1: RC:1000 TS:40502 OFFSET:4
> >> --------------------------------------------------------------------------------
> >> binary_field: BINARY GZIP DO:0 FPO:4 SZ:21466/36672/1.71 VC:1000 ENC:DELTA_BYTE_ARRAY
> >> int32_field: INT32 GZIP DO:0 FPO:21470 SZ:70/85/1.21 VC:1000 ENC:DELTA_BINARY_PACKED
> >> int64_field: INT64 GZIP DO:0 FPO:21540 SZ:106/71/0.67 VC:1000 ENC:RLE_DICTIONARY,PLAIN
> >> boolean_field: BOOLEAN GZIP DO:0 FPO:21646 SZ:60/40/0.67 VC:1000 ENC:RLE
> >> float_field: FLOAT GZIP DO:0 FPO:21706 SZ:99/59/0.60 VC:1000 ENC:RLE_DICTIONARY,PLAIN
> >> double_field: DOUBLE GZIP DO:0 FPO:21805 SZ:107/71/0.66 VC:1000 ENC:RLE_DICTIONARY,PLAIN
> >> flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY
> >> int96_field: INT96 GZIP DO:0 FPO:24064 SZ:114/83/0.73 VC:1000 ENC:RLE_DICTIONARY,PLAIN
> >>
> >> parquet-tools dump fails when dumping the fixed len byte array field:
> >>
> >> FIXED_LEN_BYTE_ARRAY flba_field
> >> --------------------------------------------------------------------------------
> >> parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
> >>     at parquet.column.Encoding$7.getValuesReader(Encoding.java:196)
> >>     at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:537)
> >>     at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:577)
> >>     at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:57)
> >>     at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:521)
> >>     at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:513)
> >>     at parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
> >>     at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:513)
> >>     at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:505)
> >>     at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:607)
> >>     at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:351)
> >>     at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
> >>     at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
> >>     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:278)
> >>     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
> >>     at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
> >>     at parquet.tools.Main.main(Main.java:219)
> >> Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
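
For anyone reading this later, here is a rough toy model of the explanation at the top of this message. It is not parquet-mr code: the function names, the "dictionary fits if its bytes are within the dictionary page size" rule, and the sizes 512 and 1024 are all assumptions made up for illustration. It only sketches the idea that a too-small dictionary page size makes the 2.0 writer fall back to DELTA_BYTE_ARRAY for the fixed-length column, which the reader then rejects because the column is not BINARY.

```python
from enum import Enum


class Encoding(Enum):
    RLE_DICTIONARY = "RLE_DICTIONARY"
    DELTA_BYTE_ARRAY = "DELTA_BYTE_ARRAY"


def choose_flba_encoding(dictionary_page_size, fixed_length, distinct_values=1):
    """Hypothetical 2.0-style writer choice for a FIXED_LEN_BYTE_ARRAY column.

    Assumption: the dictionary is kept only while all distinct values fit
    within dictionary_page_size bytes; otherwise the writer falls back.
    """
    dictionary_bytes = distinct_values * fixed_length
    if dictionary_bytes <= dictionary_page_size:
        return Encoding.RLE_DICTIONARY
    return Encoding.DELTA_BYTE_ARRAY


def can_read(column_type, encoding):
    """Hypothetical reader check, modeled on the error text in the trace:
    'Encoding DELTA_BYTE_ARRAY is only supported for type BINARY'."""
    if encoding is Encoding.DELTA_BYTE_ARRAY and column_type != "BINARY":
        return False
    return True


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from DataGenerator.
    for page_size in (512, 1024):
        enc = choose_flba_encoding(page_size, fixed_length=1024)
        readable = can_read("FIXED_LEN_BYTE_ARRAY", enc)
        print(page_size, enc.value, "readable" if readable else "unreadable")
```

In this sketch, raising the dictionary page size to the fixed length keeps the column dictionary-encoded, which matches the workaround suggested above. The underlying writer/reader mismatch is still the bug tracked in PARQUET-152.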