The second bug is tracked at https://issues.apache.org/jira/browse/PARQUET-152

The problem is that the dictionary page size is smaller than the fixed-length
byte array size. Make the two equal, and you will be able to read that file.
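
To spot affected columns quickly, here is a minimal Python sketch that parses
the column summary lines quoted below (once the mail wrapping is undone) and
flags the combination that the reader rejects in test case 3. It is not part
of parquet-mr: the regex and the names parse_column_line and
rejected_by_reader are my own assumptions, and the SZ field is read as
compressed/uncompressed/ratio because that is what the numbers in the thread
work out to.

```python
import re

# One column summary line, as quoted in this thread, for example:
#   flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY
# The pattern is an assumption based only on the lines shown in this thread.
COLUMN_LINE = re.compile(
    r"^(?P<name>\w+):\s+(?P<type>\w+)\s+(?P<codec>\w+)\s+.*?"
    r"SZ:(?P<compressed>\d+)/(?P<uncompressed>\d+)/[\d.]+\s+"
    r"VC:(?P<values>\d+)\s+ENC:(?P<encodings>[\w,]+)$"
)


def parse_column_line(line):
    """Return the fields of one column line as a dict, or None if it does not match."""
    match = COLUMN_LINE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["encodings"] = info["encodings"].split(",")
    for key in ("compressed", "uncompressed", "values"):
        info[key] = int(info[key])
    return info


def rejected_by_reader(info):
    """True when a non-BINARY column uses DELTA_BYTE_ARRAY (the test case 3 error)."""
    return info["type"] != "BINARY" and "DELTA_BYTE_ARRAY" in info["encodings"]


if __name__ == "__main__":
    line = ("flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 "
            "SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY")
    info = parse_column_line(line)
    print(info["name"], rejected_by_reader(info))
```

Note that this only catches the second bug. The binary_field column in test
case 1 is a BINARY column, so the check passes it even though that file fails
for the separate reason tracked in PARQUET-246.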

- Sergio

On Thu, Jun 18, 2015 at 3:36 PM, Nezih Yigitbasi <nyigitb...@netflix.com.invalid> wrote:

> Yep I will, seemed like a bug to me too.
>
> Thanks,
> Nezih
>
> On Thu, Jun 18, 2015 at 1:33 PM, Ryan Blue <b...@cloudera.com> wrote:
>
> > The first issue looks like the delta byte array problem:
> >
> >   https://issues.apache.org/jira/browse/PARQUET-246
> >
> > The second one looks like the write side uses delta_byte_array for fixed,
> > but the read side doesn't expect it. File a bug?
> >
> > rb
> >
> > On 06/18/2015 12:50 PM, Nezih Yigitbasi wrote:
> >
> >> Hi all,
> >>
> >> I have generated some test data using the method here
> >> <https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L68>.
> >>
> >> What I notice is that if I use WriterVersion.PARQUET_2_0, the default
> >> block and page sizes, and GZIP compression (test case 1 below), I cannot
> >> read the file with parquet-tools dump (see stack trace below). When I
> >> switch to PARQUET_1_0 (test case 2 below), I can use the dump tool to
> >> read the data. Weirdly enough, when I reduce the number of rows to 1K and
> >> use the PARQUET_2_0 writer again (test case 3), dump still fails, but
> >> with a different exception.
> >>
> >> Are these known issues?
> >>
> >> Nezih
> >>
> >> Test Case 1 [FAILS]
> >>
> >> WriterVersion.PARQUET_2_0
> >> default block and page size
> >> GZIP compression
> >> 1M rows
> >>
> >> Schema:
> >>
> >> file schema:   test
> >> --------------------------------------------------------------------------------
> >> binary_field:  REQUIRED BINARY R:0 D:0
> >> int32_field:   REQUIRED INT32 R:0 D:0
> >> int64_field:   REQUIRED INT64 R:0 D:0
> >> boolean_field: REQUIRED BOOLEAN R:0 D:0
> >> float_field:   REQUIRED FLOAT R:0 D:0
> >> double_field:  REQUIRED DOUBLE R:0 D:0
> >> flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
> >> int96_field:   REQUIRED INT96 R:0 D:0
> >>
> >> row group 1:   RC:1000000 TS:38744008 OFFSET:4
> >> --------------------------------------------------------------------------------
> >> binary_field:   BINARY GZIP DO:0 FPO:4 SZ:20683253/36526089/1.77 VC:1000000 ENC:DELTA_BYTE_ARRAY
> >> int32_field:    INT32 GZIP DO:0 FPO:20683257 SZ:524/39330/75.06 VC:1000000 ENC:DELTA_BINARY_PACKED
> >> int64_field:    INT64 GZIP DO:0 FPO:20683781 SZ:693/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
> >> boolean_field:  BOOLEAN GZIP DO:0 FPO:20684474 SZ:63/43/0.68 VC:1000000 ENC:RLE
> >> float_field:    FLOAT GZIP DO:0 FPO:20684537 SZ:362/242/0.67 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
> >> double_field:   DOUBLE GZIP DO:0 FPO:20684899 SZ:694/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
> >> flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:20685593 SZ:2118034/2176246/1.03 VC:1000000 ENC:DELTA_BYTE_ARRAY
> >> int96_field:    INT96 GZIP DO:0 FPO:22803627 SZ:1413/1062/0.75 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
> >>
> >> parquet-tools dump fails with:
> >>
> >> value 377601: R:0 D:0 V:parquet.io.ParquetDecodingException: Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
> >>      at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
> >>      at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:410)
> >>      at parquet.tools.command.DumpCommand.dump(DumpCommand.java:288)
> >>      at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
> >>      at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
> >>      at parquet.tools.Main.main(Main.java:219)
> >> Caused by: java.lang.ArrayIndexOutOfBoundsException
> >>      at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
> >>      at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
> >>      at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
> >>      ... 5 more
> >> Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
> >>
> >> Test Case 2 [SUCCEEDS]
> >>
> >> WriterVersion.PARQUET_1_0
> >> default block and page size
> >> GZIP compression
> >> 1M rows
> >>
> >> Schema:
> >>
> >> file schema:   test
> >> --------------------------------------------------------------------------------
> >> binary_field:  REQUIRED BINARY R:0 D:0
> >> int32_field:   REQUIRED INT32 R:0 D:0
> >> int64_field:   REQUIRED INT64 R:0 D:0
> >> boolean_field: REQUIRED BOOLEAN R:0 D:0
> >> float_field:   REQUIRED FLOAT R:0 D:0
> >> double_field:  REQUIRED DOUBLE R:0 D:0
> >> flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
> >> int96_field:   REQUIRED INT96 R:0 D:0
> >>
> >> row group 1:   RC:1000000 TS:1070161196 OFFSET:4
> >> --------------------------------------------------------------------------------
> >> binary_field:   BINARY GZIP DO:0 FPO:4 SZ:21862183/40004054/1.83 VC:1000000 ENC:PLAIN,BIT_PACKED
> >> int32_field:    INT32 GZIP DO:0 FPO:21862187 SZ:1383313/4000159/2.89 VC:1000000 ENC:PLAIN,BIT_PACKED
> >> int64_field:    INT64 GZIP DO:0 FPO:23245500 SZ:572/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
> >> boolean_field:  BOOLEAN GZIP DO:0 FPO:23246072 SZ:188/125032/665.06 VC:1000000 ENC:PLAIN,BIT_PACKED
> >> float_field:    FLOAT GZIP DO:0 FPO:23246260 SZ:273/173/0.63 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
> >> double_field:   DOUBLE GZIP DO:0 FPO:23246533 SZ:573/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
> >> flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:23247106 SZ:3057410/1026030079/335.59 VC:1000000 ENC:PLAIN,BIT_PACKED
> >> int96_field:    INT96 GZIP DO:0 FPO:26304516 SZ:1236/905/0.73 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
> >>
> >> Test Case 3 [FAILS]
> >>
> >> WriterVersion.PARQUET_2_0
> >> default block and page size
> >> GZIP compression
> >> 1K rows
> >>
> >> Schema:
> >>
> >> file schema:   test
> >> --------------------------------------------------------------------------------
> >> binary_field:  REQUIRED BINARY R:0 D:0
> >> int32_field:   REQUIRED INT32 R:0 D:0
> >> int64_field:   REQUIRED INT64 R:0 D:0
> >> boolean_field: REQUIRED BOOLEAN R:0 D:0
> >> float_field:   REQUIRED FLOAT R:0 D:0
> >> double_field:  REQUIRED DOUBLE R:0 D:0
> >> flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
> >> int96_field:   REQUIRED INT96 R:0 D:0
> >>
> >> row group 1:   RC:1000 TS:40502 OFFSET:4
> >> --------------------------------------------------------------------------------
> >> binary_field:   BINARY GZIP DO:0 FPO:4 SZ:21466/36672/1.71 VC:1000 ENC:DELTA_BYTE_ARRAY
> >> int32_field:    INT32 GZIP DO:0 FPO:21470 SZ:70/85/1.21 VC:1000 ENC:DELTA_BINARY_PACKED
> >> int64_field:    INT64 GZIP DO:0 FPO:21540 SZ:106/71/0.67 VC:1000 ENC:RLE_DICTIONARY,PLAIN
> >> boolean_field:  BOOLEAN GZIP DO:0 FPO:21646 SZ:60/40/0.67 VC:1000 ENC:RLE
> >> float_field:    FLOAT GZIP DO:0 FPO:21706 SZ:99/59/0.60 VC:1000 ENC:RLE_DICTIONARY,PLAIN
> >> double_field:   DOUBLE GZIP DO:0 FPO:21805 SZ:107/71/0.66 VC:1000 ENC:RLE_DICTIONARY,PLAIN
> >> flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY
> >> int96_field:    INT96 GZIP DO:0 FPO:24064 SZ:114/83/0.73 VC:1000 ENC:RLE_DICTIONARY,PLAIN
> >>
> >> parquet-tools dump fails when dumping the fixed len byte array field:
> >>
> >> FIXED_LEN_BYTE_ARRAY flba_field
> >> --------------------------------------------------------------------------------
> >> parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
> >>      at parquet.column.Encoding$7.getValuesReader(Encoding.java:196)
> >>      at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:537)
> >>      at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:577)
> >>      at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:57)
> >>      at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:521)
> >>      at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:513)
> >>      at parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
> >>      at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:513)
> >>      at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:505)
> >>      at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:607)
> >>      at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:351)
> >>      at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
> >>      at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
> >>      at parquet.tools.command.DumpCommand.dump(DumpCommand.java:278)
> >>      at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
> >>      at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
> >>      at parquet.tools.Main.main(Main.java:219)
> >> Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>
