Ruta Dhaneshwar created ARROW-10442:
---------------------------------------

             Summary: WriteBatchSpaced writes incorrect value for parquet when input contains NULL list
                 Key: ARROW-10442
                 URL: https://issues.apache.org/jira/browse/ARROW-10442
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Ruta Dhaneshwar
         Attachments: image-2020-10-30-17-44-55-191.png, image-2020-10-30-17-45-37-423.png, image-2020-10-30-17-46-33-370.png, image-2020-10-30-17-47-31-022.png, image-2020-10-30-17-48-11-872.png

When trying to write a column of parquet lists, if there is a NULL list, WriteBatchSpaced will either throw an error (case 1 below) or incorrectly write the last value in the last list as the first value from the first list (case 2 below).

Schema:

message schema {
  optional group _COL_0 (LIST) {
    repeated group list {
      optional binary item (UTF8);
    }
  }
}

*CASE 1*

Data (3 lists):

[ "one" ]
null
[ "two" ]

Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
# num_values: 3
# def_levels: [3, 0, 3]
# rep_levels: [0, 0, 0]
# valid_bits: 0x05 (bit representation 101)
# valid_bits_offset: 0
# values: ["one", nullptr, "two"]

When I use WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values), I get the following error when running [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] on the outputted parquet file:

!image-2020-10-30-17-45-37-423.png|width=358,height=56!

!image-2020-10-30-17-46-33-370.png|width=638,height=210!

Additionally, if I add another list into the data that I write, then the last element of that additional list is incorrectly written as the first element of the first list. See below.

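For reference, the level arrays in this report can be derived mechanically from the schema. The following is a minimal standalone sketch, not Arrow or Parquet code: the names Item, List, Shredded and ShredLevels are hypothetical helpers introduced only for illustration. It assumes the usual level assignment for this three-level LIST schema (0 = null list, 1 = empty list, 2 = null item, 3 = item present) and it packs valid_bits LSB-first with one bit per level entry, which is the layout this report passes in. Whether the writer actually expects a slot for a null list is exactly what is in question here, so that part should be read as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper types for illustration; none of this is Arrow/Parquet API.
using Item = std::optional<std::string>;        // nullopt = null item
using List = std::optional<std::vector<Item>>;  // nullopt = null list

struct Shredded {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<uint8_t> valid_bits;  // LSB-first, one bit per level entry (assumption)
};

// Definition levels for
//   optional group (LIST) / repeated group list / optional binary item:
//   0 = list is null, 1 = list is empty, 2 = item is null, 3 = item is present.
// Repetition level is 0 at the start of each row and 1 for later items in a list.
Shredded ShredLevels(const std::vector<List>& rows) {
  Shredded out;
  auto emit = [&out](int16_t def, int16_t rep) {
    const std::size_t slot = out.def_levels.size();
    out.def_levels.push_back(def);
    out.rep_levels.push_back(rep);
    if (slot % 8 == 0) out.valid_bits.push_back(0);  // start a new bitmap byte
    if (def == 3) {
      out.valid_bits[slot / 8] |= static_cast<uint8_t>(1u << (slot % 8));
    }
  };
  for (const List& row : rows) {
    if (!row.has_value()) { emit(0, 0); continue; }  // null list
    if (row->empty())     { emit(1, 0); continue; }  // empty list
    for (std::size_t i = 0; i < row->size(); ++i) {
      const int16_t rep = (i == 0) ? 0 : 1;
      emit((*row)[i].has_value() ? 3 : 2, rep);
    }
  }
  assert(out.def_levels.size() == out.rep_levels.size());
  return out;
}
```

Calling ShredLevels on the three lists of CASE 1 yields def_levels [3, 0, 3], rep_levels [0, 0, 0] and a single bitmap byte 0x05, matching the parameters listed above.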
*CASE 2*

Data (4 lists):

[ "one" ]
null
[ "two" ]
[ "three", "four" ]

Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
# num_values: 5
# def_levels: [3, 0, 3, 3, 3]
# rep_levels: [0, 0, 0, 0, 1]
# valid_bits: 0x1D (decimal 29, bit representation 11101)
# valid_bits_offset: 0
# values: ["one", nullptr, "two", "three", "four"]

Output Parquet file:

!image-2020-10-30-17-47-31-022.png|width=77,height=155!

!image-2020-10-30-17-48-11-872.png|width=233,height=75!

Here we see that the "four" in the last list actually shows up as "one".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)