[ https://issues.apache.org/jira/browse/ARROW-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou resolved ARROW-14047.
------------------------------------
    Resolution: Fixed

Patch was incorporated with ARROW-15550.

> [C++] [Parquet] FileReader returns inconsistent results on repeat reads
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14047
>                 URL: https://issues.apache.org/jira/browse/ARROW-14047
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 5.0.0, 6.0.0, 6.0.1, 7.0.0
>         Environment: Centos 7 gcc 9.2.0
>            Reporter: Radu Teodorescu
>            Assignee: Will Jones
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.1, 8.0.0
>
>         Attachments: Capture.PNG, writeReadRowGroup.parquet
>
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> We are seeing that for certain data sets when dealing with lists of structs,
> repeated reads yield different results - I have a file that exhibits this
> behavior and below is the code for reproducing it:
> {code:java}
> filesystem::path filePath = dirPath / "writeReadRowGroup.parquet";
> arrow::MemoryPool *pool = arrow::default_memory_pool();
> std::shared_ptr<arrow::io::ReadableFile> infile;
> PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(filePath, pool));
> std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
> auto status = parquet::arrow::OpenFile(infile, pool, &arrow_reader);
> CHECK_OK(status);
> std::shared_ptr<arrow::Schema> readSchema;
> CHECK_OK(arrow_reader->GetSchema(&readSchema));
> std::shared_ptr<arrow::Table> table;
> std::vector<int> indicesToGet;
> CHECK_OK(arrow_reader->ReadTable(&table));
> auto recordListCol1 = arrow::Table::Make(
>     arrow::schema({table->schema()->GetFieldByName("recordList")}),
>     {table->GetColumnByName("recordList")});
> for (int i = 0; i < 20; ++i) {
>   cout << "data reread operation number = " + std::to_string(i) << endl;
>   std::shared_ptr<arrow::Table> table2;
>   CHECK_OK(arrow_reader->ReadTable(&table2));
>   auto recordListCol2 = arrow::Table::Make(
>       arrow::schema({table2->schema()->GetFieldByName("recordList")}),
>       {table2->GetColumnByName("recordList")});
>   bool equals = recordListCol1->Equals(*recordListCol2);
>   if (!equals) {
>     cout << recordListCol1->ToString() << endl;
>     cout << endl << "new table" << endl;
>     cout << recordListCol2->ToString() << endl;
>     throw std::runtime_error("Subsequent re-read failure ");
>   }
> }
> {code}
> Apparently, as shown in the attached capture the state machine used to track
> nulls is broken on subsequent usage

--
This message was sent by Atlassian Jira
(v8.20.1#820001)