timothydijamco commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2627468938
Superficially (in the context of "scan node") it looks safe from what I can tell from the code: * The `Fragment`s whose `physical_schema_`s we would delete are **first created** at the beginning of `MakeScanNode` ([here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L1007), where `dataset->GetFragments` is called). * These `Fragment`s **do not propogate into** the `ExecNode` that is created at the end of `MakeScanNode` (since we're passing a generator of only *`ExecBatch`es* to `acero::MakeExecNode`). * So it seems like we have to make sure that any usages of `Fragment` within the long chain of generator logic between <when the `Fragment`s are created> and <when they have finally been turned into just `ExecBatch`es> don't need to use `physical_schema_` _before_ we clear `physical_schema_`. Where is `Fragment` used in this long chain of generator logic? * A call to `Fragment::ScanBatchesAsync` [here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L307) * In your PR this currently happens before you trigger the clearing of any fields of `Fragment` (i.e. before we would clear `physical_schema_`) anyways, so this call won't be a problem * A call to `Fragment::partition_expression` [here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L1048) * Today, this (happens to) not use `physical_schema_` — the behavior of this method for `Fragment` and all its subclasses is to return a `partition_expression_` value that was passed to the constructor of `Fragment`. * A call to `Fragment::ToString` [here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L1059) * Today, this (happens to) not use `physical_schema_` — the behavior of this method is to return a `type_name()` value that is overriden by each subclass to return a constant. In summary, it seems that today, clearing `physical_schema_` [here](https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/scanner.cc#L323) (as you're doing in your PR) would be safe, although doing it [here](https://github.com/apache/arrow/blob/42b84bd0330522e490cd5c57ae2cce491d0a1666/cpp/src/arrow/dataset/scanner.cc#L1060) (i.e. after the last time the `Fragment`s are used) is safest. --- I locally added the below check to the end of your "Parquet cached metadata" test [here](https://github.com/apache/arrow/pull/45330/files#diff-d88654840d0432223c1617e8fd9289db0f4e6fff6b34e9f062861ef8eec724fcL349) and it runs OK. I think this exercises that `physical_schema_` is re-populated and usable after previously clearing it and then doing another read (because `CountRows` calls into `ParquetFileFragment::TestRowGroups` which makes use of `physical_schema_` [here](https://github.com/apache/arrow/blob/f8723722341df31c0091c91ec451319ded58c214/cpp/src/arrow/dataset/file_parquet.cc#L927)) ``` // ... auto predicate = compute::equal(compute::field_ref("x"), compute::literal(0)); ASSERT_OK_AND_ASSIGN(predicate, predicate.Bind(*test_schema)); ASSERT_FINISHES_OK_AND_EQ(std::make_optional<int64_t>(1), pq_fragment->CountRows(predicate, options)); } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
