laserninja commented on PR #16110: URL: https://github.com/apache/iceberg/pull/16110#issuecomment-4418042377
`Ignored` can be the right abstraction. The key semantic difference from `AlwaysFalse` is in `or`: `or(x, AlwaysFalse)` correctly simplifies to `x`, but `or(x, Ignored)` must stay `Ignored` because the missing column could have matching rows, so we can't push down the OR at all.

Proposed semantics:

- `not(Ignored)` → `Ignored`
- `or(x, Ignored)` / `or(Ignored, x)` → `Ignored` (can't push; might miss rows)
- `and(real, Ignored)` / `and(Ignored, real)` → `real` (safe to push the resolvable side; the full AND can only be more restrictive than the resolvable side alone)
- `and(Ignored, Ignored)` → `Ignored`
- `convert()`: treat `Ignored.INSTANCE` the same as `AlwaysTrue.INSTANCE` → NOOP

This gives "correct result in many cases" for AND-heavy filters on partially evolved files.

I'll update the PR with:

- The `Ignored` sentinel class and the visitor changes above
- A `TableScan`-level integration test using the schema evolution scenario (write file without column, add column, scan with predicate on the new column)
- Updated unit tests in `TestParquetFilters`

Do the `and` semantics look right to you, or would you prefer `and(real, Ignored)` → `Ignored` (simpler, NOOP the whole thing)?
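
To make the proposed combination rules concrete, here is a minimal standalone sketch in Java. It is not the PR's code and does not use Iceberg's or Parquet's real classes: `IgnoredSemanticsSketch`, `Filter`, `Filter.real`, and the string-based `convert` are all hypothetical names invented for illustration, and a "real" filter is modeled as just a description string.

```java
import java.util.Objects;

/** Hypothetical, simplified model of the proposed rules; not Iceberg's actual API. */
public final class IgnoredSemanticsSketch {

  /** A pushed-down filter: either a resolvable ("real") predicate or the IGNORED sentinel. */
  public static final class Filter {
    /** Sentinel for a predicate on a column that is missing from the file. */
    public static final Filter IGNORED = new Filter("Ignored");

    private final String description;

    private Filter(String description) {
      this.description = description;
    }

    /** Creates a resolvable predicate, modeled here only by its description. */
    public static Filter real(String description) {
      return new Filter(Objects.requireNonNull(description));
    }

    /** Identity comparison, so a real filter that happens to be named "Ignored" is not the sentinel. */
    public boolean isIgnored() {
      return this == IGNORED;
    }

    @Override
    public String toString() {
      return description;
    }
  }

  /** not(Ignored) stays Ignored; otherwise the child is negated. */
  public static Filter not(Filter child) {
    return child.isIgnored() ? Filter.IGNORED : Filter.real("not(" + child + ")");
  }

  /** or with an ignored side cannot be pushed down: the missing column might match rows. */
  public static Filter or(Filter left, Filter right) {
    if (left.isIgnored() || right.isIgnored()) {
      return Filter.IGNORED;
    }
    return Filter.real("or(" + left + ", " + right + ")");
  }

  /** and keeps the resolvable side; and(Ignored, Ignored) falls out as Ignored. */
  public static Filter and(Filter left, Filter right) {
    if (left.isIgnored()) {
      return right;
    }
    if (right.isIgnored()) {
      return left;
    }
    return Filter.real("and(" + left + ", " + right + ")");
  }

  /** Final conversion: Ignored becomes a no-op filter, modeled here as the string "NOOP". */
  public static String convert(Filter filter) {
    return filter.isIgnored() ? "NOOP" : filter.toString();
  }

  public static void main(String[] args) {
    Filter x = Filter.real("x > 5");
    System.out.println(convert(or(x, Filter.IGNORED)));
    System.out.println(convert(and(x, Filter.IGNORED)));
    System.out.println(convert(not(Filter.IGNORED)));
  }
}
```

One caveat the sketch makes visible: `and(real, Ignored)` → `real` is only conservative when the `and` is not nested under a `not`. With these rules, `not(and(x, Ignored))` would reduce to `not(x)`, which can drop matching rows, so the sketch assumes negations have already been rewritten down to the leaves before the rules are applied.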
