rok commented on PR #45360: URL: https://github.com/apache/arrow/pull/45360#issuecomment-2651021582
Thanks for doing this @kszucs! I like that this doesn't require any changes to readers.

Questions:
- As it stands in this PR, CDC is either on or off for all columns. How about enabling it per column? In the general case, some columns might not be worthwhile candidates for it.
- The use case described in the [HF blog post](https://huggingface.co/blog/improve_parquet_dedupe) covers cases where rows are added or removed but little else changes. Wouldn't it then make sense to try a shortcut deduplication first: if we identify a duplicate in the first column, check for the same duplicate at the same indices in all other columns before running a full hashing pass?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
