Thanks, Antoine, for sharing the blog post! I skimmed it, and the main issue seems to be the absolute file offsets stored in the page and column chunk metadata. Replacing them with relative offsets in the current Thrift definition could take a long time to migrate. Perhaps the current FlatBuffers experiment is a good chance to improve this?
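To make the offset problem concrete, here is a toy sketch. It is not real Parquet code: the `layout` helper is hypothetical, and "relative" is assumed here to mean relative to the start of the enclosing row group, which is only one possible design. It shows that growing one column chunk by a few bytes changes the absolute offsets recorded for every later chunk, while the row-group-relative offsets of untouched row groups stay the same.

```python
def layout(row_groups, header_len=4):
    """Compute toy column chunk offsets for a simplified file layout.

    row_groups: list of row groups, each a list of chunk sizes in bytes.
    header_len: bytes before the first chunk (Parquet starts with the
                4-byte magic "PAR1").
    Returns (absolute, relative), where relative offsets are measured from
    the start of the enclosing row group (an assumption for illustration).
    """
    absolute, relative = [], []
    pos = header_len
    for chunk_sizes in row_groups:
        rg_start = pos
        abs_rg, rel_rg = [], []
        for size in chunk_sizes:
            abs_rg.append(pos)             # offset from start of file
            rel_rg.append(pos - rg_start)  # offset from start of row group
            pos += size
        absolute.append(abs_rg)
        relative.append(rel_rg)
    return absolute, relative


# Same data, except the second chunk of row group 0 grew by 10 bytes.
before = [[100, 200], [150, 250]]
after = [[100, 210], [150, 250]]

abs_before, rel_before = layout(before)
abs_after, rel_after = layout(after)

# Row group 1 is byte-for-byte unchanged, yet its absolute offsets differ,
# so metadata describing it would no longer deduplicate.
print("absolute, row group 1:", abs_before[1], "->", abs_after[1])
print("relative, row group 1:", rel_before[1], "->", rel_after[1])
```

With relative offsets, the metadata for the unchanged row group would be identical across both versions of the file, which is the property that helps content-based deduplication.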
Best,
Gang

On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <[email protected]> wrote:

> Hello,
>
> The Hugging Face developers published this insightful blog post about
> their attempts to deduplicate Parquet files when they have similar
> contents. They offer a couple suggestions for improvement at the end:
> https://huggingface.co/blog/improve_parquet_dedupe
>
> Regards
>
> Antoine.
