Thanks Antoine for sharing the blog post!

I skimmed it quickly, and the main issue seems to be the absolute
file offsets stored in the page and column chunk metadata. Replacing
them with relative offsets in the current Thrift definition could take
a long time to migrate. Perhaps the ongoing FlatBuffers experiment is a
good chance to improve this?
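
To illustrate the offset problem, here is a minimal sketch in Python.
It is a toy model, not the real Parquet layout: the chunk sizes are
made up, the only header is the 4-byte magic, and the "relative" scheme
(storing just each chunk's size) is a hypothetical stand-in for
whatever a new metadata format might choose.

```python
from itertools import accumulate

def absolute_offsets(chunk_sizes, header_size=4):
    """Start offset of each chunk, measured from the start of the file.

    header_size=4 models only the leading magic bytes; real files carry
    more structure, so treat this as a simplification.
    """
    starts = accumulate([0] + chunk_sizes[:-1])
    return [header_size + start for start in starts]

def changed_entries(before, after):
    """Count metadata entries that differ between two versions."""
    return sum(a != b for a, b in zip(before, after))

# Hypothetical chunk sizes in bytes; the second chunk grows by 10 bytes.
original = [100, 250, 80, 300]
modified = [100, 260, 80, 300]

abs_before = absolute_offsets(original)
abs_after = absolute_offsets(modified)

# With absolute offsets, every chunk after the edited one gets a new
# metadata entry, even though its data is unchanged.
print("absolute entries changed:", changed_entries(abs_before, abs_after))

# With size-only (relative) metadata, only the edited chunk's entry changes.
print("relative entries changed:", changed_entries(original, modified))
```

In this toy case one edit changes two absolute entries but only one
size entry, and the gap widens with the number of chunks that follow
the edit.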

Best,
Gang

On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <[email protected]> wrote:

>
> Hello,
>
> The Hugging Face developers published this insightful blog post about
> their attempts to deduplicate Parquet files when they have similar
> contents. They offer a couple suggestions for improvement at the end:
> https://huggingface.co/blog/improve_parquet_dedupe
>
> Regards
>
> Antoine.
>
>
>
