Yep! Definitely something we can look into. I have not thought about it too much yet... (Sounds tricky, but who knows.) I was talking to Kenny Daniel (hyparquet), and he noticed that the offsets in the Parquet format are signed integers. So an "in-place substitution" could be to replace them with negative integers to denote relative offsets...
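Very roughly, something like the following. This is a purely hypothetical convention just for illustration, not a concrete proposal; today the spec only defines non-negative absolute offsets, and the function name is made up:

    # Hypothetical convention, NOT part of the Parquet spec: a negative
    # file_offset is read as "relative to the start of the row group",
    # while a non-negative value keeps today's absolute meaning, so
    # existing files would remain readable unchanged.
    def resolve_offset(file_offset: int, row_group_start: int) -> int:
        if file_offset < 0:
            return row_group_start + (-file_offset)  # relative encoding
        return file_offset                           # absolute, as today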
Anyway, in the meantime, there are a couple of ideas floating around here.
1: A content-based heuristic for row group chunking. Is this of interest? If so, I can work on plumbing it as an option into the Parquet writers. (A rough sketch of the kind of boundary rule I have in mind is at the bottom of this mail.)
2: If there is anything else with relative offsets I can help with.

Thanks,
Yucheng

> On Oct 10, 2024, at 2:58 AM, Andrew Lamb <[email protected]> wrote:
>
> (I see now this is probably what you mean by "implement a specialized format pre-process to modify those offsets for storage purposes")
>
> On Thu, Oct 10, 2024 at 5:57 AM Andrew Lamb <[email protected]> wrote:
>
>>> It is not inherently aware of the Parquet file structure, and strives to store things byte-for-byte identical so we don't have to deal with parsers, malformed files, new formats, etc.
>>
>> I see -- thank you -- this is the key detail I didn't understand.
>>
>> I wonder if you could apply some normalization to the file prior to deduplicating them (aka could you update your hash calculation so it zero'ed out the relative offsets in a parquet file before checking for equality?) That would require applying some special case based on file format, but the code is likely relatively simple.
>>
>> On Wed, Oct 9, 2024 at 9:05 PM Yucheng Low <[email protected]> wrote:
>>
>>> Hi Andrew! Have not seen you in a while!
>>>
>>> Back on topic,
>>>
>>> The deduplication procedure we are using is file-type independent and simply chunks the file into variable-sized chunks averaging ~64KB. It is not inherently aware of the Parquet file structure, and strives to store things byte-for-byte identical so we don't have to deal with parsers, malformed files, new formats, etc.
>>>
>>> Also, we operate (like git) on a snapshot basis. We are not storing information about how a file changed, as we do not have that information, nor do we want to try to derive it. If we know the operations that changed the file, Iceberg will be the ideal solution, I imagine. As such we need to try to identify common byte sequences which already exist "somewhere" in our system and dedupe accordingly. In a sense, what we are trying to do is orthogonal to Iceberg. (Deltas vs snapshots.)
>>>
>>> However, the "file_offset" fields in RowGroup and ColumnChunk are not position independent in the file and so result in significant fragmentation, and for files with small row groups, poor deduplication.
>>>
>>> We could of course implement a specialized format pre-process to modify those offsets for storage purposes, but in my mind that is probably remarkably difficult to make resilient, given that the goal is byte-for-byte identical storage.
>>>
>>> While we may just have to accept it for the current Parquet format (we have some tricks to deal with fragmentation), if there are plans on updating the Parquet format, addressing the issue at the format layer (switching all absolute offsets to relative) would be great and may have future benefits here as well.
>>>
>>> Thanks,
>>> Yucheng
>>>
>>>> On Oct 9, 2024, at 5:38 PM, Andrew Lamb <[email protected]> wrote:
>>>>
>>>> I am sorry for the likely dumb question, but I think I am missing something.
>>>>
>>>> The blog post says "This means that any modification is likely to rewrite all the Column headers."
>>>>
>>>> But my understanding of the parquet format is that the ColumnChunks[1] are stored inline with the RowGroups which are stored in the footer.
>>>>
>>>> Thus I would expect that a parquet deduplication process could copy the data for each row group memcpy style, and write a new footer with updated offsets. This doesn't require rewriting the entire file, simply adjusting offsets and writing a new footer.
>>>>
>>>> Andrew
>>>>
>>>> [1] https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L918
>>>>
>>>> On Wed, Oct 9, 2024 at 1:51 PM Yucheng Low <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am the author of the blog here!
>>>>> Happy to answer any questions.
>>>>>
>>>>> There are a couple of parts: one is regarding relative pointers, and a second is the row group chunking system (which for performance purposes could benefit from being implemented in the C/C++ layer). I am happy to help where I can with the latter, as that can be done with the current Parquet version too.
>>>>>
>>>>> Thanks,
>>>>> Yucheng
>>>>>
>>>>> On 2024/10/09 15:46:01 Julien Le Dem wrote:
>>>>>> I recommended to them that they join the dev list. I think that's the easiest way to discuss.
>>>>>> IMO, it's a good goal to have relative pointers in the metadata so that a row group doesn't depend on where it is in a file.
>>>>>> It looks like some aspects of making the data updates more incremental could leverage Iceberg.
>>>>>>
>>>>>> On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <[email protected]> wrote:
>>>>>>
>>>>>>> I have a contact at Hugging Face who actually notified me of the blog post. I can transmit any questions or suggestions if desired.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>>
>>>>>>> On Wed, 9 Oct 2024 11:50:51 +0100 Steve Loughran <[email protected]> wrote:
>>>>>>>> flatbuffer would be the obvious place; there would be no compatibility issues with existing readers.
>>>>>>>>
>>>>>>>> Also: that looks like a large amount of information to capture statistics on. Has anyone approached them yet?
>>>>>>>>
>>>>>>>> On Wed, 9 Oct 2024 at 03:39, Gang Wu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks Antoine for sharing the blog post!
>>>>>>>>>
>>>>>>>>> I skimmed it quickly and it seems that the main issue is the absolute file offset used by the metadata of page and column chunk. It may take a long time to migrate if we want to replace them with relative offsets in the current thrift definition. Perhaps it is a good chance to improve this with the current flatbuffer experiment?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Gang
>>>>>>>>>
>>>>>>>>> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> The Hugging Face developers published this insightful blog post about their attempts to deduplicate Parquet files when they have similar contents. They offer a couple of suggestions for improvement at the end:
>>>>>>>>>> https://huggingface.co/blog/improve_parquet_dedupe
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Antoine.
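P.S. For (1) above, here is a minimal sketch of the kind of content-defined boundary rule I mean, in the same spirit as the ~64KB dedup chunker described earlier in the thread. The rolling hash, the constants, and the chunk_boundaries helper name are all illustrative only, not what our production system actually uses:

    # Toy content-defined chunker: boundaries are chosen from the bytes
    # themselves, so identical byte runs chunk identically no matter where
    # they sit in the file. Real implementations (e.g. gear/FastCDC-style)
    # also enforce a maximum chunk size and use a precomputed byte table.
    def chunk_boundaries(data: bytes, avg_size: int = 64 * 1024,
                         min_size: int = 16 * 1024):
        mask = avg_size - 1          # boundary when the low bits of the hash are zero
        boundaries, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFFFFFFFFFF
            if i - start >= min_size and (h & mask) == 0:
                boundaries.append((start, i + 1))
                start, h = i + 1, 0
        if start < len(data):
            boundaries.append((start, len(data)))   # tail chunk
        return boundaries

The same kind of boundary rule could be applied to row content inside the writer to decide where to cut row groups, so that inserting or deleting rows does not shift every subsequent row group boundary.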
