Hi Fokko,

So, if I understand correctly, you have a small number of key-value
metadata entries, but the values may be large?

Also, you actually need those metadata values to do anything with the
data (because they tell you the actual Iceberg schema), so on-demand
decoding of these values would probably not help you?

(I'm not sure large string values are a problem with Thrift; I would
hope not)

Regards

Antoine.


On Thu, 16 May 2024 22:45:02 +0200
Fokko Driesprong <fo...@apache.org> wrote:
> Hey Antoine,
> 
> First of all, love the recent uptick in activity on the Parquet side. I'm
> on holiday, but I'll definitely catch up when I return.
> 
> I wanted to respond to this particular mail since we do store various
> fields in the metadata for Apache Iceberg. For example:
> 
>    - The JSON-serialized Iceberg schema that was used when writing the
>    file:
>    https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L274
>    - In the case of delete files, we write the kind of file (positional or
>    equality), and in the case of equality, also the field IDs:
>    https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L905-L910
> 
> This metadata is mostly set for debugging purposes and is not part of the
> official Iceberg spec. The schema could become quite big, as its size is
> proportional to the number of columns.
> 
> I hope this helps!
> 
> Kind regards,
> Fokko
> 
> Op do 16 mei 2024 om 21:17 schreef Antoine Pitrou <anto...@python.org>:
> 
> >
> > Hello,
> >
> > In https://github.com/apache/parquet-format/pull/242 the question came up
> > of the size and overhead of key-value metadata entries in real-world
> > Parquet files (basically, user-defined metadata attached either to the
> > entire file or to individual columns). Do people have insight to share
> > about the typical number of metadata entries in a file or column, and
> > their typical byte size?
> >
> > Regards
> >
> > Antoine.
> >
> >
> >  
> 


