Hi Fokko,

I am aware of catalogs such as Iceberg; my question was whether, in the design
of parquet, we can assume the existence of such a catalog.

Kind Regards,

Raphael

On 18 May 2024 16:18:22 BST, Fokko Driesprong <fo...@apache.org> wrote:
>Hey Raphael,
>
>Thanks for reaching out here. Have you looked into table formats such as Apache
>Iceberg <https://iceberg.apache.org/docs/nightly/>? This seems to fix the
>problem that you're describing.
>
>A table format adds an ACID layer to the file format and acts as a fully
>functional database. In the case of Iceberg, a catalog is required for
>atomicity, and alternatives like Delta Lake also seem to be trending in that
>direction
><https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
>
>> I'm conscious that for many users this responsibility is instead delegated
>> to a catalog that maintains its own index structures and statistics, only
>> relies on the parquet metadata for very late stage pruning, and may
>> therefore see limited benefit from revisiting the parquet metadata
>> structures.
>
>
>This is exactly what Iceberg offers; it provides additional metadata to
>speed up the planning process:
>https://iceberg.apache.org/docs/nightly/performance/
>
>Kind regards,
>Fokko
>
>On Sat, 18 May 2024 at 16:40, Raphael Taylor-Davies
><r.taylordav...@googlemail.com.invalid> wrote:
>
>> Hi All,
>>
>> The recent discussions about metadata make me wonder where a storage
>> format ends and a database begins, as people seem to have differing
>> expectations of parquet here. In particular, one school of thought
>> posits that parquet should suffice as a standalone technology, where
>> users can write parquet files to a store and efficiently query them
>> directly with no additional technologies. However, others instead view
>> parquet as a storage format for use in conjunction with some sort of
>> catalog / metastore. These two approaches naturally place very different
>> demands on the parquet format. The former case incentivizes constructing
>> extremely large parquet files, potentially on the order of TBs [1], such
>> that the parquet metadata alone can efficiently be used to service a
>> query without lots of random I/O to separate files. However, the latter
>> case incentivizes relatively small parquet files (< 1GB) laid out in
>> such a way that the catalog metadata can be used to efficiently identify
>> a much smaller set of files for a given query, and write amplification
>> can be avoided for inserts.
>>
>> Having only ever used parquet in the context of data lake style systems,
>> the catalog approach comes more naturally to me and plays to parquet's
>> current strengths; however, this does not seem to be a universally held
>> expectation. I've frequently found people surprised when queries
>> performed in the absence of a catalog are slow, or who wish to
>> efficiently mutate or append to parquet files in place [2] [3] [4]. This
>> is possibly anecdotal, but these expectations seem to be more common
>> where people are coming from Python-based tooling such as pandas, and
>> might reflect weaker tooling support for catalog systems in this ecosystem.
>>
>> Regardless, this mismatch appears to be at the core of at least some of
>> the discussions about metadata. I do not think it a controversial take
>> that the current metadata structures are simply not set up for files on
>> the order of >1TB, where the metadata balloons to 10s or 100s of MB and
>> takes 10s of milliseconds just to parse. If this is in scope it would
>> justify major changes to the parquet metadata; however, I'm conscious
>> that for many users this responsibility is instead delegated to a
>> catalog that maintains its own index structures and statistics, only
>> relies on the parquet metadata for very late stage pruning, and may
>> therefore see limited benefit from revisiting the parquet metadata
>> structures.
>>
>> I'd be very interested to hear other people's thoughts on this.
>>
>> Kind Regards,
>>
>> Raphael
>>
>> [1]: https://github.com/apache/arrow-rs/issues/5770
>> [2]: https://github.com/apache/datafusion/issues/9654
>> [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
>> [4]: https://github.com/apache/arrow-rs/issues/557
>>
>>
