Hi Sivabalan,

Thanks for your response. The metadata we need to store is indeed per-file,
and it is used primarily during reads. Currently, we store it in the
extraMetadata field of the commit files, but this approach requires reading
both the active and archived timelines to extract the information during
reads.

We are exploring a solution where the metadata is stored in the
MetadataTable for faster retrieval. This would also align with Hudi's
internals, as the MetadataTable is primarily used for storing indexes and
other metadata-related information.

In our solution, we would aim to extend the metadata table schema and
include something like this:

|-- qbeastMetadata: struct (nullable = true)
    |-- fileName: string (nullable = false)
    |-- revision: integer (nullable = false)
    |-- blocks: struct (nullable = false)
        |-- id: integer (nullable = false)
        |-- min: integer (nullable = false)
        |-- max: integer (nullable = false)
        |-- elementCount: integer (nullable = false)
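
To make the proposal more concrete, here is a rough sketch of how the tree
above might be written as an Avro field. The record names (QbeastMetadata,
QbeastBlock) and the null default are our own assumptions for illustration,
not something taken from HoodieMetadata.avsc. The small Python script only
builds the JSON and prints it, so it can be pasted and adapted.

```python
import json


def qbeast_metadata_field():
    """Build a hypothetical Avro field definition for qbeastMetadata."""
    # Nested record for the "blocks" struct (record name is an assumption).
    block = {
        "type": "record",
        "name": "QbeastBlock",
        "fields": [
            {"name": "id", "type": "int"},
            {"name": "min", "type": "int"},
            {"name": "max", "type": "int"},
            {"name": "elementCount", "type": "int"},
        ],
    }
    # Per-file record (record name is an assumption).
    record = {
        "type": "record",
        "name": "QbeastMetadata",
        "fields": [
            {"name": "fileName", "type": "string"},
            {"name": "revision", "type": "int"},
            {"name": "blocks", "type": block},
        ],
    }
    # Nullable field: "null" comes first in the union so that the
    # default value can be null, keeping existing records readable.
    return {
        "name": "qbeastMetadata",
        "type": ["null", record],
        "default": None,
    }


if __name__ == "__main__":
    print(json.dumps(qbeast_metadata_field(), indent=2))
```

Declaring the field as a nullable union with a null default is the usual
Avro way to add a field without breaking records written before the change,
which is why we sketched it that way.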

Looking at the code, it seems that the default schema is defined in the
HoodieMetadata.avsc file (
https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc),
and that the classes that manage the table are generated automatically from
this schema.

Our question is: what would be the proper way to extend the default schema
to include the metadata we need and generate the classes to manage it?

Best regards,

-Josep
