Hello Parquet community,

Unstructured data ingestion is becoming increasingly popular with the advances in Generative AI. Today, the only ways to handle unstructured data in Parquet are to store it as a byte array, or to store a string that points to a file in some object store. Neither approach serves these use cases well, because of scalability, usability, and governance issues.

We would like to introduce a new logical type annotation in Parquet called “File” for storing a struct that contains a path reference to a file along with additional metadata. We propose that the struct contain the following fields:

    path          STRING NOT NULL  -- the opaque path to a file
    size          INT64            -- the size of the file in bytes
    content_type  STRING           -- the MIME/content type of the file
    etag          STRING           -- the eTag identifier of the file. Can be
                                   -- used to detect changes to a file

The path will be stored as an opaque string, exactly as the user provides it; we do not apply any special encoding to it. The size is the size of the file in bytes, stored as a long. We also store the content_type of the file and its etag.

We believe that this set of fields is bare-bones and can easily be extended in the future, if desired, with new optional fields that would not impact the correctness of the file being read.

We would also like to introduce a versioning field in the specification, in case we need new fields that may impact correctness when accessing a file. We would represent this in parquet.thrift <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift> as:

    /**
     * File logical type annotation
     */
    struct FileType {
      // Versioning specification of the File struct contents. Can be used if a
      // new field is introduced to the struct representing the file, which may
      // impact correctness when accessing the file.
      1: optional i8 specification_version
    }

We believe that natively supporting File references in Parquet will make it much simpler to build AI workloads on top of data stored in Parquet across table formats and data processing engines.

Looking forward to your feedback!
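
P.S. To make the proposed struct a little more concrete, here is a rough Python sketch of how a reader might model a File value in memory and use the etag to detect changes. This is only an illustration under stated assumptions: the names FileRef and has_changed, the example path, and the non-negative check on size are all hypothetical and are not part of the proposal or of any Parquet library.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory mirror of the proposed File struct."""
    path: str                           # opaque string, stored exactly as provided
    size: Optional[int] = None          # size of the file in bytes (INT64)
    content_type: Optional[str] = None  # MIME/content type of the file
    etag: Optional[str] = None          # eTag identifier of the file

    def __post_init__(self):
        # path is the only NOT NULL field in the proposed struct.
        if self.path is None:
            raise ValueError("path must not be null")
        # Assumption: a size, when present, is a non-negative signed 64-bit value.
        if self.size is not None and not (0 <= self.size < 2**63):
            raise ValueError("size must be a non-negative INT64")


def has_changed(stored: FileRef, current_etag: Optional[str]) -> Optional[bool]:
    """Compare the stored eTag with a freshly observed one.

    Returns None when either eTag is missing, since no decision can be made.
    """
    if stored.etag is None or current_etag is None:
        return None
    return stored.etag != current_etag


# Hypothetical example values; the path is never parsed or re-encoded.
ref = FileRef("s3://example-bucket/images/cat.png", 48213, "image/png", "abc123")
print(asdict(ref))
print(has_changed(ref, "abc123"))  # same eTag, so the file is treated as unchanged
print(has_changed(ref, "def456"))  # different eTag, so the file is treated as changed
```

Because size, content_type, and etag are all optional in the proposal, the sketch returns None rather than guessing when no eTag is available on either side.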
