Do we have a proposal for this yet? I'm excited to go over it. I thought one was mentioned in the last sync, but I haven't seen it.

On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:

> Hi all,
>
> Very sorry for the late reply, and thanks for the questions! The messages
> were not landing in my inbox properly.
>
> @Antoine
> > I feel like this is the kind of use case where a hypothetical extension
> > type mechanism would be a better fit than hardcoding dedicated logical
> > types in the Thrift definition.
>
> What would that look like? We wanted to introduce this logical type to
> Parquet specifically, so that table formats such as Delta and Iceberg can
> have a simpler protocol change, and so that we can provide this as a
> consistent format across multiple data processing engines.
>
> @Rahil
> > I wanted to better understand one point. Based on the current spec you
> > shared I see you have a parameter for the following:
> > > size INT64 -- the size of the file in bytes
> > Are you proposing that the "File" type always writes the binary content of
> > something such as an image or video directly within the Parquet file (i.e.,
> > "inlining")? Or would it make sense for the spec to have some field
> > distinguishing whether to store the content's bytes in the file itself vs
> > simply track a pointer to the actual file in storage (i.e., keeping it "out
> > of line").
>
> This is a great question. When it comes to FileType, the data will
> primarily be external to the Parquet file, so the FileType would just store
> the pointer to the data.
> Now, can that data be inlined anyway? That is an optimization that can
> certainly be done. However, it requires some benchmarks to see how much
> the benefit would be.
> If compute engines were to carry this struct without any column pruning
> across all operations, having inline binary content would make operations
> like sorting and shuffling a lot more expensive.
> We couldn't instinctively justify whether this would be worth it just yet.
> However, the current proposed spec doesn't prevent you from also storing
> the content inline side by side with the pointer information.
>
> On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
>
> > Hi Burak,
> >
> > Thanks for starting this discussion. I was also interested in raising this
> > topic within the Parquet community (unless it has already been discussed in
> > the past).
> > For users working with unstructured data today such as large text, images,
> > or video, a data type such as a "file" or "blob" would be useful.
> >
> > I wanted to better understand one point. Based on the current spec you
> > shared I see you have a parameter for the following:
> > > size INT64 -- the size of the file in bytes
> >
> > Are you proposing that the "File" type always writes the binary content of
> > something such as an image or video directly within the Parquet file (i.e.,
> > "inlining")? Or would it make sense for the spec to have some field
> > distinguishing whether to store the content's bytes in the file itself vs
> > simply track a pointer to the actual file in storage (i.e., keeping it "out
> > of line"). I would assume there are use cases where you would want to store
> > the binary content of something, like a small image within the Parquet file
> > instead of storing a pointer to a large video file in object storage.
> >
> > Regards,
> > Rahil Chertara
> >
> > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I feel like this is the kind of use case where a hypothetical extension
> > > type mechanism would be a better fit than hardcoding dedicated logical
> > > types in the Thrift definition.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> > > > Hello Parquet community,
> > > >
> > > > Unstructured data ingestion is getting extremely popular with the advances
> > > > in Generative AI.
> > > > Today, our only means of dealing with unstructured data
> > > > is to store it as a byte array inside Parquet, or point to files that exist
> > > > in some object store with a string. These solutions fail to address these
> > > > use cases, because of scalability, usability, and governance issues.
> > > >
> > > > We would like to introduce a new logical type annotation in Parquet called
> > > > “File” for storing a struct that contains a path reference to a file with
> > > > additional metadata.
> > > >
> > > > We propose that the struct contains the following fields:
> > > >
> > > > path STRING NOT NULL -- the opaque path to a file
> > > >
> > > > size INT64 -- the size of the file in bytes
> > > >
> > > > content_type STRING -- the mime/content type of the file
> > > >
> > > > etag STRING -- the eTag identifier of the file. Can be used to detect
> > > >                -- changes to the file
> > > >
> > > > The path will be stored as an opaque string: whatever the user provides. We
> > > > don’t do any special encoding on it. The size will be the size of the file
> > > > in bytes, as a long. We also store the content_type of the file, and its etag.
> > > >
> > > > We believe that this set of fields is bare-bones and can be easily
> > > > extended in the future, if desired, with new optional fields that wouldn’t
> > > > impact the correctness of the file being read. We would like to introduce a
> > > > versioning field to the specification in case we need new fields in the
> > > > specification that may impact correctness when accessing a file.
> > > >
> > > > We would represent this in parquet.thrift
> > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > > as:
> > > >
> > > > /**
> > > >  * File logical type annotation
> > > >  */
> > > > struct FileType {
> > > >   // Versioning specification of the File struct contents. Can be used if a
> > > >   // new field is introduced to the struct representing the file, which may
> > > >   // impact correctness when accessing the file.
> > > >   1: optional i8 specification_version
> > > > }
> > > >
> > > > We believe that by natively supporting File references in Parquet, it will
> > > > become much simpler to build AI workloads on top of data stored in Parquet
> > > > across table formats and data processing engines. Looking forward to your
> > > > feedback!
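
For readers following the thread, here is a minimal Python sketch of how an engine might carry the proposed struct in memory. It is not part of the proposal and not an existing Parquet API. The field names (path, size, content_type, etag) and specification_version come from the quoted message; the names FileRef, has_changed, can_read, and SUPPORTED_SPEC_VERSION are hypothetical, and the version policy (an absent version is readable, a newer one is rejected) is an assumption, since the thread does not define one.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical: the highest specification_version this reader understands.
SUPPORTED_SPEC_VERSION = 1


@dataclass(frozen=True)
class FileRef:
    """In-memory stand-in for one value of the proposed File struct."""
    path: str                           # opaque path, stored exactly as provided (NOT NULL)
    size: Optional[int] = None          # size of the file in bytes
    content_type: Optional[str] = None  # MIME/content type of the file
    etag: Optional[str] = None          # eTag identifier of the file

    def __post_init__(self):
        # The proposal marks path as NOT NULL; the other fields are nullable.
        if self.path is None:
            raise ValueError("path is NOT NULL in the proposed struct")


def has_changed(ref: FileRef, current_etag: Optional[str]) -> Optional[bool]:
    """Compare the stored etag with the one the store reports now.

    Returns None when either etag is missing, because no change
    can be inferred in that case.
    """
    if ref.etag is None or current_etag is None:
        return None
    return ref.etag != current_etag


def can_read(specification_version: Optional[int]) -> bool:
    """Assumed policy: accept a missing version, reject a newer one."""
    if specification_version is None:
        return True
    return specification_version <= SUPPORTED_SPEC_VERSION


if __name__ == "__main__":
    # Example values are made up for illustration.
    ref = FileRef(path="s3://bucket/images/cat.png", size=20480,
                  content_type="image/png", etag="abc123")
    print(has_changed(ref, "abc123"))  # False: etag is unchanged
    print(has_changed(ref, "def456"))  # True: the file was modified
    print(can_read(2))                 # False: newer than this reader supports
```

Only the pointer and its metadata are modeled here, matching Burak's answer that the content lives primarily outside the Parquet file; an inline payload would be an extra optional field.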
