Hi all,

Thank you all for the great discussion on the document! I made another pass on the doc. During the Parquet sync, there was alignment around keeping the File field as simple and minimal as possible. I updated the doc accordingly (removed content_type from the field) so that every remaining field is a functional one, needed to read a file correctly.

Please let me know if you have more feedback! If there are no strong arguments against the current proposal, may I follow up with a pull request to apache/parquet-format <https://github.com/apache/parquet-format>? What would be the next steps? Or would I need to start a vote first?

Thanks,
Burak

On Wed, May 27, 2026 at 10:31 AM Burak Yavuz <[email protected]> wrote:

> Hello all,
>
> I'm sharing the design document for File Type here
> <https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?usp=sharing>.
> Please let me know what you think!
> Wanted to thank Micah Kornfield, Divjot Arora, and Daniel Weeks for their feedback working on this document.
>
> Steve, regarding your questions, my thoughts are inline:
>
> > 1. small inline blob somewhere within the parquet file (|data| = bytes)
>
> We have a lot of design options here. Does it need to be part of "File"? That's debatable. Engines/table formats can decide to coalesce a File reference with an inline value when available for example. Carrying an inline binary blob may make analytics workloads more inefficient, specifically if you have to carry them around as baggage through sorts and shuffles.
>
> > 2. Medium blob: data stored range limited within a larger file (|data| = kilo to megabytes)
>
> Again, can be up to a table format to decide creating sidecar files, where the sidecar may be built on top of these file references.
>
> > 3. completely separate file (GB +), or somehow the data lifecycle isn't managed with parquet file.
>
> This file reference solves this problem as well.
>
> > lifecycle management you don't want to discover that your photo collection has been deleted by accident, and a data rewrite such as applying DVs shouldn't mandate rebuilding of external binary files.
> > security, esp when providing credential access to tables. Credential providers would also need to provide file access, so have to know which binary files are associated with parquet files, somehow.
>
> These all sound like problems that should be handled at different layers of:
> - table format
> - engine
> - catalog
> to me.
>
> Looking forward to your feedback! Also @Antoine, I put in a blurb around the extension framework in there. Would love your thoughts on that.
>
> Best,
> Burak
>
> On Wed, May 27, 2026 at 3:09 AM Steve Loughran <[email protected]> wrote:
>
> > I do think FILE would be good, even though it gets complicated fast.
> >
> > It'd be good to support all of
> >
> > 1. small inline blob somewhere within the parquet file (|data| = bytes)
> > 2. Medium blob: data stored range limited within a larger file (|data| = kilo to megabytes)
> > 3. completely separate file (GB +), or somehow the data lifecycle isn't managed with parquet file.
> >
> > Issues I can see
> >
> > - lifecycle management you don't want to discover that your photo collection has been deleted by accident, and a data rewrite such as applying DVs shouldn't mandate rebuilding of external binary files.
> > - security, esp when providing credential access to tables. Credential providers would also need to provide file access, so have to know which binary files are associated with parquet files, somehow.
> >
> > What have other formats done here?
> >
> > On Thu, 21 May 2026 at 22:13, Ryan Blue <[email protected]> wrote:
> >
> > > For some reason, the original email never came through for me. This thread starts with Rahil's email. In case other people are having the same problem, here's the thread Burak is talking about:
> > > https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
> > >
> > > Ryan
> > >
> > > On Thu, May 21, 2026 at 1:30 PM Burak Yavuz <[email protected]> wrote:
> > >
> > > > I'll share something early next week. The original proposal is in the first email in this thread.
> > > >
> > > > Best,
> > > > Burak
> > > >
> > > > On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]> wrote:
> > > >
> > > > > Do we have a proposal for this yet? I'm excited to go over it and I thought one was mentioned in the last sync but I haven't seen it.
> > > > >
> > > > > On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Very sorry for the late reply, and thanks for the questions! The messages were not landing in my inbox properly.
> > > > > >
> > > > > > @Antoine
> > > > > > > I feel like this is the kind of use case where a hypothetical extension type mechanism would be a better fit than hardcoding dedicated logical types in the Thrift definition.
> > > > > >
> > > > > > What would that look like? We wanted to introduce this logical type to Parquet specifically, so that table formats such as Delta and Iceberg can have a simpler protocol change, and that we could provide this as a consistent format across multiple data processing engines.
> > > > > >
> > > > > > @Rahil
> > > > > > > I wanted to better understand one point. Based on the current spec you shared I see you have a parameter for the following:
> > > > > > > > size INT64 -- the size of the file in bytes
> > > > > > > Are you proposing that the "File" type always writes the binary content of something such as an image or video directly within the Parquet file (i.e., "inlining")? Or would it make sense for the spec to have some field distinguishing whether to store the content's bytes in the file itself vs simply track a pointer to the actual file in storage (i.e., keeping it "out of line").
> > > > > >
> > > > > > This is a great question. When it comes to FileType, the data will primarily be external to the parquet file, so the FileType would just store the pointer to the data.
> > > > > > Now, can that data be inlined anyway? That is an optimization that can certainly be done. However, that requires some benchmarks to see how much the benefit would be.
> > > > > > If compute engines were to carry this struct without any column pruning across all operations, having inline binary content would make operations like sorting and shuffling a lot more expensive.
> > > > > > We couldn't instinctively justify whether this would be worth it just yet.
> > > > > > However, the current proposed spec doesn't prevent you from also storing the content inline side by side with the pointer information.
> > > > > >
> > > > > > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Burak,
> > > > > > >
> > > > > > > Thanks for starting this discussion. I was also interested in raising this topic within the Parquet community (unless it has already been discussed in the past).
> > > > > > > For users working with unstructured data today such as large text, images, or video, a data type such as a "file" or "blob" would be useful.
> > > > > > >
> > > > > > > I wanted to better understand one point. Based on the current spec you shared I see you have a parameter for the following:
> > > > > > > > size INT64 -- the size of the file in bytes
> > > > > > >
> > > > > > > Are you proposing that the "File" type always writes the binary content of something such as an image or video directly within the Parquet file (i.e., "inlining")? Or would it make sense for the spec to have some field distinguishing whether to store the content's bytes in the file itself vs simply track a pointer to the actual file in storage (i.e., keeping it "out of line"). I would assume there are use cases where you would want to store the binary content of something, like a small image within the Parquet file instead of storing a pointer to a large video file in object storage.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Rahil Chertara
> > > > > > >
> > > > > > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I feel like this is the kind of use case where a hypothetical extension type mechanism would be a better fit than hardcoding dedicated logical types in the Thrift definition.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > > >
> > > > > > > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> > > > > > > > > Hello Parquet community,
> > > > > > > > >
> > > > > > > > > Unstructured data ingestion is getting extremely popular with the advances in Generative AI. Today, our only means of dealing with unstructured data is to store it as a byte array inside Parquet, or point to files that exist in some object store with a string. These solutions fail to address these use cases, because of scalability, usability, and governance issues.
> > > > > > > > >
> > > > > > > > > We would like to introduce a new logical type annotation in Parquet called “File” for storing a struct that contains a path reference to a file with additional metadata.
> > > > > > > > >
> > > > > > > > > We propose that the struct contains the following fields:
> > > > > > > > >
> > > > > > > > >   path STRING NOT NULL -- the opaque path to a file
> > > > > > > > >   size INT64 -- the size of the file in bytes
> > > > > > > > >   content_type STRING -- the mime/content type of the file
> > > > > > > > >   etag STRING -- the eTag identifier of the file. Can be used to detect changes to a file
> > > > > > > > >
> > > > > > > > > The path will be stored as an opaque string; whatever the user provides. We don’t do any special encoding on it. The size will be the size of the file in bytes as long. We also store the content_type of the file, and its etag.
> > > > > > > > >
> > > > > > > > > We believe that this set of options is bare-bones and can be easily extended by new optional fields in the future if desired that wouldn’t impact the correctness of the file being read. We would like to introduce a versioning field to the specification in case we need new fields in the specification that may impact correctness, when accessing a file.
> > > > > > > > >
> > > > > > > > > We would represent this in parquet.thrift
> > > > > > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > > > > > > > as:
> > > > > > > > >
> > > > > > > > > /**
> > > > > > > > >  * File logical type annotation
> > > > > > > > >  */
> > > > > > > > > struct FileType {
> > > > > > > > >   // Versioning specification of the File struct contents. Can be used if a new field is introduced to the
> > > > > > > > >   // struct representing the file, which may impact correctness when accessing the file.
> > > > > > > > >   1: optional i8 specification_version
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > We believe that by natively supporting File references in Parquet, it will become much simpler to build AI workloads on top of data stored in Parquet across table formats and data processing engines. Looking forward to your feedback!
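
For readers following the thread, the struct behind the proposed File annotation can be sketched roughly as below. This is only an illustrative Python sketch and not part of the proposal or of any Parquet library: it assumes the fields listed in the original message (path, size, etag) with content_type dropped as described in the latest reply, and the names FileRef and has_changed, the example path, and the rule of skipping unknown values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory view of the proposed File struct."""
    path: str                   # opaque path, stored exactly as provided (NOT NULL)
    size: Optional[int] = None  # size of the file in bytes (INT64), optional
    etag: Optional[str] = None  # eTag identifier of the file, optional


def has_changed(ref: FileRef,
                current_etag: Optional[str],
                current_size: Optional[int]) -> bool:
    """Return True if the stored reference no longer matches the file's metadata.

    Assumption: a value missing on either side is treated as unknown and skipped.
    """
    if ref.etag is not None and current_etag is not None and ref.etag != current_etag:
        return True
    if ref.size is not None and current_size is not None and ref.size != current_size:
        return True
    return False


# Hypothetical example values, for illustration only.
ref = FileRef(path="s3://bucket/images/cat.jpg", size=1024, etag="abc123")
print(has_changed(ref, current_etag="abc123", current_size=1024))  # False
print(has_changed(ref, current_etag="def456", current_size=1024))  # True
```

The sketch keeps only the pointer and its metadata, matching the thread's point that the data primarily lives outside the Parquet file; whether and how a reader validates the etag is left to the engine or table format.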
