Hi all,

After updating the document, I didn't get much additional feedback. As the next step, I submitted PRs for the format changes and the reference implementations:
- parquet-format: https://github.com/apache/parquet-format/pull/585
- parquet-java: https://github.com/apache/parquet-java/pull/3608
- arrow-rs: https://github.com/apache/arrow-rs/pull/10109

Looking forward to feedback on these changes as well!

Thanks,
Burak

On Fri, Jun 5, 2026 at 9:01 AM Daniel Weeks <[email protected]> wrote:

> Hey everyone,
>
> I had an action item to follow up and provide more context based on the
> short discussion during the sync (some of it is a recap of what Burak
> already said above).
>
> I don't seem to have access to the video, so I can't provide a timestamped
> link, but I can share the high-level takeaways:
>
> There was a fair bit of discussion back and forth in the doc around some of
> the fields (especially metadata and content_type). In the end, what I feel
> resonated most with everyone is that if we're creating new primitive types,
> we should define them as narrowly as possible (don't include a bunch of
> extra fields with hypothetical use cases). We also looked across other
> implementations, and while there was some variation, Burak's updated
> proposal seems consistent with most of the representations.
>
> If users want to include additional information, it makes more sense to
> carry that information in neighboring fields, as it quickly shifts to more
> specific use cases.
>
> Thanks Burak for the quick turnaround!
>
> -Dan
>
> On Fri, Jun 5, 2026 at 8:37 AM Micah Kornfield <[email protected]>
> wrote:
>
> > > If there are no strong arguments against the current proposal, may I
> > > follow up with a pull request to apache/parquet-format
> > > <https://github.com/apache/parquet-format>? What would be the next
> > > steps? Or would I need to start a vote first?
> >
> > Hi Burak,
> > New feature steps are listed in the format contributors guide [1]. If
> > there are no objections we can move to step 2 (completeness): A PR
> > against parquet-format and updates to the reference implementations
> > (hopefully these are pretty trivial for this case).
> >
> > I think we can probably start the PRs next week to give people a chance
> > to digest the current proposal and speak up if there are hard objections.
> >
> > Cheers,
> > Micah
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format
> >
> > On Fri, Jun 5, 2026 at 8:25 AM Burak Yavuz <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Thank you all for the great discussion on the document! I made another
> > > pass on the doc. During the Parquet sync, there was alignment around
> > > keeping the field as simple and minimalistic as possible. I updated the
> > > doc in that way (removed content_type from the field) to ensure that the
> > > fields available are all functional fields for correctly reading a file.
> > >
> > > Please let me know if you have more feedback!
> > >
> > > If there are no strong arguments against the current proposal, may I
> > > follow up with a pull request to apache/parquet-format
> > > <https://github.com/apache/parquet-format>? What would be the next
> > > steps? Or would I need to start a vote first?
> > >
> > > Thanks,
> > > Burak
> > >
> > > On Wed, May 27, 2026 at 10:31 AM Burak Yavuz <[email protected]> wrote:
> > >
> > > > Hello all,
> > > >
> > > > I'm sharing the design document for File Type here
> > > > <https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?usp=sharing>.
> > > > Please let me know what you think!
> > > > Wanted to thank Micah Kornfield, Divjot Arora, and Daniel Weeks for
> > > > their feedback working on this document.
> > > >
> > > > Steve, regarding your questions, my thoughts are inline:
> > > > > 1. small inline blob somewhere within the parquet file (|data| =
> > > > > bytes)
> > > > We have a lot of design options here. Does it need to be part of
> > > > "File"? That's debatable. Engines/table formats can decide to coalesce
> > > > a File reference with an inline value when available for example.
> > > > Carrying an inline binary blob may make analytics workloads more
> > > > inefficient, specifically if you have to carry them around as baggage
> > > > through sorts and shuffles.
> > > >
> > > > > 2. Medium blob: data stored range limited within a larger file
> > > > > (|data| = kilo to megabytes)
> > > > Again, it can be up to a table format to decide on creating sidecar
> > > > files, where the sidecar may be built on top of these file references.
> > > >
> > > > > 3. completely separate file (GB +), or somehow the data lifecycle
> > > > > isn't managed with parquet file.
> > > >
> > > > This file reference solves this problem as well.
> > > >
> > > > > lifecycle management: you don't want to discover that your photo
> > > > > collection has been deleted by accident, and a data rewrite such as
> > > > > applying DVs shouldn't mandate rebuilding of external binary files.
> > > > > security, esp when providing credential access to tables. Credential
> > > > > providers would also need to provide file access, so have to know
> > > > > which binary files are associated with parquet files, somehow.
> > > >
> > > > These all sound like problems that should be handled at different
> > > > layers of:
> > > > - table format
> > > > - engine
> > > > - catalog
> > > > to me.
> > > >
> > > > Looking forward to your feedback! Also @Antoine, I put in a blurb
> > > > around the extension framework in there. Would love your thoughts on
> > > > that.
> > > >
> > > > Best,
> > > > Burak
> > > >
> > > > On Wed, May 27, 2026 at 3:09 AM Steve Loughran <[email protected]>
> > > > wrote:
> > > >
> > > >> I do think FILE would be good, even though it gets complicated fast.
> > > >>
> > > >> It'd be good to support all of
> > > >>
> > > >>    1. small inline blob somewhere within the parquet file (|data| =
> > > >>    bytes)
> > > >>    2. Medium blob: data stored range limited within a larger file
> > > >>    (|data| = kilo to megabytes)
> > > >>    3. completely separate file (GB +), or somehow the data lifecycle
> > > >>    isn't managed with parquet file.
> > > >>
> > > >> Issues I can see
> > > >>
> > > >>    - lifecycle management: you don't want to discover that your photo
> > > >>    collection has been deleted by accident, and a data rewrite such as
> > > >>    applying DVs shouldn't mandate rebuilding of external binary files.
> > > >>    - security, esp when providing credential access to tables.
> > > >>    Credential providers would also need to provide file access, so
> > > >>    have to know which binary files are associated with parquet files,
> > > >>    somehow.
> > > >>
> > > >> What have other formats done here?
> > > >>
> > > >> On Thu, 21 May 2026 at 22:13, Ryan Blue <[email protected]> wrote:
> > > >>
> > > >> > For some reason, the original email never came through for me. This
> > > >> > thread starts with Rahil's email. In case other people are having
> > > >> > the same problem, here's the thread Burak is talking about:
> > > >> > https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
> > > >> >
> > > >> > Ryan
> > > >> >
> > > >> > On Thu, May 21, 2026 at 1:30 PM Burak Yavuz <[email protected]> wrote:
> > > >> >
> > > >> > > I'll share something early next week. The original proposal is in
> > > >> > > the first email in this thread.
> > > >> > >
> > > >> > > Best,
> > > >> > > Burak
> > > >> > >
> > > >> > > On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Do we have a proposal for this yet? I'm excited to go over it
> > > >> > > > and I thought one was mentioned in the last sync but I haven't
> > > >> > > > seen it.
> > > >> > > >
> > > >> > > > On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hi all,
> > > >> > > > >
> > > >> > > > > Very sorry for the late reply, and thanks for the questions!
> > > >> > > > > The messages were not landing in my inbox properly.
> > > >> > > > >
> > > >> > > > > @Antoine
> > > >> > > > > > I feel like this is the kind of use case where a
> > > >> > > > > > hypothetical extension type mechanism would be a better fit
> > > >> > > > > > than hardcoding dedicated logical types in the Thrift
> > > >> > > > > > definition.
> > > >> > > > >
> > > >> > > > > What would that look like? We wanted to introduce this logical
> > > >> > > > > type to Parquet specifically, so that table formats such as
> > > >> > > > > Delta and Iceberg can have a simpler protocol change, and so
> > > >> > > > > that we could provide this as a consistent format across
> > > >> > > > > multiple data processing engines.
> > > >> > > > >
> > > >> > > > > @Rahil
> > > >> > > > > > I wanted to better understand one point. Based on the current
> > > >> > > > > > spec you shared I see you have a parameter for the following:
> > > >> > > > > > > size INT64 -- the size of the file in bytes
> > > >> > > > > > Are you proposing that the "File" type always writes the
> > > >> > > > > > binary content of something such as an image or video
> > > >> > > > > > directly within the Parquet file (i.e., "inlining")? Or would
> > > >> > > > > > it make sense for the spec to have some field distinguishing
> > > >> > > > > > whether to store the content's bytes in the file itself vs
> > > >> > > > > > simply track a pointer to the actual file in storage (i.e.,
> > > >> > > > > > keeping it "out of line").
> > > >> > > > >
> > > >> > > > > This is a great question. When it comes to FileType, the data
> > > >> > > > > will primarily be external to the parquet file, so the FileType
> > > >> > > > > would just store the pointer to the data.
> > > >> > > > > Now, can that data be inlined anyway? That is an optimization
> > > >> > > > > that can certainly be done. However, that requires some
> > > >> > > > > benchmarks to see how much the benefit would be.
> > > >> > > > > If compute engines were to carry this struct without any column
> > > >> > > > > pruning across all operations, having inline binary content
> > > >> > > > > would make operations like sorting and shuffling a lot more
> > > >> > > > > expensive.
> > > >> > > > > We couldn't instinctively justify whether this would be worth
> > > >> > > > > it just yet.
> > > >> > > > > However, the current proposed spec doesn't prevent you from
> > > >> > > > > also storing the content inline side by side with the pointer
> > > >> > > > > information.
> > > >> > > > >
> > > >> > > > > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
> > > >> > > > >
> > > >> > > > > > Hi Burak,
> > > >> > > > > >
> > > >> > > > > > Thanks for starting this discussion. I was also interested in
> > > >> > > > > > raising this topic within the Parquet community (unless it
> > > >> > > > > > has already been discussed in the past).
> > > >> > > > > > For users working with unstructured data today such as large
> > > >> > > > > > text, images, or video, a data type such as a "file" or "blob"
> > > >> > > > > > would be useful.
> > > >> > > > > >
> > > >> > > > > > I wanted to better understand one point.
> > > >> > > > > > Based on the current spec you shared I see you have a
> > > >> > > > > > parameter for the following:
> > > >> > > > > > > size INT64 -- the size of the file in bytes
> > > >> > > > > >
> > > >> > > > > > Are you proposing that the "File" type always writes the
> > > >> > > > > > binary content of something such as an image or video
> > > >> > > > > > directly within the Parquet file (i.e., "inlining")? Or would
> > > >> > > > > > it make sense for the spec to have some field distinguishing
> > > >> > > > > > whether to store the content's bytes in the file itself vs
> > > >> > > > > > simply track a pointer to the actual file in storage (i.e.,
> > > >> > > > > > keeping it "out of line"). I would assume there are use cases
> > > >> > > > > > where you would want to store the binary content of
> > > >> > > > > > something, like a small image within the Parquet file instead
> > > >> > > > > > of storing a pointer to a large video file in object storage.
> > > >> > > > > >
> > > >> > > > > > Regards,
> > > >> > > > > > Rahil Chertara
> > > >> > > > > >
> > > >> > > > > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]>
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Hello,
> > > >> > > > > > >
> > > >> > > > > > > I feel like this is the kind of use case where a
> > > >> > > > > > > hypothetical extension type mechanism would be a better fit
> > > >> > > > > > > than hardcoding dedicated logical types in the Thrift
> > > >> > > > > > > definition.
> > > >> > > > > > >
> > > >> > > > > > > Regards
> > > >> > > > > > >
> > > >> > > > > > > Antoine.
> > > >> > > > > > >
> > > >> > > > > > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> > > >> > > > > > > > Hello Parquet community,
> > > >> > > > > > > >
> > > >> > > > > > > > Unstructured data ingestion is getting extremely popular
> > > >> > > > > > > > with the advances in Generative AI. Today, our only means
> > > >> > > > > > > > of dealing with unstructured data is to store it as a byte
> > > >> > > > > > > > array inside Parquet, or point to files that exist in some
> > > >> > > > > > > > object store with a string. These solutions fail to address
> > > >> > > > > > > > these use cases, because of scalability, usability, and
> > > >> > > > > > > > governance issues.
> > > >> > > > > > > >
> > > >> > > > > > > > We would like to introduce a new logical type annotation in
> > > >> > > > > > > > Parquet called “File” for storing a struct that contains a
> > > >> > > > > > > > path reference to a file with additional metadata.
> > > >> > > > > > > >
> > > >> > > > > > > > We propose that the struct contains the following fields:
> > > >> > > > > > > >
> > > >> > > > > > > >     path         STRING NOT NULL  -- the opaque path to a file
> > > >> > > > > > > >     size         INT64            -- the size of the file in bytes
> > > >> > > > > > > >     content_type STRING           -- the mime/content type of the file
> > > >> > > > > > > >     etag         STRING           -- the eTag identifier of the file. Can be
> > > >> > > > > > > >                                   -- used to detect changes to a file
> > > >> > > > > > > >
> > > >> > > > > > > > The path will be stored as an opaque string; whatever the
> > > >> > > > > > > > user provides. We don’t do any special encoding on it. The
> > > >> > > > > > > > size will be the size of the file in bytes, stored as a
> > > >> > > > > > > > long. We also store the content_type of the file, and its
> > > >> > > > > > > > etag.
> > > >> > > > > > > >
> > > >> > > > > > > > We believe that this set of options is bare-bones and can
> > > >> > > > > > > > be easily extended in the future, if desired, by new
> > > >> > > > > > > > optional fields that wouldn’t impact the correctness of the
> > > >> > > > > > > > file being read. We would like to introduce a versioning
> > > >> > > > > > > > field to the specification in case we need new fields in
> > > >> > > > > > > > the specification that may impact correctness when
> > > >> > > > > > > > accessing a file.
> > > >> > > > > > > >
> > > >> > > > > > > > We would represent this in parquet.thrift
> > > >> > > > > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > >> > > > > > > > as:
> > > >> > > > > > > >
> > > >> > > > > > > > /**
> > > >> > > > > > > >  * File logical type annotation
> > > >> > > > > > > >  */
> > > >> > > > > > > > struct FileType {
> > > >> > > > > > > >   // Versioning specification of the File struct contents.
> > > >> > > > > > > >   // Can be used if a new field is introduced to the struct
> > > >> > > > > > > >   // representing the file, which may impact correctness
> > > >> > > > > > > >   // when accessing the file.
> > > >> > > > > > > >   1: optional i8 specification_version
> > > >> > > > > > > > }
> > > >> > > > > > > >
> > > >> > > > > > > > We believe that by natively supporting File references in
> > > >> > > > > > > > Parquet, it will become much simpler to build AI workloads
> > > >> > > > > > > > on top of data stored in Parquet across table formats and
> > > >> > > > > > > > data processing engines. Looking forward to your feedback!
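
For readers skimming the thread, the struct described in the original proposal (path, size, content_type, etag) could be modeled roughly as below. This is only an illustrative sketch in Python, not code from the proposal or from any of the linked PRs. The class name `FileRef`, the helper `has_changed`, the example path, and the choice to return `None` when an etag is missing are all assumptions made here. `content_type` is kept because it appears in the original field list, although later messages in the thread say it was removed from the design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory mirror of the proposed File struct."""

    path: str                           # opaque path, stored exactly as provided (NOT NULL)
    size: Optional[int] = None          # size of the file in bytes (INT64)
    content_type: Optional[str] = None  # MIME/content type (original proposal only)
    etag: Optional[str] = None          # eTag identifier of the file

    def __post_init__(self) -> None:
        # The proposal marks path as NOT NULL; the other fields are optional.
        if self.path is None:
            raise ValueError("path must not be null")
        # Assumption: a size, when present, is a non-negative signed 64-bit value.
        if self.size is not None and not (0 <= self.size < 2**63):
            raise ValueError("size must be a non-negative INT64")


def has_changed(ref: FileRef, current_etag: Optional[str]) -> Optional[bool]:
    """Compare the stored etag with a freshly observed one.

    Returns None when either etag is missing, because no conclusion
    can be drawn in that case (an assumption of this sketch).
    """
    if ref.etag is None or current_etag is None:
        return None
    return ref.etag != current_etag


# Hypothetical example values, not taken from the thread.
ref = FileRef(path="s3://bucket/images/cat.png", size=1024, etag="abc123")
print(has_changed(ref, "abc123"))  # False: the etag is unchanged
print(has_changed(ref, "def456"))  # True: the etag differs
```

Only the pointer and its metadata are modeled here, matching the point made in the thread that the file content primarily lives outside the Parquet file.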
