Re: Interest in Parquet V3

2024-05-28 Thread Micah Kornfield
Hi Jan, > However, this whole discussion is moot if we in the end decide we do not > want extensibility in the first place. So, do we want it? And if so, what > extension points do we want? Should we create a separate discussion thread > for this as well? Great questions. I started a new

Re: Interest in Parquet V3

2024-05-28 Thread Jan Finis
Thanks Micah for driving the effort! One thing that I didn't see in your 3 points is a discussion about extensibility, and, depending on whether we will be quite extensible, a way to specify which features are used by a Parquet file. Somewhere down the discussion, it was mentioned that we can

Re: Interest in Parquet V3

2024-05-27 Thread Micah Kornfield
Hi Everyone, Just to follow up, conversations on the summary doc [1] have largely slowed down. In my mind I think of roughly three different tracks, and I'll start threads to get a sense of who is interested (please be on the lookout for discussion threads). I think as those conversations branch

Re: Interest in Parquet V3

2024-05-22 Thread Weston Pace
> *A row group can consist of one or many column chunks per column* (while > before it could consist of only one)*.* Yes, that would work. It's not a breaking change and makes good sense as an evolution. I was talking with Jacques Nadeau earlier and he mentioned another solution, which is to

Re: Interest in Parquet V3

2024-05-22 Thread Jan Finis
Thanks Weston for the answers, very insightful! I appreciate your input very much. Some follow ups :): My point is that one of > these (compression chunk size) is a file format concern and the other one > (encoding size) is an encoding concern. The fact that Lance V2 still can have multiple

Re: Interest in Parquet V3

2024-05-22 Thread Steve Loughran
On Tue, 21 May 2024 at 22:40, Jan Finis wrote: > Thanks Weston for posting here! > > I appreciate this a lot, as it gives us the opportunity to discuss modern > formats in depth with the authors themselves, who probably know the design > trade-offs they took best and thus can give us a deeper

Re: Interest in Parquet V3

2024-05-21 Thread Weston Pace
> My point is that one of these (compression chunk size) is a file format concern and the other one (encoding size) is an encoding concern. Slight typo :) I mean "page size" is a file format concern and "compression chunk size" is an encoding concern. On Tue, May 21, 2024 at 6:40 PM Weston Pace

Re: Interest in Parquet V3

2024-05-21 Thread Weston Pace
Thank you for your questions! I think your understanding is very solid. > Do I understand correctly that you basically replace row groups with > files. Thus, the task for reading row groups in parallel boils down to > reading files in parallel. Partly. I recommend files for inter-process

Re: Interest in Parquet V3

2024-05-21 Thread Jan Finis
Thanks Weston for posting here! I appreciate this a lot, as it gives us the opportunity to discuss modern formats in depth with the authors themselves, who probably know the design trade-offs they took best and thus can give us a deeper understanding of what certain features would mean for Parquet.

Re: Interest in Parquet V3

2024-05-21 Thread Weston Pace
As the author of one of these new formats I'll chime in. The main issues I have with parquet are: A. Pages in a column chunk must be contiguous (this is Lance's biggest issue with parquet) B. Encodings should be extensible C. Flexibility in what is considered data / metadata I outline my

Re: Interest in Parquet V3

2024-05-21 Thread lukas nalezenec
I am also in. I would focus on making Parquet more compatible – we have had this issue from the beginning. There shouldn't be a reason to have tools generate different flavors of the format. Lukas On Mon, 20 May 2024 at 20:06, Parth Chandra wrote: > Hi Parquet team, > > It is very

Re: Interest in Parquet V3

2024-05-20 Thread Parth Chandra
Hi Parquet team, It is very exciting to see this effort. Thanks Micah for starting this. For most use cases that our team sees, the broad areas for improvement appear to be: 1) Optimizing for cloud storage (latency is high, seeks are expensive) 2) Optimized metadata reading - we've seen

Re: Interest in Parquet V3

2024-05-19 Thread Xinli shang
Sorry I am late to the party! It's great to see this discussion! Thank you everyone for the many good points and thank you, Micah, for starting the discussion and putting it together into a document, which is very helpful! I agree with most of the points we discussed above, and we need to improve

Re: Interest in Parquet V3

2024-05-17 Thread Rok Mihevc
Hi all, I've discussed with my colleagues and we would dedicate two engineers for 4-6 months on tasks related to implementing the format changes. We're already active in design discussions and can help with C++, Rust and C# implementations. I thought it'd be good to state this explicitly FWIW.

Re: Interest in Parquet V3

2024-05-16 Thread Antoine Pitrou
Hi Wes, On Wed, 15 May 2024 18:56:42 -0500 Wes McKinney wrote: > -- I am not sure how you fully make this problem go away in generality > without doing away with Thrift at the footer level, but at that point you > are making such a disruptive change that why not try to fix some other >

Re: Interest in Parquet V3

2024-05-16 Thread Edward Seidl
the finish line. Cheers, Ed [1] https://github.com/apache/parquet-format/pull/197 From: Julien Le Dem Sent: Wednesday, May 15, 2024 9:23 PM To: dev@parquet.apache.org Cc: d...@parquet.incubator.apache.org Subject: Re: Interest in Parquet V3 Thank you Wes

Re: Interest in Parquet V3

2024-05-15 Thread Micah Kornfield
The conversation seems to be going ahead very quickly. I tried to summarize some of the points at: https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit I had wanted to take more time to refine some ideas, but given the pace of the conversation, I thought it

Re: Interest in Parquet V3

2024-05-15 Thread Julien Le Dem
Thank you Wes for the great summary and Jan for the thoughtful reply. I think those are very valid points and areas for improvement. There is a clear pattern on a few areas that IMO we can work on building a consensus on independently: - metadata: an easier way to read metadata that doesn't require

Re: Interest in Parquet V3

2024-05-15 Thread Jan Finis
Thanks for bringing up this topic! This is an important topic to me and my team, as we maintain a proprietary implementation of Parquet in addition to our own proprietary format [1] that was designed around the same time as Parquet, so we always had comparisons between formats. I also had

Re: Interest in Parquet V3

2024-05-15 Thread Wes McKinney
hi all, Just to add some of my perspective (and I would like to write up some longer form thoughts since I've been collaborating / talking with the Nimble and Lance folks -- and as a result I know a lot about the details of Nimble, BtrBlocks, and also the recent Bullion research format from

Re: Interest in Parquet V3

2024-05-15 Thread Steve Loughran
On Tue, 14 May 2024 at 17:48, Julien Le Dem wrote: > +1 on Micah starting a doc and following up by commenting in it. > +maybe some conf call where people of interest can talk about it. > > @Raphael, Wish Maple: agreed that changing the metadata representation is > less important. Most

Re: Interest in Parquet V3

2024-05-14 Thread Gang Wu
> I would hazard that simply storing statistics separately might > be sufficient for the wide column use-cases, without requiring > switching to something like flatbuffers? I agree with Raphael. Column chunks and pages can be referenced by offset and length. To avoid compatibility issues, we can

Re: Interest in Parquet V3

2024-05-14 Thread Martin Loncaric
I think Parquet's metadata and encoding/compression setup are problematic, but I don't see a reason to make Parquet V3 if it's just going to be another BtrBlocks or Nimble look-alike. Some people in the thread have expressed the view that Parquet's metadata is fine, and that people can achieve

Re: Interest in Parquet V3

2024-05-14 Thread Julien Le Dem
+1 on Micah starting a doc and following up by commenting in it. @Raphael, Wish Maple: agreed that changing the metadata representation is less important. Most engines can externalize and index metadata in some way. It is an option to propose a standard way to do it without changing the format.

Re: Interest in Parquet V3

2024-05-14 Thread Antoine Pitrou
On Mon, 13 May 2024 16:10:24 +0100 Raphael Taylor-Davies wrote: > > I guess I wonder if rather than having a parquet format version 2, or > even a parquet format version 3, we could just document what features a > given parquet implementation actually supports. I believe Andrew intends > to

Re: Interest in Parquet V3

2024-05-14 Thread Micah Kornfield
Thanks everyone for their perspectives. I think as a concrete next step, I'll try to pull together a Google doc that covers the topics covered here as I think that might be a more productive way to further the conversation (I don't want threads to get split too much). On Tue, May 14, 2024 at

Re: Interest in Parquet V3

2024-05-14 Thread wish maple
I also think most of the proposed benefits from these new formats can be achieved using the current parquet format and improved implementations. My concern is that: 1. For encoding, though many interesting encodings have been introduced, most implementations now just use and implement PLAIN and

Re: Interest in Parquet V3

2024-05-14 Thread Raphael Taylor-Davies
Just to double check we're all on the same page w.r.t metadata, I presume we're referring to FileMetadata [1]? If so this contains information on the schema and locations of the column chunks. All statistics information, including that of column chunks, can be referenced solely by offset and

Re: Interest in Parquet V3

2024-05-14 Thread Steve Loughran
BTW, has everyone read "An Empirical Evaluation of Columnar Storage Formats"? https://arxiv.org/abs/2304.05028 good review of how things could be better with real numbers. Highlights that encoding plugins may be inefficient, based on the ORC experience. w.r.t metadata 1. could the old and

Re: Interest in Parquet V3

2024-05-13 Thread Julien Le Dem
It's great to see this thread. Thank you Micah for facilitating the discussion. my 2cts: 1. I like the idea of having feature checks rather than an absolute version number. I am sorry for the confusion created by the V2 moniker. Those were indeed incremental and backwards compatible additions to

Re: Interest in Parquet V3

2024-05-13 Thread Steve Loughran
call it parquet.ml then. which is what I've had in my head as I was thinking about this last week. as the datatypes and the library uses (GPUs, ...) would be targeted at this. I'd also like a design optimised for high-latency cloud storage where seek sucks but parallel reads are easy, and we can

Re: Interest in Parquet V3

2024-05-13 Thread Micah Kornfield
Thanks everybody for the input. I'll try to summarize some main points and my thoughts below. 1. "V3" branding is problematic and getting adoption is difficult with V2. I agree, we should not lump all potential improvements into a single V3 milestone (I used V3 to indicate that at least some

Re: Interest in Parquet V3

2024-05-13 Thread Ed Seidl
I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one version of the Parquet file format. At its core, the data layout (row groups composed of column chunks composed of Dremel encoded pages) has never changed. Encodings/codecs/structures have been added to that core, but

Re: Interest in Parquet V3

2024-05-13 Thread Curt Hagenlocher
There must be something in the water: Nimble and Lance: The Parquet Killers - by Chris Riccomini (materializedview.io) On Mon, May 13, 2024 at 10:01 AM Rok Mihevc

Re: Interest in Parquet V3

2024-05-13 Thread Rok Mihevc
I would be quite interested in working on data skipping and metadata bottlenecks (points 1. and 2.). On Mon, May 13, 2024 at 5:28 PM Curt Hagenlocher wrote: > One of the things they've done in the Delta table format which I think is > smart is to stop using version numbers and instead start

Re: Interest in Parquet V3

2024-05-13 Thread Curt Hagenlocher
One of the things they've done in the Delta table format which I think is smart is to stop using version numbers and instead start identifying specific features used by the table in a generic fashion. So instead of checking an opaque version number, a reader looks at the list of features and can
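
The feature-list idea described in this message can be sketched roughly as below. This is only an illustration of checking a list of required features instead of an opaque version number; the feature names, `SUPPORTED_FEATURES`, `unsupported_features`, and `can_read` are all hypothetical and are not taken from the Parquet or Delta specifications.

```python
# Hypothetical feature names for illustration only; they are not defined
# by the Parquet or Delta specifications.
SUPPORTED_FEATURES = frozenset({"data_page_v2", "delta_binary_packed", "page_index"})


def unsupported_features(file_features, supported=SUPPORTED_FEATURES):
    """Return, sorted, the features a file requires that this reader lacks."""
    return sorted(set(file_features) - set(supported))


def can_read(file_features, supported=SUPPORTED_FEATURES):
    """A reader can open the file only if it supports every required feature."""
    return not unsupported_features(file_features, supported)


if __name__ == "__main__":
    required = ["data_page_v2", "byte_stream_split"]
    if not can_read(required):
        # The reader can name exactly what it is missing, which a bare
        # version number cannot express.
        print("cannot read file; missing:", unsupported_features(required))
```

With a set of named features, a reader that lacks one capability can still open every file that does not use it, and it can report precisely which feature blocks a given file.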

Re: Interest in Parquet V3

2024-05-13 Thread Raphael Taylor-Davies
Further to what has already been said, I have likewise found the v2 branding quite hard to follow, but more fundamentally I have struggled to understand its purpose. As far as I understand it, version 2 groups together a number of disjoint features from new data pages to different encodings,

Re: Interest in Parquet V3

2024-05-13 Thread Antoine Pitrou
Same as Andrew. 1) the "v3" messaging is intuitively a turn-off as it's already not obvious whether Parquet "v2" is usable with implementations currently found in the wild. Concretely, the "v2" branding is commonly confused with the Parquet format version, and it's almost impossible to explain

Re: Interest in Parquet V3

2024-05-12 Thread Vinoo Ganesh
I don't have strong feelings about this one way or the other, but would gladly put my hand up to help collaborate on proposals/implementation as we figure this out. On Sun, May 12, 2024 at 5:31 AM Andrew Lamb wrote: > My opinion is that most (if not all) of the proposed benefits from these

Re: Interest in Parquet V3

2024-05-12 Thread Andrew Lamb
My opinion is that most (if not all) of the proposed benefits from these new formats can be achieved using the current parquet format and improved implementations (possibly with some minor extensions such as user defined encoding schemes) [1]. Another reason people propose replacing parquet I

Re: Interest in Parquet V3

2024-05-12 Thread Gang Wu
Hi Micah, I have also noticed the emergence of these new file formats which are challenging the popularity of Apache Parquet. It would always be good to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also proposing adding a new geometry type to the specs: [1]. This seems to

Interest in Parquet V3

2024-05-11 Thread Micah Kornfield
Hi Parquet Dev, I wanted to start a conversation within the community about working on a new revision of Parquet. For context there have been a bunch of new formats [1][2][3] that show there is decent room for improvement across data encodings and how metadata is organized. Specifically, in a