Re: [PR] Remove staging [parquet-site]

2024-05-14 Thread via GitHub
wgtmac commented on PR #58: URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2111443901 cc @gszadovszky @julienledem

Re: Interest in Parquet V3

2024-05-14 Thread Gang Wu
> I would hazard that simply storing statistics separately might be sufficient for the wide column use-cases, without requiring switching to something like flatbuffers?
I agree with Raphael. Column chunks and pages can be referenced by offset and length. To avoid compatibility issues, we can d

Re: Better announcement message [Apache Parquet release 1.14.0]

2024-05-14 Thread Gang Wu
Hi Joris, Thanks for the advice! We should definitely eliminate the confusion. Let me update the wording on the releasing parquet page on the website [1]. [1] https://github.com/apache/parquet-site/blob/c605f977eef61e0d5024fc3d89f1e9aaeaa5c9b4/content/en/docs/Contribution%20Guidelines/releasing.m

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Ed Seidl
Given the breadth of the parquet community at this point, I don't think we should be singling out one or two "reference" implementations. Even parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY encoding in a user-accessible way (it's only available as part of the DELTA_BYTE_ARRAY w

Re: Interest in Parquet V3

2024-05-14 Thread Martin Loncaric
I think Parquet's metadata and encoding/compression setup are problematic, but I don't see a reason to make Parquet V3 if it's just going to be another BtrBlocks or Nimble look-alike. Some people in the thread have expressed the view that Parquet's metadata is fine, and that people can achieve goo

Better announcement message [Apache Parquet release 1.14.0]

2024-05-14 Thread Joris Van den Bossche
In light of the discussions around better distinguishing Apache Parquet the format and Apache Parquet-MR the Java implementation (https://github.com/apache/parquet-site/pull/53), I wanted to give the feedback that the template of this release announcement could also use some improvement. Because as

Re: [C++] Parquet and Arrow overlap

2024-05-14 Thread Julien Le Dem
1. I think we should make it easy for people contributing to the C++ codebase. (which is why I voted for the move at the time) 2. If merging repos removes the need to deal with the circular dependency between repos issue for the C++ code bases, it does it at the expense of making it easy to evolve

Re: Interest in Parquet V3

2024-05-14 Thread Julien Le Dem
+1 on Micah starting a doc and following up by commenting in it. @Raphael, Wish Maple: agreed that changing the metadata representation is less important. Most engines can externalize and index metadata in some way. It is an option to propose a standard way to do it without changing the format. Ad

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Julien Le Dem
I agree that parquet-mr implementation is a requirement to evolve the spec. It makes sense to me that we call parquet-mr the reference implementation and make it a requirement to evolve the spec. I would add the requirement to implement it in the parquet cpp implementation that lives in apache Arro

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Antoine Pitrou
AFAIK, the only Parquet implementation under the Apache Parquet project is parquet-mr :-)
On Tue, 14 May 2024 10:58:58 +0200 Rok Mihevc wrote:
> Second Raphael's point.
> Would it be reasonable to say specification change requires implementation in two parquet implementations within Apache P

Re: Interest in Parquet V3

2024-05-14 Thread Antoine Pitrou
On Mon, 13 May 2024 16:10:24 +0100 Raphael Taylor-Davies wrote:
> I guess I wonder if rather than having a parquet format version 2, or even a parquet format version 3, we could just document what features a given parquet implementation actually supports. I believe Andrew intends to pi

Re: [C++] Parquet and Arrow overlap

2024-05-14 Thread Antoine Pitrou
Moving Parquet C++ out of Arrow C++ would basically recreate the problems that motivated the integration of Parquet C++ into Arrow C++ :-)
Regards Antoine.
On Tue, 14 May 2024 13:52:15 +0800 Gang Wu wrote:
> IMO, moving parquet-cpp out of arrow is challenging as the dependency chain looks

Re: Interest in Parquet V3

2024-05-14 Thread Micah Kornfield
Thanks everyone for their perspectives. I think as a concrete next step, I'll try to pull together a Google doc that covers the topics covered here as I think that might be a more productive way to further the conversation (I don't want threads to get split too much). On Tue, May 14, 2024 at 8:33

Re: Interest in Parquet V3

2024-05-14 Thread wish maple
I also think most of the proposed benefits from these new formats can be achieved using the current parquet format and improved implementations. My concern is that: 1. For encoding, although many interesting encodings have been introduced, most implementations now just use and implement PLAIN and Di

Community over Code EU 2024: The countdown has started!

2024-05-14 Thread Ryan Skraba
[Note: You're receiving this email because you are subscribed to one or more project dev@ mailing lists at the Apache Software Foundation.] We are very close to Community Over Code EU -- check out the amazing program and the special discounts that we have for you. Special discounts You still hav

Re: Interest in Parquet V3

2024-05-14 Thread Raphael Taylor-Davies
Just to double check we're all on the same page w.r.t metadata, I presume we're referring to FileMetadata [1]? If so this contains information on the schema and locations of the column chunks. All statistics information, including that of column chunks, can be referenced solely by offset and no

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Steve Loughran
1. Yes. IMO parquet-mr should be the RI, though a feature could only be declared as "done" when there is >1 implementation. 2. What about interoperability and compliance testing? That is, rather than an RI, a set of test suites which somehow every impl has to pass. Tricky cross-platform though. On

Re: Interest in Parquet V3

2024-05-14 Thread Steve Loughran
BTW, has everyone read "An Empirical Evaluation of Columnar Storage Formats"? https://arxiv.org/abs/2304.05028 good review of how things could be better with real numbers. Highlights that encoding plugins may be inefficient, based on the ORC experience. w.r.t metadata 1. could the old and t

Re: [PR] First draft of docs about parquet format vs mr [parquet-site]

2024-05-14 Thread via GitHub
alamb commented on code in PR #53: URL: https://github.com/apache/parquet-site/pull/53#discussion_r1599985436 ## content/en/docs/Overview/_index.md: ## @@ -7,3 +7,41 @@ description: > --- Apache Parquet is a columnar storage format available to any project in the Hadoop eco

Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]

2024-05-14 Thread via GitHub
crepererum commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599844055 ## content/en/docs/Overview/_index.md: ## @@ -6,4 +6,7 @@ description: > All about Parquet. --- -Apache Parquet is a columnar storage format available to any

Re: [PR] PARQUET-2470: Update website with larger ecosystem emphasis [parquet-site]

2024-05-14 Thread via GitHub
alamb commented on code in PR #59: URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599769911 ## content/en/_index.md: ## @@ -9,7 +9,10 @@ title: Parquet Download -Apache Parquet is a columnar storage format available to any project in the Hadoop ecosyst

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Gang Wu
I agree with Andrew. Recent parquet specs changes have followed the same practice:
- https://lists.apache.org/thread/gyvqcx9ssxkjlrwogqwy7n4z6ofdm871
- https://lists.apache.org/thread/wgobz41mfldbhqpg9q4mdwypghg2cxg2
- https://lists.apache.org/thread/nlsj0ftxy7y4ov1678rgy5zc7dmogg6q
On Tue, May

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Andrew Lamb
> Would it be reasonable to say specification change requires implementation in two parquet implementations within Apache Parquet project?
I believe this approach is how the Apache Arrow project handles spec changes[1] and that process has worked well in my opinion.
Andrew
[1] https://arrow.ap

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Rok Mihevc
Second Raphael's point. Would it be reasonable to say specification change requires implementation in two parquet implementations within Apache Parquet project?
Rok
On Tue, May 14, 2024 at 10:50 AM Gang Wu wrote:
> IMHO, it looks more reasonable if a reference implementation is required to su

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Gang Wu
IMHO, it looks more reasonable if a reference implementation is required to support most (not all) elements from the specification. Another question is: should we discuss (and vote for) each candidate one by one? We can start with parquet-mr, which is the most well-known implementation. Best, Gang On

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Raphael Taylor-Davies
Potentially it would be helpful to flip the question around. As Andrew articulates, a reference implementation is required to implement all elements from the specification, and therefore the major consequence of labeling parquet-mr thusly would be that any specification change would have to be