wgtmac commented on PR #58:
URL: https://github.com/apache/parquet-site/pull/58#issuecomment-2111443901
cc @gszadovszky @julienledem
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific
> I would hazard that simply storing statistics separately might
> be sufficient for the wide column use-cases, without requiring
> switching to something like flatbuffers?
I agree with Raphael. Column chunks and pages can be referenced by
offset and length. To avoid compatibility issues, we can d
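To illustrate the offset-and-length idea (a hedged sketch, not real Parquet code: the actual file metadata is a Thrift `FileMetaData` structure, and `read_column_chunk` and the toy index here are hypothetical), a reader that keeps metadata externally can fetch any single column chunk with a plain byte-range read:

```python
import io

def read_column_chunk(f, offset, length):
    """Fetch one column chunk by its (offset, length) pair,
    as recorded in externally stored metadata."""
    f.seek(offset)
    return f.read(length)

# Toy "file": three chunks laid out back to back.
chunks = [b"chunk-a", b"chunk-bb", b"chunk-ccc"]
buf = io.BytesIO(b"".join(chunks))

# Metadata kept separately: one (offset, length) entry per chunk.
index = []
pos = 0
for c in chunks:
    index.append((pos, len(c)))
    pos += len(c)

print(read_column_chunk(buf, *index[1]))  # b'chunk-bb'
```

Since every chunk is addressable this way, statistics stored outside the footer can still point back into the file without any format change.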
Hi Joris,
Thanks for the advice! We should definitely eliminate the confusion.
Let me update the wording on the releasing parquet page on the
website [1].
[1]
https://github.com/apache/parquet-site/blob/c605f977eef61e0d5024fc3d89f1e9aaeaa5c9b4/content/en/docs/Contribution%20Guidelines/releasing.m
Given the breadth of the parquet community at this point, I don't think
we should be singling out one or two "reference" implementations. Even
parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
encoding in a user-accessible way (it's only available as part of the
DELTA_BYTE_ARRAY w
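For readers unfamiliar with the encoding being discussed, here is a rough sketch of the DELTA_LENGTH_BYTE_ARRAY idea: all value lengths are stored first, followed by the concatenated bytes. (This is a simplified illustration only; the real encoding stores the lengths with DELTA_BINARY_PACKED, whereas plain little-endian uint32s are used here for clarity.)

```python
import struct

def encode_delta_length_byte_array(values):
    """Sketch of DELTA_LENGTH_BYTE_ARRAY: a count, then all lengths,
    then the concatenated value bytes."""
    lengths = b"".join(struct.pack("<I", len(v)) for v in values)
    data = b"".join(values)
    return struct.pack("<I", len(values)) + lengths + data

def decode_delta_length_byte_array(buf):
    (n,) = struct.unpack_from("<I", buf, 0)
    lengths = struct.unpack_from(f"<{n}I", buf, 4)
    out, pos = [], 4 + 4 * n
    for ln in lengths:
        out.append(buf[pos:pos + ln])
        pos += ln
    return out

vals = [b"parquet", b"", b"encoding"]
assert decode_delta_length_byte_array(encode_delta_length_byte_array(vals)) == vals
```

Separating lengths from payload is what makes the byte data compress well and allows the payload to be scanned as one contiguous region.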
I think Parquet's metadata and encoding/compression setup are problematic,
but I don't see a reason to make Parquet V3 if it's just going to be
another BtrBlocks or Nimble look-alike.
Some people in the thread have expressed the view that Parquet's metadata
is fine, and that people can achieve goo
In light of the discussions around better distinguishing Apache
Parquet the format from Apache Parquet-MR the Java implementation
(https://github.com/apache/parquet-site/pull/53), I wanted to give the
feedback that the template of this release announcement could also use
some improvement. Because as
1. I think we should make it easy for people contributing to the C++
codebase (which is why I voted for the move at the time).
2. If merging repos removes the circular-dependency issue between
repos for the C++ code bases, it does so at the expense of
making it easy to evolve
+1 on Micah starting a doc and following up by commenting in it.
@Raphael, Wish Maple: agreed that changing the metadata representation is
less important. Most engines can externalize and index metadata in some
way. It is an option to propose a standard way to do it without changing
the format. Ad
I agree that a parquet-mr implementation is a requirement to evolve the
spec. It makes sense to me that we call parquet-mr the reference
implementation and make implementing changes there a requirement for
evolving the spec.
I would add the requirement to implement it in the parquet cpp
implementation that lives in apache Arro
AFAIK, the only Parquet implementation under the Apache Parquet project
is parquet-mr :-)
On Tue, 14 May 2024 10:58:58 +0200
Rok Mihevc wrote:
> Second Raphael's point.
> Would it be reasonable to say specification change requires implementation
> in two parquet implementations within Apache P
On Mon, 13 May 2024 16:10:24 +0100
Raphael Taylor-Davies
wrote:
>
> I guess I wonder if rather than having a parquet format version 2, or
> even a parquet format version 3, we could just document what features a
> given parquet implementation actually supports. I believe Andrew intends
> to pi
Moving Parquet C++ out of Arrow C++ would basically recreate the
problems that motivated the integration of Parquet C++ into Arrow C++
:-)
Regards
Antoine.
On Tue, 14 May 2024 13:52:15 +0800
Gang Wu wrote:
> IMO, moving parquet-cpp out of arrow is challenging as the dependency
> chain looks
Thanks, everyone, for your perspectives.
I think, as a concrete next step, I'll try to pull together a Google doc
that covers the topics discussed here, as that might be a more
productive way to further the conversation (I don't want threads to get
split too much).
On Tue, May 14, 2024 at 8:33
I also think most of the proposed benefits of these new formats can be
achieved with the current Parquet format and improved implementations.
My concern is that:
1. For encoding, though many interesting encodings have been introduced,
most implementations just use and implement PLAIN and Di
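For context on the two encodings nearly every implementation does support, dictionary encoding replaces repeated values with small indices into a dictionary stored once. A minimal sketch of the concept (illustrative only; the real format packs the indices with RLE/bit-packing):

```python
def dictionary_encode(values):
    """Minimal sketch of dictionary encoding: map each distinct value
    to an index, store the dictionary once plus per-row indices."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

def dictionary_decode(dictionary, indices):
    return [dictionary[i] for i in indices]

col = ["us", "us", "eu", "us", "eu"]
d, idx = dictionary_encode(col)
assert d == ["us", "eu"] and idx == [0, 0, 1, 0, 1]
assert dictionary_decode(d, idx) == col
```

When cardinality is low, the dictionary amortizes quickly, which is part of why implementations have had little pressure to go beyond PLAIN and dictionary.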
[Note: You're receiving this email because you are subscribed to one
or more project dev@ mailing lists at the Apache Software Foundation.]
We are very close to Community Over Code EU -- check out the amazing
program and the special discounts that we have for you.
Special discounts
You still hav
Just to double check we're all on the same page w.r.t metadata, I
presume we're referring to FileMetadata [1]? If so this contains
information on the schema and locations of the column chunks. All
statistics information, including that of column chunks, can be
referenced solely by offset and no
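For readers following along, the FileMetaData in question sits at the end of the file: the serialized Thrift structure is followed by a 4-byte little-endian length and the 'PAR1' magic, which is how readers locate it. A sketch of that trailer layout, with stand-in bytes for the Thrift payload (the metadata content here is fake; only the trailer arithmetic reflects the spec):

```python
import struct

MAGIC = b"PAR1"

def footer_metadata_span(file_bytes):
    """Locate the serialized FileMetaData: the file ends with a 4-byte
    little-endian metadata length followed by the 'PAR1' magic."""
    assert file_bytes[-4:] == MAGIC, "not a Parquet file"
    (meta_len,) = struct.unpack("<I", file_bytes[-8:-4])
    start = len(file_bytes) - 8 - meta_len
    return start, meta_len

# Toy file: leading magic, fake row-group bytes, fake metadata, trailer.
metadata = b"\x15\x00..."  # stand-in for Thrift-compact FileMetaData
body = MAGIC + b"row-group-bytes" + metadata
toy = body + struct.pack("<I", len(metadata)) + MAGIC

start, n = footer_metadata_span(toy)
assert toy[start:start + n] == metadata
```

Everything else in the file (column chunks, pages, statistics) is then reachable from offsets recorded inside that one structure.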
1. Yes. IMO parquet-mr should be the RI, though a feature could only be
declared "done" when there is >1 implementation.
2. What about interoperability and compliance testing? that is, rather than
an RI, a set of test suites which somehow every impl has to pass. Tricky
cross-platform though.
On
BTW, has everyone read "An Empirical Evaluation of Columnar Storage
Formats"?
https://arxiv.org/abs/2304.05028
good review of how things could be better with real numbers. Highlights
that encoding plugins may be inefficient, based on the ORC experience.
w.r.t metadata
1. could the old and t
alamb commented on code in PR #53:
URL: https://github.com/apache/parquet-site/pull/53#discussion_r1599985436
##
content/en/docs/Overview/_index.md:
##
@@ -7,3 +7,41 @@ description: >
---
Apache Parquet is a columnar storage format available to any project in the
Hadoop eco
crepererum commented on code in PR #59:
URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599844055
##
content/en/docs/Overview/_index.md:
##
@@ -6,4 +6,7 @@ description: >
All about Parquet.
---
-Apache Parquet is a columnar storage format available to any
alamb commented on code in PR #59:
URL: https://github.com/apache/parquet-site/pull/59#discussion_r1599769911
##
content/en/_index.md:
##
@@ -9,7 +9,10 @@ title: Parquet
Download
-Apache Parquet is a columnar storage format available to
any project in the Hadoop ecosyst
I agree with Andrew. Recent Parquet spec changes have followed
the same practice:
- https://lists.apache.org/thread/gyvqcx9ssxkjlrwogqwy7n4z6ofdm871
- https://lists.apache.org/thread/wgobz41mfldbhqpg9q4mdwypghg2cxg2
- https://lists.apache.org/thread/nlsj0ftxy7y4ov1678rgy5zc7dmogg6q
On Tue, May
> Would it be reasonable to say specification change requires implementation
> in two parquet implementations within Apache Parquet project?
I believe this approach is how the Apache Arrow project handles spec
changes[1] and that process has worked well in my opinion.
Andrew
[1] https://arrow.ap
Second Raphael's point.
Would it be reasonable to say specification change requires implementation
in two parquet implementations within Apache Parquet project?
Rok
On Tue, May 14, 2024 at 10:50 AM Gang Wu wrote:
> IMHO, it looks more reasonable if a reference implementation is required
> to su
IMHO, it looks more reasonable if a reference implementation is required
to support most (not all) elements from the specification.
Another question is: should we discuss (and vote on) each candidate
one by one? We could start with parquet-mr, which is the most well-known
implementation.
Best,
Gang
On
Potentially it would be helpful to flip the question around. As Andrew
articulates, a reference implementation is required to implement all
elements of the specification, and therefore the major consequence of
labeling parquet-mr as such would be that any specification change would
have to be