[ANNOUNCE] Apache Parquet-Java release 1.14.3

2024-10-05 Thread Gang Wu
Hi, I'm pleased to announce the release of Apache Parquet-Java 1.14.3! Parquet is a general-purpose columnar file format for nested data. It uses space-efficient encodings and a compressed and splittable structure for processing frameworks like Hadoop. Changes are listed at: https://github.com/a

[VOTE][RESULT] Release Apache Parquet-Java 1.14.3 RC2

2024-10-05 Thread Gang Wu
This vote passed with the following result: +1 (binding): Gábor Szádovszky, Xinli Shang, Gidon Gershinsky, Gang Wu +1 (non-binding): Vinoo Ganesh, Jean-Baptiste Onofré Thanks everyone! Kind regards, Gang

Re: [VOTE] Release Apache Parquet-Java 1.14.3 RC2

2024-10-05 Thread Gang Wu
> > > > > > > Checked tarball content, checksum and signature, executed the unit > > tests. > > > > All pass. > > > > > > > > +1 (binding) > > > > > > > > Gang Wu wrote (on Wed, Oct 2, 2024, > > > 17:06):

[VOTE] Release Apache Parquet-Java 1.14.3 RC2

2024-10-02 Thread Gang Wu
Hi everyone, I propose the following RC to be released as the official Apache Parquet-Java 1.14.3 release. The commit ID is b5e376a2caee767a11e75b783512b14cf8ca90ec * This corresponds to the tag: apache-parquet-1.14.3-rc2 * https://github.com/apache/parquet-mr/tree/b5e376a2caee767a11e75b783512b14

Re: [VOTE] Release Apache Parquet-Java 1.14.3 RC0

2024-10-02 Thread Gang Wu
my vote to -1 (binding) in favor > of including https://github.com/apache/parquet-java/issues/3021 in this > release. > > Cheers, > Gabor > > Gang Wu wrote (on Mon, Sep 30, 2024, 15:02): > > > Hi Gabor, > > > > I think that we can remove the CHANGES.

Re: [VOTE] Release Apache Parquet-Java 1.14.3 RC0

2024-09-30 Thread Gang Wu
f the CHANGES.md file from the > release process. I am fine with it but do we want to keep it in the repo, > then? > > Cheers, > Gabor > > Gang Wu wrote (on Mon, Sep 30, 2024, 9:55): > > > Hi everyone, > > > > I propose the following RC to be released

[VOTE] Release Apache Parquet-Java 1.14.3 RC0

2024-09-30 Thread Gang Wu
Hi everyone, I propose the following RC to be released as the official Apache Parquet-Java 1.14.3 release. The commit ID is cf1efcc932a39dad8c47bd113f03c4848b3b1ed5 * This corresponds to the tag: apache-parquet-1.14.3-rc0 * https://github.com/apache/parquet-mr/tree/cf1efcc932a39dad8c47bd113f03c48

Re: Minor release request for major issue

2024-09-26 Thread Gang Wu
Hi, I can help with the release. It might be good to release 1.14.3 which is based on the current 1.14.2 plus https://github.com/apache/parquet-java/pull/3017 Is there anything else that we need to backport to 1.14.3? Best, Gang On Fri, Sep 27, 2024 at 2:16 AM Ted Jenks wrote: > Hi, > > > > T

Re: Variant spec and its security implications

2024-09-19 Thread Gang Wu
+ dev@spark because authors of variant may not subscribe to dev@parquet On Mon, Sep 16, 2024 at 7:33 PM Antoine Pitrou wrote: > > Hello, > > I've been reading the spec in more detail here: > > https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md#

Re: [VOTE] Adopt Variant from Spark

2024-09-17 Thread Gang Wu
> > > wrote: > > > > > > > > > > > +0 on accepting Variant into the Parquet *project*, but that's not an > > > > approval for sharing repos with the current Parquet format and > > > > implementations. > > > > > > >

Re: [VOTE] Adopt Variant from Spark

2024-09-11 Thread Gang Wu
's doc [1] + discussion thread on the nitty gritty of what this > > > proposal actually means. > > > > > > [1] > > > > > > > > > https://docs.google.com/document/d/1guEzBQjzOEEZvvibeZjNraKmZHWtxQR95O_DvtZU0xw/edit#heading=h.5ad5x

Re: [VOTE] Adopt Variant from Spark

2024-09-11 Thread Gang Wu
t spec to Parquet. I'm looking forward to > > working > > > > on the addition of shredding. > > > > > > > > As for the details, I think I also prefer a separate repository, > > > > `parquet-variant`, but I don't think we necessaril

Re: [VOTE] Adopt Variant from Spark

2024-09-10 Thread Gang Wu
> parties from considering this standalone, non-Parquet, data format. > > (but for the same reason, I would recommend a separate project as well > :-)) > > Regards > > Antoine. > > > On Tue, 10 Sep 2024 09:48:03 +0800 > Gang Wu wrote: > > Hi all, > >

Re: [VOTE] Adopt Variant from Spark

2024-09-09 Thread Gang Wu
2024 at 10:10 AM Micah Kornfield wrote: > Did we actually close on this, I thought some people were in favor of a > separate repo? I think this might be important in terms of release > cadence? > > On Monday, September 9, 2024, Gang Wu wrote: > > > Hi all, > > >

Re: [DISCUSS] Adopt Variant Spec from Spark?

2024-09-09 Thread Gang Wu
onable to put the java implementation in the > parquet-java > > > > > > > > > I also agree with that, it should be just a module in the Maven > project. > > > > > > Kind regards, > > > Fokko > > > > > > On Mon, 26 Aug 2024 at

Re: [VOTE] Parquet binary protocol extensions

2024-08-28 Thread Gang Wu
+1 for the proposal Best, Gang On Wed, Aug 28, 2024 at 8:03 AM Corwin Joy wrote: > +1 > > On Tue, Aug 27, 2024, 3:07 PM Julien Le Dem wrote: > > > +1 > > (for reference, discussion thread: > > https://lists.apache.org/thread/63mtbq7mydrxd0b9nc5kwgqnhkmp7684 ) > > > > On Mon, Aug 26, 2024 at 11

Re: [DISCUSS] Adopt Variant Spec from Spark?

2024-08-26 Thread Gang Wu
s sense we can have a separate module in parquet-java > that may only depend on other low level parquet modules (like > parquet-format but surely not hadoop). This way any java-based projects can > easily use it. > What do you think? > > Gabor > > Gang Wu wrote (on

Re: [DISCUSS] Adopt Variant Spec from Spark?

2024-08-25 Thread Gang Wu
> > > the > >> > > > > > > > > future > >> > > > > > > > > goals was to integrate more closely with Parquet, and > >> having > >> > > the > >> > > > > spec > >> > > > > > > at > >>

Re: [VOTE] Apache Parquet Java 1.14.2 RC2

2024-08-24 Thread Gang Wu
+1 Verified the artifacts by running build and test on my mac. BTW, should we fix the link https://github.com/apache/parquet-java/releases/tag/apache-parquet-1.14.2-rc1 to refer to RC2? Best, Gang On Sun, Aug 25, 2024 at 2:05 AM Fokko Driesprong wrote: > Oops, I stopped at step 4 >

[DISCUSS] Adopt Variant Spec from Spark?

2024-08-23 Thread Gang Wu
Hi, Apache Iceberg is adding variant type support [1][2] by adopting the variant spec [3] from Apache Spark. As the proposal is getting mature, both Iceberg [4] and Spark [5] communities are discussing moving the variant type to Parquet repo to avoid divergence. Moving it into Parquet makes the va

Re: [VOTE] Apache Parquet Java 1.14.2 RC1

2024-08-20 Thread Gang Wu
Hi Julien, I can successfully build both the master branch and RC1 on mac. I installed thrift 0.19 with the help from [1]. My JDK version is as below: openjdk version "1.8.0_322" OpenJDK Runtime Environment (Zulu 8.60.0.21-CA-macos-aarch64) (build 1.8.0_322-b06) OpenJDK 64-Bit Server VM (Zulu 8.6

Re: [DISCUSS] Clarify num_nulls(null_counts) and distinct_counts in Parquet statistics

2024-08-17 Thread Gang Wu
+1 to suggestion from Xuwei. That is a common practice to work around a bug if a specific writer version can be detected, though the fix might not look elegant. Best, Gang On Sat, Aug 17, 2024 at 6:24 PM Andrew Lamb wrote: > Got it -- makes sense -- thank you > > On Sat, Aug 17, 2024 at 6:11 AM
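
The workaround pattern described in this message (detect a specific writer version, then distrust its statistics) can be sketched in Python as follows. This is only an illustrative sketch: it assumes the writer identifies itself with a string of the form `<application> version X.Y.Z ...`, and the function names and the cutoff version are hypothetical, not taken from any Parquet implementation or from this thread.

```python
import re
from typing import Optional, Tuple

# Hypothetical cutoff for illustration only: pretend that parquet-mr writers
# older than this version produced unreliable null counts.
HYPOTHETICAL_FIRST_FIXED_VERSION = (1, 99, 0)

# Assumed shape of the writer string: "<app> version <major>.<minor>.<patch> ..."
_CREATED_BY = re.compile(
    r"^(?P<app>\S+) version (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
)


def parse_created_by(created_by: str) -> Optional[Tuple[str, Tuple[int, int, int]]]:
    """Return (application, (major, minor, patch)) or None if unparseable."""
    match = _CREATED_BY.match(created_by)
    if match is None:
        return None
    version = (
        int(match.group("major")),
        int(match.group("minor")),
        int(match.group("patch")),
    )
    return match.group("app"), version


def should_trust_null_count(created_by: str) -> bool:
    """Decide whether a reader should rely on the null count in statistics."""
    parsed = parse_created_by(created_by)
    if parsed is None:
        # Unknown writer string: be conservative and ignore the statistic.
        return False
    app, version = parsed
    if app == "parquet-mr" and version < HYPOTHETICAL_FIRST_FIXED_VERSION:
        # Known-affected writer (by assumption): ignore the statistic.
        return False
    return True
```

Treating an unparseable writer string as untrusted is a design choice made for this sketch; a reader could equally decide to trust statistics unless the writer is positively identified as affected.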

Re: [DISCUSS] Parquet 1.14.2 release

2024-08-14 Thread Gang Wu
Thanks Fokko for raising this! I just checked the commits and want to discuss if following ones should be backported: - https://github.com/apache/parquet-java/pull/2949 - https://github.com/apache/parquet-java/pull/1376 They are not blocking issues and if there is any concern we can ignore them.

Re: [DISCUSS] Extension types in Parquet?

2024-08-14 Thread Gang Wu
xtensionType, the order is > defined by the type itself. > > Cheers, > Jan > > On Wed, 29 May 2024 at 09:10, Antoine Pitrou < > anto...@python.org > > wrote: > > > On Wed, 29 May 2024 10:27:02 +0800 > > Gang Wu wrote: > > > I think adding

[ANNOUNCE] New Parquet PMC Member: Micah Kornfield

2024-07-18 Thread Gang Wu
On behalf of the Parquet PMC, I'm pleased to announce that Micah has been invited to be a Parquet PMC member and he has accepted. Welcome, and thank you for your contributions! Cheers, Gang

[ANNOUNCE] New Parquet PMC Member: Antoine Pitrou

2024-07-18 Thread Gang Wu
On behalf of the Parquet PMC, I'm pleased to announce that Antoine has been invited to be a Parquet PMC member and he has accepted. Welcome, and thank you for your contributions! Cheers, Gang

[ANNOUNCE] New Parquet Committer: Xuwei Fu

2024-07-10 Thread Gang Wu
On behalf of the Apache Parquet PMC, I'm happy to announce that Xuwei Fu has accepted an invitation to become a committer on Apache Parquet. Welcome, and thank you for your contributions! Thanks, Gang

Re: [DISCUSS] Parquet sync day and time

2024-07-09 Thread Gang Wu
Thanks for the discussion! I'm in GMT+8 so I would prefer 8am-10am PT, though it is already midnight. I will try my best to chime in. Best, Gang On Tue, Jul 9, 2024 at 5:09 PM Fokko Driesprong wrote: > Hey Julien, > > Thanks for bringing this up. The PyIceberg sync is on the last Tuesday of >

Re: [VOTE] Adopt proposal on new features for parquet-format and release for Parquet Java

2024-07-03 Thread Gang Wu
Generally +1 on the proposal. Thanks for finalizing it! I have left a comment regarding the next major release of parquet-java. Best, Gang On Thu, Jul 4, 2024 at 1:55 AM Micah Kornfield wrote: > This vote is whether to adopt and merge [1][2] a proposal for providing > formal guidance on new fe

Re: Congrats to Julien Le Dem for being next PMC Chair

2024-07-03 Thread Gang Wu
Thanks Xinli and welcome back Julien! Best, Gang On Thu, Jul 4, 2024 at 1:10 AM Parth Chandra wrote: > Thanks Xinli for your leadership! And welcome back Julien! > > -Parth > > On Wed, Jul 3, 2024 at 5:13 AM Rok Mihevc wrote: > > > Congrats Julien and thanks Xinli! > > > > Rok > > > > On Wed,

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Gang Wu
I think the main argument here is whether the behavioral change to this field will actually break any reader implementation. Considering the current state of parquet-java, I would guess no. But I agree that we should modify the comment in the spec to make it clear and do the right thing on the writer's

Re: [DISCUSS] Migration of parquet-* issues from Jira to GitHub

2024-06-23 Thread Gang Wu
1 > > > spark 1 > > > nullpointerexception1 > > > hive 1 > > > hadoop 1 > > > OOM 1 > > > question

Re: [VOTE] Migration of parquet-* issues from Jira to GitHub

2024-06-19 Thread Gang Wu
I think you can do this directly. Thanks! Gang On Thu, Jun 20, 2024 at 9:30 AM Rok Mihevc wrote: > Thanks for the reminder Gang. Should a PMC conclude the vote or can I do > it? > > Rok > > On Thu, Jun 20, 2024 at 3:14 AM Gang Wu wrote: > > > Thanks Rok! We might

Re: [VOTE] Migration of parquet-* issues from Jira to GitHub

2024-06-19 Thread Gang Wu
Thanks Rok! We might need to conclude this vote and send an email with [VOTE][RESULT] title before any follow-up action. Best, Gang On Thu, Jun 20, 2024 at 7:09 AM Rok Mihevc wrote: > Thanks for the feedback Steve! > > > * the ability cross reference stuff from other jira projects > > * the si

Re: [DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated with STRING?

2024-06-18 Thread Gang Wu
tures. The first optimizes encoding by not storing lengths and the > latter says the binary is valid UTF8. > > On Tue, Jun 18, 2024 at 8:35 AM Gang Wu wrote: > > > FYI, both parquet-cpp [1] and parquet-java [2] do not allow FLBA. > > > > [1] > > > > > https:

Re: [DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated with STRING?

2024-06-17 Thread Gang Wu
FYI, both parquet-cpp [1] and parquet-java [2] do not allow FLBA. [1] https://github.com/apache/arrow/blob/eec6f17c8879b469dc3370dad4a7f68f11705a6b/cpp/src/parquet/types.cc#L829-L842 [2] https://github.com/apache/parquet-java/blob/fbe13d89ae4193be12c164d4bb5342c5eba3963f/parquet-column/src/main/ja

[ANNOUNCE] Apache Parquet-Java release 1.14.1

2024-06-15 Thread Gang Wu
Hi, I'm pleased to announce the release of Apache Parquet-Java 1.14.1! Parquet is a general-purpose columnar file format for nested data. It uses space-efficient encodings and a compressed and splittable structure for processing frameworks like Hadoop. Changes are listed at: https://github.com/a

[VOTE][RESULT] Release Apache Parquet-Java 1.14.1 RC0

2024-06-15 Thread Gang Wu
With three +1 binding votes, this release vote passes. +1 votes: Gábor Szádovszky (binding) Gidon Gershinsky (binding) Gang Wu (binding) -1 votes: None Thank you all who have voted. Cheers, Gang

Re: [VOTE] Release Apache Parquet-Java 1.14.1 RC0

2024-06-15 Thread Gang Wu
Checked tarball content, signature and checksum. Executed unit tests. All > > pass. > > +1 (binding) > > > > Gang Wu wrote (on Thu, Jun 13, 2024, 8:43): > > > > > Hi everyone, > > > > > > I propose the following RC to be
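
The checksum part of the verification mentioned in this message can be sketched in Python as follows. This is a minimal sketch that assumes the release candidate ships a SHA-512 digest next to the tarball; the helper names are hypothetical, and signature verification (done separately against the KEYS file) is not covered.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, checksum_text: str) -> bool:
    """Compare a file against the contents of a checksum file.

    Assumes the checksum text starts with the hex digest, optionally
    followed by whitespace and a file name.
    """
    tokens = checksum_text.split()
    if not tokens:
        return False
    return sha512_of_file(path) == tokens[0].lower()
```

A caller would pass the path of the downloaded tarball and the text of the accompanying checksum file, and treat a `False` result as a reason to withhold a +1.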

Re: [VOTE] Migration of parquet-* issues from Jira to GitHub

2024-06-13 Thread Gang Wu
+1 (binding) Best, Gang On Fri, Jun 14, 2024 at 2:26 AM Ed Seidl wrote: > +1 (non-binding) > > Thanks! > Ed > > On 6/13/24 11:20 AM, Micah Kornfield wrote: > > +1 (non-binding) > > > > On Thu, Jun 13, 2024 at 11:14 AM Rok Mihevc > wrote: > > > >> Hi all, > >> > >> Following the ML discussion [

Re: [DISCUSS] Migration of parquet-* issues from Jira to GitHub

2024-06-13 Thread Gang Wu
+1 on this. BTW, I created the following PRs to enable GitHub issues for these repos: - https://github.com/apache/parquet-format/pull/255 - https://github.com/apache/parquet-java/pull/1362 - https://github.com/apache/parquet-testing/pull/50 I will not merge them until the formal vote passes. Best, Ga

[VOTE] Release Apache Parquet-Java 1.14.1 RC0

2024-06-12 Thread Gang Wu
Hi everyone, I propose the following RC to be released as the official Apache Parquet-Java 1.14.1 release. The commit id is 97ede968377400d1d79e3196636ba3de392196ba * This corresponds to the tag: apache-parquet-1.14.1-rc0 * https://github.com/apache/parquet-java/tree/97ede968377400d1d79e3196636ba

Re: [DISCUSS] Patch release for parquet-java 1.14.1?

2024-06-12 Thread Gang Wu
don't have > any fixes that can go in, so from my end, we're good for starting the > release process. > > Kind regards, > Fokko > > On Tue, 4 Jun 2024 at 09:03, Gang Wu wrote: > > > Hi, > > > > It seems that we need a patch release 1.14.1 to fix [1]

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-06-12 Thread Gang Wu
non-parquet-cpp repos before the > action. > > Agreed. Did we discuss this enough to call for a vote yet? > > On Wed, Jun 12, 2024 at 5:23 PM Gang Wu wrote: > > > Thanks Rok for the update! > > > > Yes, the copied issues look good to me. Perhaps we need a separate

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-06-12 Thread Gang Wu
> On Fri, May 31, 2024 at 10:04 AM Rok Mihevc wrote: > > > Would we also want to add issue templates to encourage some structure? > See > > [1] for inspiration. > > > > [1] https://github.com/apache/arrow/blob/main/.github/ISSUE_TEMPLATE > > > > On Fri, May 31,

[DISCUSS] Patch release for parquet-java 1.14.1?

2024-06-04 Thread Gang Wu
Hi, It seems that we need a patch release 1.14.1 to fix [1]. All new commits in branch 1.14.x can be viewed at [2]. If there is any additional fix to be included, please let me know. If the community believes the release is necessary, I can volunteer to be the release manager. [1] https://issues.

Re: ColumnMetaData location

2024-06-03 Thread Gang Wu
> modifying the spec to state that the ColumnMetaData following > the chunk data is also optional +1 on this > adding language to the effect that if the value of file_offset is 0, > then no such metadata is present in the file. What about marking this as deprecated and discouraged to use it? B

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-30 Thread Gang Wu
che/arrow-nanoarrow/blob/81711045e8bb4ded1cb3b5a6fa354b35f18aa4e7/.asf.yaml#L24-L25 > > On Wed, May 29, 2024 at 10:39 PM Gang Wu wrote: > > > > Just want to mention that these apache/parquet-* Github repositories > > have not yet enabled issues and INFRA tickets are required before > > mi

Re: [DISCUSS] Extensibility of Parquet

2024-05-30 Thread Gang Wu
This is similar to what we do internally to provide non-standard encoding by duplicating data in the customized index pages. It is the vendor's choice whether to pay the extra storage cost for better encoding support. So I like this idea to support encoding extensions. Best, Gang On Thu, May 30, 2024 at

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-29 Thread Gang Wu
I'm interested in experimenting and implementing new encodings. Will follow up with concrete proposals or findings. Best, Gang On Thu, May 30, 2024 at 3:29 AM Ed Seidl wrote: > Maybe this is putting the cart too far in front of the horse, but I'd be > willing to implement an encoding like this

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-29 Thread Gang Wu
Just want to mention that these apache/parquet-* Github repositories have not yet enabled issues and INFRA tickets are required before migration. Best, Gang On Thu, May 30, 2024 at 1:55 AM Micah Kornfield wrote: > SGTM +1 > > On Wed, May 29, 2024 at 10:50 AM Rok Mihevc wrote: > > > On Wed, May

Re: [DISCUSS] Unify Record / Row terminology (to Row)

2024-05-29 Thread Gang Wu
Hi, I agree that row sounds clearer than record, however we have a class RecordReader in the parquet cpp: [1]. Not sure if we need to rename it and it is still considered an internal class. [1] https://github.com/apache/arrow/blob/4a2df663bc88c73b863e0c0036160f7f936574c2/cpp/src/parquet/column_re

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Gang Wu
+1 (binding for Parquet) Thanks! Gang On Wed, May 29, 2024 at 10:47 PM Fokko Driesprong wrote: > +1 (non-binding) > > On Wed, 29 May 2024 at 16:46, Felipe Oliveira Carvalho < > felipe...@gmail.com> wrote: > > > +1 (non-binding) > > > > On Wed, 29 May 2024 at 11:30 Micah Kornfield > > wrote: > >

Re: [DISCUSS] Extension types in Parquet?

2024-05-28 Thread Gang Wu
I think adding extension type support will make it easier for adding tensor or vector type, which is [1] trying to target. However, the geometry type seems not easy to fit to the imagination of the extension type. It would be better to explicitly define geospatial statistics in the spec, otherwise

Re: [DISCUSS] Extensibility of Parquet

2024-05-28 Thread Gang Wu
I'm supportive of most of the points in this thread. For 2), making encodings pluggable does not eliminate the work on implementation and interoperability. If people are worried about the lengthy process to promote a new encoding to the spec, perhaps we can preserve an encoding type for each new c

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-28 Thread Gang Wu
+1 on this. IIUC, I didn't see any objection to this in the discussion [1]. Perhaps we can directly proceed to a vote? Sorry, I had intended to start the vote but got distracted by other stuff. [1] https://lists.apache.org/thread/jf9wos3t6xxk6xdyx2dof1jlkbpkr56p Best, Gang On Wed, May

Re: BYTE_ARRAY vs binary in Parquet specification

2024-05-26 Thread Gang Wu
Hi Ed, Sorry for the late reply. I agree that we need to replace BINARY with BYTE_ARRAY to avoid confusion because FIXED_LENGTH_BYTE_ARRAY may also be regarded as BINARY. Best, Gang On Fri, May 24, 2024 at 2:01 AM Ed Seidl wrote: > Hi all, > > A question came up in the discussion of PARQUET-24

Re: Repeated fields spec clarification

2024-05-21 Thread Gang Wu
BTW, it seems totally valid to create page index for a subset of all columns. Does it mean columns without page index can have their records spanning more than one page? Best, Gang On Tue, May 21, 2024 at 7:26 PM Gang Wu wrote: > I would like to ask if it is valid to create only ColumnIn

Re: Repeated fields spec clarification

2024-05-21 Thread Gang Wu
I would like to ask if it is valid to create only ColumnIndex but omit OffsetIndex? My answer is NO according to [1]. If agreed, my inclination is option 1. [1] https://github.com/apache/parquet-format/blob/079a2dff06e32b7d1ad8c9aa67f2e2128fb5ffa5/src/main/thrift/parquet.thrift#L1019-L1022 On T

Re: [DISCUSS] Parquet C++ under which PMC?

2024-05-16 Thread Gang Wu
> > > > > > > > > On Tue, 14 May 2024 10:58:58 +0200 > > > Rok Mihevc wrote: > > >> Second Raphael's point. > > >> Would it be reasonable to say specification change requires > implementation > > >> in two parquet implem

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-15 Thread Gang Wu
+1 on renaming the repo to reduce confusion. However, the java library still uses the "parquet-mr" prefix to write its application version [1] and it is consumed by downstream projects like parquet-cpp [2] as well. [1] https://github.com/search?q=repo%3Aapache%2Fparquet-mr+parquet-mr+language%3AJ

Re: [DISCUSSION] Introduce FIXED_SIZE_LIST logical type

2024-05-15 Thread Gang Wu
Hi Rok, Happy to see you here :) According to my past experience, it would be more helpful to open a PR against the parquet-format repository and post it here. Best, Gang On Wed, May 15, 2024 at 7:25 PM Rok Mihevc wrote: > Hi all, > > Arrow recently introduced FixedShapeTensor and VariableSha

Re: Interest in Parquet V3

2024-05-14 Thread Gang Wu
> I would hazard that simply storing statistics separately might > be sufficient for the wide column use-cases, without requiring > switching to something like flatbuffers? I agree with Raphael. Column chunks and pages can be referenced by offset and length. To avoid compatibility issues, we can d

Re: Better announcement message [Apache Parquet release 1.14.0]

2024-05-14 Thread Gang Wu
essage does not mention "mr" or "Java" at > all (except in the url, and that there are Java artifacts available). > > Cheers, > Joris > > On Wed, 8 May 2024 at 05:26, Gang Wu wrote: > > > > Hi, > > > > I'm pleased to announce the r

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Gang Wu
res > implementation > > in two parquet implementations within Apache Parquet project? > > > > Rok > > > > On Tue, May 14, 2024 at 10:50 AM Gang Wu wrote: > > > > > IMHO, it looks more reasonable if a reference implementation is > required > &g

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Gang Wu
IMHO, it looks more reasonable if a reference implementation is required to support most (not all) elements from the specification. Another question is: should we discuss (and vote for) each candidate one by one? We can start with parquet-mr, which is the most well-known implementation. Best, Gang On

Re: Fwd: [C++] Parquet and Arrow overlap

2024-05-13 Thread Gang Wu
> > > > > > Thank you, that sounds great! On first glance some seem to be rather > > old > > > > and probably don't apply anymore. > > > > > > > > > BTW, do we really need to make a full copy of them to have a mirror

Re: Fwd: [C++] Parquet and Arrow overlap

2024-05-12 Thread Gang Wu
> > > > > > > Thank you, that sounds great! On first glance some seem to be rather > > old > > > > and probably don't apply anymore. > > > > > > > > > BTW, do we really need to make a full copy of them to have a mirror > >

Re: Interest in Parquet V3

2024-05-12 Thread Gang Wu
Hi Micah, I have also noticed the emergence of these new file formats which are challenging the popularity of Apache Parquet. It would always be good to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also proposing adding a new geometry type to the specs: [1]. This seems to align

[DISCUSS] Add geometry logical type

2024-05-12 Thread Gang Wu
Hi, Apache Iceberg community is proposing to add geospatial support [1]. It would be good if Apache Parquet can support native geometry type to implement more efficient encoding, statistics and filtering. Therefore, I'd like to propose a format change to add a new geometry logical type: [2]. It is

Re: [DISCUSS] Propose changing the default branch of the parquet-site repo

2024-05-12 Thread Gang Wu
+1 This makes sense. I was also confused when I had access to parquet-site for the first time. Thanks Andrew! Best, Gang On Sun, May 12, 2024 at 3:15 AM Vinoo Ganesh wrote: > +1, this would be great. It's something Xinli and I discussed when we first > made the website updates, but it ended u

Re: [ANNOUNCE] New Parquet PMC Member: Gang Wu

2024-05-12 Thread Gang Wu
>>>> On Sat, May 11, 2024 at 10:34 AM Andrew Lamb < > >> andrewlam...@gmail.com > >>>>>> wrote: > >>>>>> > >>>>>>> Congratulations Gang! That is very exciting. > >>>>>>> > >>>>>>>

Re: Archival of parquet-cpp repository

2024-05-11 Thread Gang Wu
Update: parquet-cpp has been archived by ASF via https://issues.apache.org/jira/browse/INFRA-25766 and now https://github.com/apache/parquet-cpp is read-only. On Sun, May 12, 2024 at 12:15 PM Micah Kornfield wrote: > I think this is a great idea, thanks for driving it Uwe. > > On Mon, May 6, 202

Re: Fwd: [C++] Parquet and Arrow overlap

2024-05-10 Thread Gang Wu
> Thanks, > Jacob > > Arrow committer > > On 2024/04/25 05:31:18 Gang Wu wrote: > > I know we have some non-Java committers and PMCs. But after the > parquet-cpp > > donation, it seems that no one worked on Parquet from arrow (cpp, rust, > go, > > etc.) >

[ANNOUNCE] Apache Parquet release 1.14.0

2024-05-07 Thread Gang Wu
Hi, I'm pleased to announce the release of Apache Parquet 1.14.0! Parquet is a general-purpose columnar file format for nested data. It uses space-efficient encodings and a compressed and splittable structure for processing frameworks like Hadoop. Changes are listed at: https://github.com/apache

[VOTE][RESULT] Release Apache Parquet 1.14.0 RC1

2024-05-07 Thread Gang Wu
With three +1 binding votes and additional +2 votes this release vote passes. +1 votes: Fokko Driesprong Gang Wu Gábor Szádovszky (binding) Gidon Gershinsky (binding) Xinli shang (binding) -1 votes: None Thank you all who have voted. Cheers, Gang

Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-07 Thread Gang Wu
> - ran with the Iceberg encryption code > > > > Cheers, Gidon > > > > > > On Tue, May 7, 2024 at 4:28 AM Gang Wu wrote: > > > > > Hi, > > > > > > It has been open for more than 72 hours already. We still need 2 more > >

Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-06 Thread Gang Wu
> > Since I've never used CHANGES.md to actually check a release content, I > > don't feel this issue is so crucial to fail this vote. I would let the > > other voters decide. > > +1 (binding) > > > > Gang Wu ezt írta (időpont: 2024. máj. 6., H, 3:33

Re: Parquet feature matrix

2024-05-06 Thread Gang Wu
Hi, There was an effort on this: https://github.com/apache/parquet-site/pull/34 It would be good if we can have something like what Apache Arrow does: - https://arrow.apache.org/docs/status.html - https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features But I do have concern tha

Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-05 Thread Gang Wu
+1 (non-binding) Verified signature, checksum and build. Thanks Fokko for doing this! Let me take care of the rest. Best, Gang On Mon, May 6, 2024 at 4:36 AM Fokko Driesprong wrote: > Hey everyone, > > +1 (non-binding) > > - Checked against Trino and the RC1 runs cleanly >

[VOTE][RESULT] Release Apache Parquet 1.14.0 RC0

2024-05-03 Thread Gang Wu
Hi, The vote for the Parquet 1.14.0 RC0 release has FAILED because of a possible compatibility issue. We will fix the issue before preparing 1.14.0 RC1. Thanks everyone! Best regards, Gang

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-05-03 Thread Gang Wu
PSHOT in Spark and ran a few tests too > > > > > > > > > > > > > > > > > > On Tue, Apr 30, 2024 at 10:20 AM Xinli shang > > > wrote: > > > > > > > +1 (binding) > > > > > > > > Validated the KE

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gang Wu
Thank you! On Tue, Apr 30, 2024 at 4:16 PM Gábor Szádovszky wrote: > By importing the KEYS file under [1] the check of the .asc file passed! > So, I went forward and updated the KEYS file under [2] with your new one. > > Giving +1 (binding) for the release > > Cheers, > G

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gang Wu
[2] https://dist.apache.org/repos/dist/release/parquet/KEYS On Tue, Apr 30, 2024 at 3:45 PM Gábor Szádovszky wrote: > Sure, please add your new public key to the referenced KEYS file then we > should be good. (The previous one would still be required to check the > previous releases, so do not remo

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gang Wu
file. Could you double check if you signed it with the correct key? > No other issues were discovered, so no RC1 is required for now if you can > change the .asc file for the current tarball. > > Cheers, > Gabor > > Gang Wu wrote (on Tue, Apr 30, 2024, 7:45): > >

[VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-29 Thread Gang Wu
Hi everyone, I propose the following RC to be released as the official Apache Parquet 1.14.0 release. The commit id is af0740229929337e1395fd24253a4ed787df2db3 * This corresponds to the tag: apache-parquet-1.14.0-rc0 * https://github.com/apache/parquet-mr/tree/af0740229929337e1395fd24253a4ed787df

Re: Parquet Sync meeting notes - April 23 2024

2024-04-25 Thread Gang Wu
Let me take a look at the exclusions of japicmp. Will try to remove them as much as possible. Best, Gang On Thu, Apr 25, 2024 at 10:01 PM Gábor Szádovszky wrote: > Sorry, I was not able to attend the meeting. Let me put some notes here: > > 2. We have been fighting with compatibility issues for

Re: Fwd: [C++] Parquet and Arrow overlap

2024-04-24 Thread Gang Wu
r as Parquet > committers? > > We are doing this (speaking as a Parquet PMC who didn't work on > parquet-mr, but parquet-cpp). > > Best > Uwe > > On Wed, Apr 24, 2024, at 2:38 PM, Gang Wu wrote: > > +1 for moving parquet-cpp issues from Apache Jira to Arrow's

Re: Fwd: [C++] Parquet and Arrow overlap

2024-04-24 Thread Gang Wu
+1 for moving parquet-cpp issues from Apache Jira to Arrow's GitHub issues. Besides, I want to echo Will's question in the thread. Should we consider Parquet developers from other projects than parquet-mr as Parquet committers? Currently apache/parquet-format and apache/parquet-testing repositories

Re: How to differentiate between Parquet V1 and V2

2024-04-23 Thread Gang Wu
As I have said in another thread, Parquet V2 is a concept which contains a lot of features. FWIW, the features defined in the specs [1] are finalized and some of them have been implemented in various implementations. Any file that contains one or more of those features can be considered v2 but the comm

Re: Parquet Sync meeting notes - April 23 2024

2024-04-23 Thread Gang Wu
I would expect so. parquet-mr has a complete implementation of all v2 encodings and some other Parquet implementations (e.g. Apache Arrow C++ and arrow-rs) have already supported most (if not all) v2 encodings for a long time. Best, Gang On Tue, Apr 23, 2024 at 11:02 PM Prem Sahoo wrote: > Are

Re: Next release date

2024-04-21 Thread Gang Wu
Hi David, There are already some discussions about the 1.14.0 release and it seems that many users are expecting it to be released soon. I will go through all the pending PRs this week and see if we can move forward to the release process. I will volunteer as the release manager and try to get it

Re: which version parquet is supported my parquet-mr 1.2.1

2024-04-16 Thread Gang Wu
Hi, The release note is https://github.com/apache/parquet-mr/blob/master/CHANGES.md, which would be helpful to check what feature is supported in each release. IMO, parquet v2 is a vague concept which contains a lot of features. Hope it helps. Best, Gang On Tue, Apr 16, 2024 at 6:26 AM Prem Saho

Re: Re: [DISCUSS] Parquet 1.14.0 and looking forward

2024-04-11 Thread Gang Wu
great speed ups on native > > > storage, especially SSD -more from the ability to do parallel block > reads > > > than anything else. What does that mean? use the hadoop raw local fS > and > > > you get it. It also means that any non-hadoop java code should u

Re: Reading corrupted parquet files

2024-04-03 Thread Gang Wu
Hi Cindy, From what I can tell, there were some discussions in the community on the next release: [1] and [2]. [1] https://lists.apache.org/thread/bgmpmrqqcsqlbgqd16cjryc0gvzj9kbx [2] https://lists.apache.org/thread/kttwbl5l7opz6nwb5bck2gghc2y3td0o Best, Gang On Wed, Apr 3, 2024 at 7:11 AM Cin

Re: Removal of deprecated code in parquet-format

2024-03-27 Thread Gang Wu
Thanks for the effort! +1 for removing this deprecated code if there is no objection. I took a glimpse at the public downstream of parquet-format at [1]. It seems the risk is low for the removal. [1] https://mvnrepository.com/artifact/org.apache.parquet/parquet-format/usages Best, Gang On Thu,

Re: Selecting format_version=2.6 ?

2024-03-17 Thread Gang Wu
293]\n > ---\n", 'F': > '../src/sys/xen_execute.cpp', 'L': '12414', 'R': 'pg_throw'} > > Is there any documentation on the configuration you mention below? Could > that have any

Re: Selecting format_version=2.6 ?

2024-03-15 Thread Gang Wu
Hi Stephen, Thanks for raising the issue! You are right that the version is always 1 written by parquet-mr. This is something we need to fix. However, IMHO, the community does not have a clear answer on the definition of parquet format v2. Which feature are you referring to specifically in the ver

Re: [VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-07 Thread Gang Wu
+1 (non-binding) Best, Gang On Fri, Mar 8, 2024 at 5:05 AM Edward Seidl wrote: > +1 (non-binding) > > Thanks for your work on this! > Ed > > From: Antoine Pitrou > Sent: Thursday, March 7, 2024 5:15 AM > To: d...@parquet.incubator.apache.org > Subject: [VOTE]

Re: parquet-format status

2024-03-05 Thread Gang Wu
Hi Vinoo, IMO, we cannot do this because the parquet-format repo serves as the dedicated place to hold the parquet specs, which includes the thrift definition file and a set of documents tagged for all versions. Some projects also directly reference the link of the markdown files, which will be br
