Re: [DISCUSS] Unify Record / Row terminology (to Row)

2024-05-31 Thread Julien Le Dem
I agree on making the naming consistent. Row is a good choice. I also agree that only the Thrift IDs are part of the spec, as that’s what ends up in the binary. On Fri, May 31, 2024 at 13:41 Andrew Lamb wrote: > I filed a JIRA[1] and a PR[2] to change parquet.thrift to use "row" > > Thanks, > Andrew > > [1]

Re: [DISCUSS] Extensibility of Parquet

2024-05-31 Thread Julien Le Dem
I think it would be a good idea to have an extension mechanism that allows embedding extra information in the format. Something akin to what Alkis is suggesting: having a reserved extension point. - The file can still be read by a standard parquet implementation without extra libraries - Vendors can

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-31 Thread Julien Le Dem
Micah, would it make sense to start a google doc specifically to discuss: - the goals (there could be a few subsets) - the candidate encodings - the existing/future prototypes to validate candidates. On Thu, May 30, 2024 at 3:14 AM Steve Loughran wrote: > be good for a benchmark to be targetable

[DISCUSS] Improve Bloom Filter documentation?

2024-05-31 Thread Andrew Lamb
When we first looked into Parquet bloom filters[1] it was hard to understand how effective they would be for a given amount of space overhead. When we plugged our data's cardinality into the target ndv and fpp parameters, it implied 2MB bloom filters *per column* per row group which was unacceptab
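For a rough sense of how the ndv and fpp parameters translate into space, here is a minimal sketch using the textbook Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2. This is an assumption for illustration only: Parquet's split block Bloom filter may size itself differently, and the function names below are hypothetical, not part of any Parquet library.

```python
import math


def classic_bloom_bits(ndv: int, fpp: float) -> int:
    """Bits needed by a textbook Bloom filter (not Parquet's exact sizing).

    ndv: expected number of distinct values inserted.
    fpp: target false-positive probability, strictly between 0 and 1.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    # m = -n * ln(p) / (ln 2)^2
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


def classic_bloom_bytes(ndv: int, fpp: float) -> int:
    """Same estimate, rounded up to whole bytes."""
    return math.ceil(classic_bloom_bits(ndv, fpp) / 8)


if __name__ == "__main__":
    # Hypothetical inputs: one million distinct values per row group.
    for fpp in (0.1, 0.01, 0.001):
        size = classic_bloom_bytes(1_000_000, fpp)
        print(f"ndv=1,000,000 fpp={fpp}: about {size / 1e6:.2f} MB")
```

Under this estimate the size grows linearly with ndv and only logarithmically with 1/fpp, so a high-cardinality column quickly reaches megabyte-sized filters even at a modest false-positive target.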

Re: [DISCUSS] Unify Record / Row terminology (to Row)

2024-05-31 Thread Andrew Lamb
I filed a JIRA[1] and a PR[2] to change parquet.thrift to use "row" Thanks, Andrew [1]: https://issues.apache.org/jira/browse/PARQUET-2488 [2]: https://github.com/apache/parquet-format/pull/256 On Wed, May 29, 2024 at 8:45 AM Antoine Pitrou wrote: > > I agree that "row" is a more widespread te

Re: [DISCUSS] Unify Record / Row terminology (to Row)

2024-05-31 Thread Andrew Lamb
I think the names of classes in the code can differ from how the spec refers to the concepts, if the maintainers don't mind. In my mind, changing the parquet.thrift file to use consistent terminology doesn't change the spec, nor will it require (or prevent) implementations from changing their in

Re: [DISCUSS] Arrow dropping Java 8 support

2024-05-31 Thread Fokko Driesprong
Thanks Weston for bringing this up. It's good to have everyone's attention. As mentioned in the thread that Weston refers to [1], a few projects are leaping to Java 17 (Spark 4), and the rest of them are looking at each other to move first :) This was also discussed back in February on the Iceberg D

Re: [Parquet-java] Are there release instructions documented any place?

2024-05-31 Thread Steve Loughran
Now is your chance to improve them. FWIW Hadoop has a separate module to automate as many of the operations as can be done: https://github.com/apache/hadoop-release-support 1. this includes things like patching the x86 tarball with the arm binaries and generating new checksums 2. GPG

Re: [DISCUSS] Arrow dropping Java 8 support

2024-05-31 Thread Steve Loughran
1. Hadoop doesn't use Arrow. 2. The Hadoop team would love to drop Java 8 and in the last release said "will happen soon" 3. all the client stuff is happy on Java 17+, it's just some of the server side stuff which is a bit of a pain point. 4. Hadoop 3.2.x is needed to move beyond Java 8.

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-31 Thread Rok Mihevc
Would we also want to add issue templates to encourage some structure? See [1] for inspiration. [1] https://github.com/apache/arrow/blob/main/.github/ISSUE_TEMPLATE On Fri, May 31, 2024 at 3:50 AM Gang Wu wrote: > Thanks Dewey! > > Please note that parquet-site has already enabled GitHub issues