Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
pitrou commented on PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#issuecomment-2314641264 Hmm, perhaps we should put all of this in a `footers` subdirectory? This repo may be useful for other kinds of data at some point. -- This is an automated message from the Apache Git

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
pitrou commented on PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#issuecomment-2314642799 Also, can the binaries be compressed somehow?

Re: [DISCUSS] Adopt Variant Spec from Spark?

2024-08-28 Thread Antoine Pitrou
I would favor a dedicated repo, to avoid giving the impression that it is somehow tied to the Parquet file format. Regards Antoine. On Mon, 26 Aug 2024 09:39:49 -0700 Ryan Blue wrote: > I think it makes sense to either put it in parquet-format or its own repo. > I think the main thing is tha

Re: [DISCUSS] new Parquet footer experiments

2024-08-28 Thread Antoine Pitrou
I suppose you already know this, but you can use public datasets as a source of real-world Parquet footers. For example, the GeoParquet website lists a couple of data providers: https://geoparquet.org/ Regards Antoine. On Sun, 18 Aug 2024 14:20:28 +0200 Alkis Evlogimenos wrote: > The biggest t
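
For anyone pulling footers out of such public files, here is a minimal sketch of slicing the footer from a Parquet file. It relies only on the standard file layout: the serialized file metadata, then a 4-byte little-endian metadata length, then the 4-byte magic PAR1 at the very end. The name extract_footer is hypothetical and this is not the tooling from the parquet-benchmark PR. Files with an encrypted footer (magic PARE) are rejected rather than handled.

```python
import struct

MAGIC = b"PAR1"            # magic bytes at both ends of a plaintext-footer Parquet file
ENCRYPTED_MAGIC = b"PARE"  # magic used when the footer is encrypted


def extract_footer(data: bytes) -> bytes:
    """Return the raw serialized footer (file metadata) bytes of a Parquet file.

    Hypothetical helper for illustration; it does not parse the Thrift payload.
    """
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[-4:] == ENCRYPTED_MAGIC:
        raise ValueError("encrypted footers are not handled by this sketch")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    # The 4 bytes just before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return data[start:len(data) - 8]
```

To use it on a downloaded file, read the bytes with open(path, "rb").read() and pass them in. For very large files it would be better to seek to the last 8 bytes first instead of reading everything.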

Re: flatbuffer metadata: work-in-progress

2024-08-28 Thread Antoine Pitrou
Do you gain much from limiting row groups to 2^31 values and bytes? I generally find 32-bit lengths to be a bit of an anti-pattern, as they require dedicated logic in the writer to ensure sufficient chunking. Regards Antoine. On Mon, 26 Aug 2024 10:35:38 +0200 Alkis Evlogimenos wrote: > At the top
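
To make the "dedicated logic in the writer" concrete, here is a minimal sketch of the kind of chunking a writer would need if row-group counts had to fit in a signed 32-bit field. The function name chunk_sizes and the exact limit of 2^31 - 1 are assumptions for illustration; they are not taken from the flatbuffer proposal.

```python
MAX_INT32 = 2**31 - 1  # assumed cap: largest value a signed 32-bit field can hold


def chunk_sizes(num_values: int, limit: int = MAX_INT32) -> list[int]:
    """Split num_values into consecutive chunk sizes that never exceed limit.

    Hypothetical helper: a real writer would also have to cap byte sizes,
    which depend on encoding and compression and cannot be computed up front.
    """
    if num_values < 0:
        raise ValueError("num_values must be non-negative")
    if limit <= 0:
        raise ValueError("limit must be positive")
    full_chunks, remainder = divmod(num_values, limit)
    sizes = [limit] * full_chunks
    if remainder:
        sizes.append(remainder)
    return sizes
```

With a small limit for readability, chunk_sizes(10, 4) gives three row groups of 4, 4 and 2 values. The value count is the easy half: enforcing a byte limit means the writer must track the encoded size as it goes and cut a row group before the limit is crossed.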

Re: flatbuffer metadata: work-in-progress

2024-08-28 Thread Alkis Evlogimenos
Yes, the gains are substantial. This is one of the biggest optimizations. They range from 25% to 75% (a 4x reduction) depending on how much other stuff the footer has. Footers without stats get about 4x smaller. With stats they are 2x smaller. On Wed, Aug 28, 2024 at 10:32 AM Antoine Pitrou wrote:

Re: [DISCUSS] new Parquet footer experiments

2024-08-28 Thread Alkis Evlogimenos
Yes, once https://github.com/apache/parquet-benchmark/pull/1 is merged, I plan to pull a few of those footers in so that they can seed the project and at the same time show how donations should be made. On Wed, Aug 28, 2024 at 10:25 AM Antoine Pitrou wrote: > > I suppose you already know this, bu

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
alkis commented on PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#issuecomment-2314806344 > Also, can the binaries be compressed somehow? Are you referring to the executables? I can compress them but then they become less convenient to run. I would rather keep the conv

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
pitrou commented on PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#issuecomment-2314849447 Yes, I'm referring to the executables. It's better to avoid large files in git repos IMHO.

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
alkis commented on PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#issuecomment-2314901935 I tried: they compressed to 22 MB (with both in one zip file).

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
alkis commented on PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#issuecomment-2314916770 Ok I put them in a zip.

Re: [VOTE] Apache Parquet Java 1.14.2 RC2

2024-08-28 Thread Fokko Driesprong
Thanks everyone for testing, and thanks Julien for pointing out the failing CI. I did some checks, and it has to do with the Maven version. We use the one that comes with GitHub Actions, so it is not pinned to a specific version. I just merged the Maven Wrapper PR

Re: [VOTE] Apache Parquet Java 1.14.2 RC2

2024-08-28 Thread Fokko Driesprong
Could a PMC member run the following command? svn mv https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.14.2-rc2/ https://dist.apache.org/repos/dist/release/parquet/apache-parquet-1.14.2 -m "Parquet: Add release 1.14.2" Thanks! :) On Wed, 28 Aug 2024 at 13:53, Fokko Driesprong wrote

Parquet Sync Aug 28th

2024-08-28 Thread Julien Le Dem
The next Parquet Sync is happening today at 9:30am PT - 12:30pm ET - 6:30pm CET (in 30min) To join the invite: https://calendar.app.google/61H58BfhTbY82tuZ6 Everybody is welcome, bring your topic or just listen in. Best Julien

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
emkornfield commented on code in PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#discussion_r1735005498 ## README.md: ## @@ -1 +1,28 @@ -# Apache Parquet Benchmarking +# Parquet benchmark data + +This repository contains Parquet benchmark data. Such data is useful

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
emkornfield commented on PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#issuecomment-2315842890 @julienledem let me know if you still want to review, otherwise I'll merge tonight or tomorrow.

Notes Parquet Sync Aug 28th

2024-08-28 Thread Julien Le Dem
Attendees: - Alkis: Databricks storage and IO. goals: make Parquet metadata better for wide schemas and in general. - Get pr in on parquet-benchmark - Extensions PR in review - Review ongoing footer experiments. - Micah: Google. Listen in

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
julienledem commented on PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1#issuecomment-2315953157 > @julienledem let me know if you still want to review, otherwise I'll merge tonight or tomorrow. I just reviewed (but don't need/want to be a bottleneck), please merge, th

Re: [PR] Add instructions for footer donations [parquet-benchmark]

2024-08-28 Thread via GitHub
emkornfield merged PR #1: URL: https://github.com/apache/parquet-benchmark/pull/1

[PR] Add huggingface public dataset footers [parquet-benchmark]

2024-08-28 Thread via GitHub
alkis opened a new pull request, #3: URL: https://github.com/apache/parquet-benchmark/pull/3 These are footers from some public datasets from [Hugging Face](http://huggingface.co/). * https://huggingface.co/datasets/polinaeterna/amazon_apparel * https://huggingface.co/datasets/GitB

Re: [PR] Add Hugging Face public dataset footers [parquet-benchmark]

2024-08-28 Thread via GitHub
emkornfield commented on PR #3: URL: https://github.com/apache/parquet-benchmark/pull/3#issuecomment-2316082060 @alkis I don't think we can include Parquet data from the first two links, as I can't find a license for the data. The second two links seem to be explicitly licensed under Apache 2

Re: [PR] Add Hugging Face public dataset footers [parquet-benchmark]

2024-08-28 Thread via GitHub
emkornfield commented on PR #3: URL: https://github.com/apache/parquet-benchmark/pull/3#issuecomment-2316082681 Can we add a LICENSE file that attributes the files we end up including, along with their licenses?

Re: [PR] Add Hugging Face public dataset footers [parquet-benchmark]

2024-08-28 Thread via GitHub
Fokko commented on PR #3: URL: https://github.com/apache/parquet-benchmark/pull/3#issuecomment-2316099456 I don't think we're going to make releases of these files, but I agree with @emkornfield that it is important to only use [compatible licenses](https://www.apache.org/legal/resolved.html#c

Re: [PR] Add Hugging Face public dataset footers [parquet-benchmark]

2024-08-28 Thread via GitHub
alkis commented on PR #3: URL: https://github.com/apache/parquet-benchmark/pull/3#issuecomment-2316119461 Removed ones without licenses and added those with an explicit open source one. Added a README in the subdirectory with a table of licenses.

Re: [VOTE] Parquet binary protocol extensions

2024-08-28 Thread Gang Wu
+1 for the proposal Best, Gang On Wed, Aug 28, 2024 at 8:03 AM Corwin Joy wrote: > +1 > > On Tue, Aug 27, 2024, 3:07 PM Julien Le Dem wrote: > > > +1 > > (for reference, discussion thread: > > https://lists.apache.org/thread/63mtbq7mydrxd0b9nc5kwgqnhkmp7684 ) > > > > On Mon, Aug 26, 2024 at 11