Attendees:
- Micah: Google
- Ryan: Databricks, Variant
- Andrew: Influx
- Gene: Databricks, Variant
- Ashish
- Alkis: Databricks, Metadata v3
- Aihua: Snowflake, Variant
- Rok: Fintech
- Riza: Cloudera, Impala
- Steve: Cloudera
- Julien: Datadog, discussion around Metadata v3 + Variant

Agenda:
- Variant type
  - Moving Variant to the Parquet Project
    <https://docs.google.com/document/d/1guEzBQjzOEEZvvibeZjNraKmZHWtxQR95O_DvtZU0xw/edit#heading=h.5ad5xy8ox6bp>
  - Overview
    - Spec in /parquet-format
    - Java impl in /parquet-java
    - Need to rapidly release changes
    - Have a build scoped to variant in parquet-java, to iterate faster
    - Disclaimer on the spec to start with (work in progress)
    - Need to define some logical types
  - Next steps:
    - Gene: Open a PR on /parquet-format with disclaimer on work in
      progress
    - Next: parquet-java implementation. TODO: figure out actual build
      delineation
    - Eventually we will vote to remove the disclaimer and make it
      official
- Alkis: update on Metadata v3
  - New improvements since a PR was opened with benchmarks:
      0/amazon_polarity: num-rgs=900 num-cols=3 thrift=1049k flatbuf=230k packed=139k
      1/amazon_reviews_books: num-rgs=159 num-cols=43 thrift=750k flatbuf=240k packed=160k
      2/cmrc2018: num-rgs=4 num-cols=10 thrift=16k flatbuf=3.8k packed=2.6k
      3/dbr-fleet-example-0: num-rgs=4 num-cols=2950 thrift=2.1M flatbuf=1035k packed=709k
      4/dbr-fleet-example-1: num-rgs=1 num-cols=2987 thrift=818k flatbuf=554k packed=420k
      5/everyday_conversations: num-rgs=3 num-cols=12 thrift=14k flatbuf=5.2k packed=3.1k
  - Perf improvement to thrift needs review:
    https://github.com/apache/thrift/pull/3037
  - We need committers to respond promptly to PRs in
    https://github.com/apache/parquet-benchmark/
    - Possibly Daniel Weeks, Nong, Ryan Blue can help
- For reference:
  - https://www.influxdata.com/blog/how-good-parquet-wide-tables/
    TL;DR: in the Rust implementation, at least, an improvement of 4x or
    more is available with no format changes, just software engineering
  - https://www.vldb.org/pvldb/vol16/p2769-durner.pdf is a great read
    about how to size object store requests
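
To make the Metadata v3 numbers above easier to compare, here is a minimal Python sketch that parses the six benchmark lines and computes how many times larger the thrift size is than the flatbuf and packed sizes. It is not part of the benchmark code. The helper names (`parse_size`, `parse_line`, `reduction_factors`) are made up for illustration. It also assumes that `k` and `M` are decimal multipliers, which only matters for the one row that mixes `M` and `k`.

```python
# Benchmark lines copied verbatim from the notes above.
RAW = """\
0/amazon_polarity: num-rgs=900 num-cols=3 thrift=1049k flatbuf=230k packed=139k
1/amazon_reviews_books: num-rgs=159 num-cols=43 thrift=750k flatbuf=240k packed=160k
2/cmrc2018: num-rgs=4 num-cols=10 thrift=16k flatbuf=3.8k packed=2.6k
3/dbr-fleet-example-0: num-rgs=4 num-cols=2950 thrift=2.1M flatbuf=1035k packed=709k
4/dbr-fleet-example-1: num-rgs=1 num-cols=2987 thrift=818k flatbuf=554k packed=420k
5/everyday_conversations: num-rgs=3 num-cols=12 thrift=14k flatbuf=5.2k packed=3.1k
"""

# Assumption: "k" and "M" are decimal multipliers (the notes do not say).
UNITS = {"k": 1_000, "M": 1_000_000}


def parse_size(text):
    """Convert a size such as '1049k', '3.8k' or '2.1M' to a float."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def parse_line(line):
    """Split 'name: key=value ...' into (name, dict of parsed fields)."""
    name, _, rest = line.partition(": ")
    fields = dict(pair.split("=") for pair in rest.split())
    return name, {
        "num_rgs": int(fields["num-rgs"]),
        "num_cols": int(fields["num-cols"]),
        "thrift": parse_size(fields["thrift"]),
        "flatbuf": parse_size(fields["flatbuf"]),
        "packed": parse_size(fields["packed"]),
    }


def reduction_factors(raw):
    """Map each dataset to (thrift/flatbuf, thrift/packed) size ratios."""
    result = {}
    for line in raw.strip().splitlines():
        name, row = parse_line(line)
        result[name] = (
            row["thrift"] / row["flatbuf"],
            row["thrift"] / row["packed"],
        )
    return result


factors = reduction_factors(RAW)
for name, (vs_flatbuf, vs_packed) in factors.items():
    print(f"{name}: thrift/flatbuf={vs_flatbuf:.1f}x thrift/packed={vs_packed:.1f}x")
```

On these numbers, the thrift size for 0/amazon_polarity is about 4.6x the flatbuf size and about 7.5x the packed size. For the very wide 3/dbr-fleet-example-0 file the ratios are closer to 2x and 3x.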
On Wed, Sep 25, 2024 at 8:00 AM Julien Le Dem <[email protected]> wrote:
> The Parquet Sync is happening today at 9:30am PT - 12:30pm ET - 6:30pm CET
> (in 90 mins)
> To join the invite:
> https://calendar.app.google/uM78Qf3YiTAaPm5g8
>
> Everybody is welcome, bring your topic or just listen in.
> Best
> Julien
>