Parquet Sync Notes July 17th 2024

Julien Le Dem Wed, 17 Jul 2024 16:22:41 -0700

Attendees:

- Parth: contributor to Apache Drill, Arrow-java. Currently working on
  DataFusion-Comet (Spark impl). Topics: What is the direction of Parquet
  “V3”? Can we make Parquet on S3 easier and better?
- Micah: OSS data formats for BigQuery at Google. Topics: Interested in
  keeping Parquet current. How do we manage incompatible changes and
  releases for Parquet Java?
- Gabor: Dremio, Parquet community member for a long time. Topics:
  Interested in the “Parquet V3” initiative.
- Alkis: Databricks storage stacks (scans, clouds). Worked at Google on
  fancy encodings in the past. Topics: Discussing Parquet metadata handling
  and new encodings. Specifically, the extension proposal: [EXTERNAL]
  Parquet extensions
  <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
- Fokko: Tabular/Databricks. Topics: Deprecation of old Hadoop and Java
  support. Jira -> GitHub migration.
- Ed: Lawrence Livermore, power user of Parquet. Topics: How to use
  Parquet with GPUs. cuDF. Parquet improvement process.
- Julien: Datadog, very early committer/founder. Topics: Interested in
  making Parquet current, new encodings. How do we solve footer metadata
  scaling? How to make incompatible changes without breaking everyone. How
  to add new encodings but not too many. How to improve time series on top
  of Parquet. How to merge sorted files in an efficient way.
- Nong: Databricks, very early committer/founder. Topics: Interested in
  the V3 stuff.

Agenda: (built from topics above)

- Direction of “Parquet V3”
- Parquet on S3?
- Merging sorted files efficiently
- More efficient for GPUs?
- Alkis’ extension proposal: [EXTERNAL] Parquet extensions
  <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>
- Micah’s Parquet format changes process (we didn’t get to it)
- Jira -> GitHub migration (we didn’t get to it)

Notes:

- Direction for Parquet improvement:
  - Rolling out incompatible changes:
    - Duplicate information until we’re satisfied, for metadata and for
      encodings as well.
  - Better metadata:
    - Support for wide schemas
  - Better encodings:
    - More efficient for GPUs? => parallelizable
    - Better for time series. Delta encodings?
    - BtrBlocks…
  - Merge sorted files on the fly:
    - Ability to read the file from the beginning as a whole, since
      we’re reading the whole file anyway.
    - Side effect: being able to ignore the footer and still read the
      file in case of corruption.
    - Log-structured merge of Parquet files. Ed: interested in doing this
      for GPUs. Need better prediction of the size of data to load into
      GPU memory.
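
The "merge sorted files on the fly" item can be illustrated with a k-way streaming merge. This is only a minimal sketch in plain Python, not Parquet code: each "file" is assumed to be an iterator that yields rows in sorted order when read front to back, which is the access pattern discussed above. The helper name `merge_sorted_streams` and the sample rows are hypothetical.

```python
import heapq
from typing import Any, Callable, Iterable, Iterator


def merge_sorted_streams(
    streams: Iterable[Iterable[Any]],
    key: Callable[[Any], Any] = lambda row: row,
) -> Iterator[Any]:
    """Lazily merge already-sorted row streams into one sorted stream.

    Only one pending row per input is buffered, so every input can be
    consumed from the beginning without random access.
    """
    # heapq.merge performs a lazy k-way merge over sorted inputs.
    return heapq.merge(*streams, key=key)


# Hypothetical time-series rows (timestamp, value), each input sorted by timestamp.
file_a = iter([(1, "a"), (4, "d"), (7, "g")])
file_b = iter([(2, "b"), (3, "c"), (9, "i")])
file_c = iter([(5, "e"), (6, "f"), (8, "h")])

merged = list(merge_sorted_streams([file_a, file_b, file_c], key=lambda r: r[0]))
print([ts for ts, _ in merged])
```

In a real reader the iterators would be backed by sequential scans of row groups, and memory use would depend on how well the size of each decoded chunk can be predicted, which is the GPU concern raised above.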
- [EXTERNAL] Parquet extensions
  <https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6>:
  - Need an extension mechanism to experiment with a new footer or new
    metadata.
    - In a fully forward/backward compatible manner.
  - Proposed:
    - Reserve a binary field in Thrift that will not be used by the
      spec.
    - Thrift ignores unknown field ids, so old readers will simply
      skip it.
  - Discussion:
    - 2 use cases:
      - Proprietary extensions
      - New Parquet footer
    - Remove the notion of vendor to focus on the path to migrate to the
      new footer.
  - Action Items:
    - Alkis to integrate feedback into his proposal and follow up on the
      list to finalize.
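
The forward-compatibility argument in the extension proposal (old readers skip field ids they do not know) can be sketched with a toy tagged encoding. This is an illustration under stated assumptions, not Thrift's actual wire format and not the proposal's layout: the framing, the field names, and the reserved id `EXTENSION_FIELD_ID` are all hypothetical.

```python
import struct
from typing import Dict, Tuple

# Toy wire format (NOT Thrift's real encoding). Each field is:
#   field id (2 bytes, big-endian) | payload length (4 bytes) | payload
KNOWN_FIELDS_OLD_READER = {1: "version", 2: "num_rows"}
EXTENSION_FIELD_ID = 32767  # hypothetical reserved id, chosen for illustration


def encode(fields: Dict[int, bytes]) -> bytes:
    """Serialize fields as id/length/payload records."""
    out = bytearray()
    for field_id, payload in fields.items():
        out += struct.pack(">HI", field_id, len(payload))
        out += payload
    return bytes(out)


def decode(buf: bytes, known: Dict[int, str]) -> Tuple[Dict[str, bytes], int]:
    """Decode known fields and skip unknown ones.

    Returns the decoded fields and the number of fields skipped.
    """
    result: Dict[str, bytes] = {}
    skipped = 0
    pos = 0
    while pos < len(buf):
        field_id, length = struct.unpack_from(">HI", buf, pos)
        pos += 6  # size of the id + length header
        payload = buf[pos:pos + length]
        pos += length
        if field_id in known:
            result[known[field_id]] = payload
        else:
            skipped += 1  # unknown field id: step over its payload
    return result, skipped


footer = encode({1: b"\x02", 2: b"\x64", EXTENSION_FIELD_ID: b"new-footer-bytes"})

# An old reader does not know the extension field and ignores it.
old_view, skipped = decode(footer, KNOWN_FIELDS_OLD_READER)

# A new reader registers the extension field and can use its contents.
new_view, _ = decode(footer, {**KNOWN_FIELDS_OLD_READER, EXTENSION_FIELD_ID: "extension"})

print(old_view, skipped)
print(new_view["extension"])
```

Because the length prefix lets a reader step over a payload it cannot interpret, the same bytes remain readable by both old and new readers, which is the property the reserved binary field relies on.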
- Action Items:
  - Since we didn’t get to 2 items, we agreed to change the meeting to
    bi-weekly for now, until it’s no longer needed. The next meeting is
    July 31st at 9:30am PT - 12:30pm ET - 6:30pm CET.
  - Join here: https://calendar.app.google/UMMmbUV1JMh7ffGt6
  - Follow up on the mailing list on the items we didn’t get to:
    - Micah’s proposal for the Parquet format changes process
    - Jira -> GitHub migration
