Thank you for the email updates. We read them during the meeting, which was
quite useful.
Notes:
Attendees:
-
Gene - Databricks
-
Micah - Google
-
Gabor - Dremio
-
Fokko - Databricks
-
Aihua - Snowflake
-
Raul - QuantStack
-
Neil - Snowflake
-
Kenny - HyperParam (author of HyParquet)
-
Julien - Datadog
-
Antoine - QuantStack
-
Russell - Snowflake
-
Rok -
Agenda:
-
Variant update: Gene/Daniel/Andrew/Gabor/Fokko/Rok/Aihua
-
(email updates from people who could not attend)
-
Daniel:
With respect to the reference implementations for Variant, we had discussed
the possibility of Rust or C++, but those both have significant work. The
Java and native Python implementations are much closer and should cover the
concerns for verification of the spec. I still think there will be work on
the Rust side, but I don't think there's a C++ implementation that would be
in a state to open source. For the shredding spec, Micah, Ryan, Russell and
I met and are closing in on wording that everyone is happy with, so I
expect that will close out shortly.
-
Andrew:
As a brief update, I am working on finding someone to help with the Rust
implementation of variant. Moving forward with Java and Python seems
reasonable to me, though I would truly love to get a Rust implementation to
ensure there are no potential gotchas for a native implementation.
-
Gene update:
-
Variant binary encoding/decoding:
-
Java implementation under review in parquet-java repo
-
Python implementation in PySpark.
-
Pure Python, in the Spark repo.
-
Variant shredding:
-
Still working on the implementation in Spark.
-
What are the next steps?
-
TODO: follow up on the mailing list.
-
How are we releasing Variant?
-
Current plan: Release Variant + shredding together.
-
Releasing the Variant binary encoding first would be a possibility.
-
Remaining questions:
-
New types added to the format
-
Nano timestamp: need clarification on actual semantics. A long
cannot store a nanosecond timestamp with a practical range.
Year 9999, often used as a special value, does not fit in a long.
-
Avro, Arrow, and Iceberg have a 64-bit nano ts.
-
Limited range:
-
From 1677-09-21 00:12:43.145224193
-
To 2262-04-11 23:47:16.854775807
-
Snowflake implementation: defaults to 8 bytes, expandable to 16
bytes.
-
Variant could support a 16-byte version.
-
Step 1:
-
Support in Variant:
-
64-bit micros ts
-
64-bit micros ts without timezone
-
64-bit nanos ts
-
64-bit nanos ts without timezone
-
Next step:
-
Gene to follow up with Russell, Ryan, Antoine
-
Step 2:
-
Add a picosecond ts to Parquet. Define how it is mapped to
native types.
-
Constraints:
-
Number of type codes
-
Using 20+ already (including nano ts, time, UUID)
-
6 bits: 64 types maximum (we might reserve the last one for extension).
-
Interval: full range of SQL types.
-
Russell:
-
Someone at Snowflake will look into the parquet-cpp implementation of
Variant.
-
Covers both binary Variant and shredding.
-
Antoine:
-
Do we have official Variant test cases to test various
implementations?
-
It would be nice to provide a set of test cases for cross language
compatibility.
-
Variant implementations:
-
Java implementation (https://github.com/apache/parquet-java/pull/3117)
-
Python PR? https://github.com/apache/spark/pull/49591
-
There seem to be roundtrip tests against Spark here:
https://github.com/apache/spark/blob/54a59b7f3ceb575e478650ab8ead01922595ea17/python/pyspark/sql/tests/test_types.py#L2060
-
Wide schema performance problem: [Antoine] new footer
-
Interest in this work
-
Russell is also interested.
-
Need to talk about encryption in the new footer.
-
Opportunity to improve encryption handling in the footer.
-
TODO: follow up with Alkis
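
On the nano timestamp range noted above: the following is a minimal Python sketch, not something presented in the meeting, that reproduces the two bounds from the width of a signed 64-bit integer. The helper names (nanos_to_text, fits_in_int64_nanos) are hypothetical and exist only for this illustration. The lower bound quoted in the notes corresponds to one above the smallest 64-bit value; the sketch assumes the smallest value itself is set aside, as pandas does for NaT.

```python
from datetime import datetime, timedelta

# Bounds of a signed 64-bit integer (a "long").
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_SECOND = 1_000_000_000
EPOCH = datetime(1970, 1, 1)  # naive datetime, treated as UTC


def nanos_to_text(nanos: int) -> str:
    """Format nanoseconds since the Unix epoch as 'YYYY-MM-DD HH:MM:SS.fffffffff'."""
    # Split into whole seconds and a non-negative nanosecond remainder.
    seconds, frac = divmod(nanos, NANOS_PER_SECOND)
    moment = EPOCH + timedelta(seconds=seconds)
    return f"{moment.isoformat(sep=' ')}.{frac:09d}"


def fits_in_int64_nanos(moment: datetime) -> bool:
    """Return True if `moment` can be stored as int64 nanoseconds since the epoch."""
    delta = moment - EPOCH
    nanos = (delta.days * 86_400 + delta.seconds) * NANOS_PER_SECOND
    nanos += delta.microseconds * 1_000
    return INT64_MIN <= nanos <= INT64_MAX


print(nanos_to_text(INT64_MIN + 1))                 # 1677-09-21 00:12:43.145224193
print(nanos_to_text(INT64_MAX))                     # 2262-04-11 23:47:16.854775807
print(fits_in_int64_nanos(datetime(9999, 12, 31)))  # False
```

The last line shows why a year 9999 sentinel cannot be represented in a 64-bit nano ts: it lies far beyond the 2262 upper bound.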
On Wed, Jan 22, 2025 at 9:05 AM Andrew Lamb wrote:
> I also unfortunately will not be able to make it today.
>
> As a brief update, I am working on finding someone to help with the Rust
> implementation of variant. Moving forward with Java and Python seems
> reasonable to me, though I would truly love to get a Rust implementation to
> ensure there is no potential gotcha's for a native implementation
>
> Thanks,
> Andrew
>
> On Wed, Jan 22, 2025 at 11:41 AM Daniel Weeks wrote:
>
> > Hey Julien,
> >
> > I'm not going to be able to attend today's meeting, but just wanted to
> > follow up on a few of the items from the last meeting.
> >
> > With respect to the reference implementations for Variant, we had
> > discussed the possibility of Rust or C++, but those both have significant
> > work. The Java and native Python implementations are much closer and
> > should cover the concerns for verification of the spec. I still think
> > there will be work on the Rust side, but I don't think there's a C++
> > implementation that would be in a state to open source.
> >
> > For the shredding spec, Micah, Ryan, Russel and I met and are closing in on
> > wording that everyone is happy with, so I expect that will close out
> > shortly.
> >
> > -Dan
> >
> > On Wed, Jan 22, 2025 at 7:41 AM Julien Le Dem wrote:
> >
> > > The next Parquet sync is today Jan 22nd at 9:30am PT - 12:30pm ET - 6:30pm CET
> > > (in about 2hs)
> > > To join the invite:
> > > https://calendar.app.google/xXGgYU6evBArpzdZ9
> > > Please contact me to be added to the recurring invite.
> > > Everybody is welcome, bring your topic or just listen in.
> > > Best
> > > Julien
> > >
> >
>