Re: Next steps for Variant

Fokko Driesprong Wed, 19 Mar 2025 09:52:56 -0700

As mentioned earlier, it looks like the implementations
<https://lists.apache.org/thread/361035vf227w11q6r6w6t67618jlpfbx> are
moving forward. I think it would be good to have the format released, as
long as we're all comfortable with the current state to unblock the other
releases.


Kind regards,
Fokko



Op wo 19 mrt 2025 om 04:15 schreef Gang Wu <[email protected]>:

> I'm fine with releasing the parquet-format to unblock variant (and
> geometry) development.
>
> We can add a statement to the documentation and release note to warn that
> this is experimental and subject to change.
>
> Best,
> Gang
>
> On Wed, Mar 19, 2025 at 7:08 AM Aihua Xu <[email protected]> wrote:
>
> > Hi Ryan,
> >
> > Thanks a lot for explaining the process to add Variant type annotation in
> > Java. It would be great if we can add the logical type now to unblock
> some
> > code paths which rely on such logical types.
> >
> > On Tue, Mar 18, 2025 at 1:22 PM Ryan Blue <[email protected]> wrote:
> >
> > > Hi everyone,
> > > Last community sync, we had a few Variant topics that I want to also
> > raise
> > > here.
> > >
> > > First, I want to highlight that we talked about spec versioning and
> there
> > > was agreement to avoid unnecessary complexity by maintaining one
> version
> > > that covers both encoding and shredding. The two parts of Variant are
> in
> > > separate markdown docs, but we don’t want to have fragmentation where
> an
> > > implementation follows the encoding but not shredding. We want to keep
> > the
> > > two fully compatible and ensure that you can rely on shredding for data
> > > skipping. Note that this doesn’t add many requirements for writers
> > > (shredding is optional), but ensures that shredding can be used by
> > > requiring support in readers. This wasn’t very controversial, but
> please
> > > reply if you have concerns about it.
> > >
> > > Second, in the sync I also brought up the need to get the Variant type
> > > annotation into a release. There was some pushback on this, I think,
> > > because non-Java projects have a simpler build process. I don’t think
> > many
> > > people realized how the Java side works:
> > >
> > >    - The thrift file is released in the parquet-format Jar
> > >    - The parquet-java maven build has a parquet-format-structures
> module
> > >    that downloads the parquet-format Jar and generates Java classes
> from
> > > the
> > >    Thrift definition
> > >    - Then there is a translation step in ParquetMetadataConverter that
> > >    converts to Parquet (rather than Thrift) classes
> > >    - The Parquet API adds utilities to use the converted metadata
> > classes,
> > >    like LogicalTypeAnnotationVisitor
> > >
> > > We discussed creating sample files to verify Variant compatibility, but
> > in
> > > order to produce those files from the Variant implementation I’m
> > building,
> > > I would need to produce a one-off version of the thrift definition,
> > inject
> > > it into the parquet-thrift-format-structures build, update the Parquet
> > API
> > > to add a VariantLogicalTypeAnnotation, and then find a way to get that
> > > temporary build into a Maven repository so that PRs can depend on it.
> > Those
> > > PRs can’t be merged because they would rely on an unpublished/snapshot
> > Jar.
> > >
> > > This is becoming a big headache and I don’t think that it helps us
> > deliver
> > > Variant reliably. We know that we are going to support Variant data and
> > the
> > > type annotation to label that data is going to be needed. There is
> little
> > > risk in committing that annotation now. The argument against adding it
> > now
> > > was that it may create confusion, but I think it is unlikely that
> anyone
> > > would see the annotation in parquet-java or parquet-format and assume
> > > support guarantees. The spec is clearly marked experimental and the
> > > annotation is referenced already by LogicalTypes.md
> > > <
> > >
> >
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant
> > > >.
> > > Adding the annotation to the API doesn’t significantly increase the
> > number
> > > of people seeing it, and the people that do are already working with
> > > Parquet internals.
> > >
> > > The only other objection I know about is that we haven’t yet finalized
> > what
> > > goes in the annotation, which I agree is a prerequisite for moving
> > forward.
> > > I suggest that we add the encoding/shredding version (1), release
> > > parquet-format, and start working on the API extensions in
> parquet-java.
> > > Then we can actually work on a Java implementation that stores data in
> > > files in the parquet-java project.
> > >
> > > Does that sound reasonable?
> > >
> > > Ryan
> > >
> >
>

Re: Next steps for Variant

Reply via email to