Makes sense to me. Caveated annotation sounds good On Tue, Mar 18, 2025 at 10:15 PM Gang Wu <[email protected]> wrote:
> I'm fine with releasing the parquet-format to unblock variant (and > geometry) development. > > We can add a statement to the documentation and release note to warn that > this is experimental and subject to change. > > Best, > Gang > > On Wed, Mar 19, 2025 at 7:08 AM Aihua Xu <[email protected]> wrote: > > > Hi Ryan, > > > > Thanks a lot for explaining the process to add Variant type annotation in > > Java. It would be great if we can add the logical type now to unblock > some > > code paths which rely on such logical types. > > > > On Tue, Mar 18, 2025 at 1:22 PM Ryan Blue <[email protected]> wrote: > > > > > Hi everyone, > > > Last community sync, we had a few Variant topics that I want to also > > raise > > > here. > > > > > > First, I want to highlight that we talked about spec versioning and > there > > > was agreement to avoid unnecessary complexity by maintaining one > version > > > that covers both encoding and shredding. The two parts of Variant are > in > > > separate markdown docs, but we don’t want to have fragmentation where > an > > > implementation follows the encoding but not shredding. We want to keep > > the > > > two fully compatible and ensure that you can rely on shredding for data > > > skipping. Note that this doesn’t add many requirements for writers > > > (shredding is optional), but ensures that shredding can be used by > > > requiring support in readers. This wasn’t very controversial, but > please > > > reply if you have concerns about it. > > > > > > Second, in the sync I also brought up the need to get the Variant type > > > annotation into a release. There was some pushback on this, I think, > > > because non-Java projects have a simpler build process. I don’t think > > many > > > people realized how the Java side works: > > > > > > - The thrift file is released in the parquet-format Jar > > > - The parquet-java maven build has a parquet-format-structures > module > > > that downloads the parquet-format Jar and generates Java classes > from > > > the > > > Thrift definition > > > - Then there is a translation step in ParquetMetadataConverter that > > > converts to Parquet (rather than Thrift) classes > > > - The Parquet API adds utilities to use the converted metadata > > classes, > > > like LogicalTypeAnnotationVisitor > > > > > > We discussed creating sample files to verify Variant compatibility, but > > in > > > order to produce those files from the Variant implementation I’m > > building, > > > I would need to produce a one-off version of the thrift definition, > > inject > > > it into the parquet-thrift-format-structures build, update the Parquet > > API > > > to add a VariantLogicalTypeAnnotation, and then find a way to get that > > > temporary build into a Maven repository so that PRs can depend on it. > > Those > > > PRs can’t be merged because they would rely on an unpublished/snapshot > > Jar. > > > > > > This is becoming a big headache and I don’t think that it helps us > > deliver > > > Variant reliably. We know that we are going to support Variant data and > > the > > > type annotation to label that data is going to be needed. There is > little > > > risk in committing that annotation now. The argument against adding it > > now > > > was that it may create confusion, but I think it is unlikely that > anyone > > > would see the annotation in parquet-java or parquet-format and assume > > > support guarantees. The spec is clearly marked experimental and the > > > annotation is referenced already by LogicalTypes.md > > > < > > > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant > > > >. > > > Adding the annotation to the API doesn’t significantly increase the > > number > > > of people seeing it, and the people that do are already working with > > > Parquet internals. > > > > > > The only other objection I know about is that we haven’t yet finalized > > what > > > goes in the annotation, which I agree is a prerequisite for moving > > forward. > > > I suggest that we add the encoding/shredding version (1), release > > > parquet-format, and start working on the API extensions in > parquet-java. > > > Then we can actually work on a Java implementation that stores data in > > > files in the parquet-java project. > > > > > > Does that sound reasonable? > > > > > > Ryan > > > > > >
