I would favor a dedicated repo, to avoid giving the impression that it is somehow tied to the Parquet file format.

Regards

Antoine.

On Mon, 26 Aug 2024 09:39:49 -0700 Ryan Blue <b...@databricks.com.INVALID> wrote:
> I think it makes sense to either put it in parquet-format or its own repo.
> I think the main thing is that we want this to be self-contained so that it
> can be used broadly.
>
> On Mon, Aug 26, 2024 at 12:56 AM Fokko Driesprong
> <fokko-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:
>
> > I suggested a separate repo in another thread, but I prefer to merge it
> > into parquet-format, for the reasons that Gábor already pointed out.
> >
> > > It seems reasonable to put the java implementation in the parquet-java
> >
> > I also agree with that, it should be just a module in the Maven project.
> >
> > Kind regards,
> > Fokko
> >
> > On Mon, 26 Aug 2024 at 09:06, Gang Wu
> > <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >
> > > I thought a separate repo is considered for hosting variant
> > > implementations for different languages. For the variant spec,
> > > it makes sense to be moved to the parquet-format repository.
> > > Considering the fact that parquet implementations are scattered
> > > in different repos (parquet-java, arrow-cpp, arrow-rs, etc.), it seems
> > > reasonable to put the java implementation in the parquet-java, if
> > > we can manage the release cycle to meet the expectation of
> > > downstream projects.
> > >
> > > Best,
> > > Gang
> > >
> > > On Mon, Aug 26, 2024 at 2:59 PM Gábor Szádovszky <ga...@apache.org> wrote:
> > >
> > > > Sorry, I've created another head for the thread. Let me put it back here.
> > > >
> > > > I think Parquet-format is a good place for the spec of Variant.
> > > >
> > > > After having the specs in Parquet-format it does not have too much
> > > > difference than any other Parquet features. The shredding depends on the
> > > > related type system. It is currently specified for Parquet directly.
> > > > Do we think there will be significant amounts of code that would be
> > > > independent from Parquet? If not, I don't think we'll need a separate
> > > > repo for the implementations. We did not do similar things for other
> > > > Parquet features.
> > > > If we think it makes sense we can have a separate module in parquet-java
> > > > that may only depend on other low level parquet modules (like
> > > > parquet-format but surely not hadoop). This way any java-based projects
> > > > can easily use it.
> > > > What do you think?
> > > >
> > > > Gabor
> > > >
> > > > Gang Wu <ust...@gmail.com> wrote (on Mon, 26 Aug 2024, 8:51):
> > > >
> > > > > A separate repo for variant type makes sense to me. And I don't think
> > > > > we need to have two reference implementations ready before the
> > > > > adoption because it is already a released spec.
> > > > >
> > > > > > Is the intent to release it independently of the Parquet-format spec?
> > > > > > I see the Variant type also has a version.
> > > > >
> > > > > IIUC, the version field in the variant spec advises how variant data is
> > > > > encoded. If this is the case, we should bump parquet-format version
> > > > > when a new encoding scheme is introduced.
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Sat, Aug 24, 2024 at 8:43 AM Julien Le Dem <jul...@apache.org> wrote:
> > > > >
> > > > > > (Note: I am also catching up on the threads linked in the email)
> > > > > >
> > > > > > On Fri, Aug 23, 2024 at 5:38 PM Julien Le Dem <jul...@apache.org> wrote:
> > > > > >
> > > > > > > I am in favor of making this a separate artifact that other projects
> > > > > > > can depend on without pulling extra dependencies they might not want.
> > > > > > > What do others think about a separate repo?
> > > > > > > Is the intent to release it independently of the Parquet-format spec?
> > > > > > > I see the Variant type also has a version.
> > > > > > > Julien
> > > > > > >
> > > > > > > On Fri, Aug 23, 2024 at 4:31 PM Daniel Weeks <dwe...@apache.org> wrote:
> > > > > > >
> > > > > > > > Julien,
> > > > > > > >
> > > > > > > > I think there's interest in supporting multiple language implementations
> > > > > > > > for variant (java/scala/cpp/rust/etc), so we might want to consider having
> > > > > > > > a 'parquet-varient' repository to house the spec and language
> > > > > > > > implementations. That might also help to keep them aligned, but open to
> > > > > > > > other suggestions.
> > > > > > > >
> > > > > > > > -Dan
> > > > > > > >
> > > > > > > > On Fri, Aug 23, 2024 at 3:07 PM Julien Le Dem <jul...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > > I think it is great that we are converging on a Variant type.
> > > > > > > > > For the parquet-java implementation, it looks like it could be as easy
> > > > > > > > > as importing the spark implementation [1]?
> > > > > > > > > I'm not sure this is actually blocking anything as I'm assuming this
> > > > > > > > > gets stored in a binary type today.
> > > > > > > > > Is there an existing Cpp implementation?
> > > > > > > > > Are there other existing types defined somewhere else solving that
> > > > > > > > > same need that we should be paying attention to?
> > > > > > > > > (or should become compatible with this)
> > > > > > > > > Best
> > > > > > > > > Julien
> > > > > > > > > [1]
> > > > > > > > > https://github.com/apache/spark/tree/master/common/variant/src/main/java/org/apache/spark/types/variant
> > > > > > > > >
> > > > > > > > > On Fri, Aug 23, 2024 at 2:17 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > > Do we have volunteers to implement it in Parquet-java + another OSS
> > > > > > > > > > > implementation?
> > > > > > > > > >
> > > > > > > > > > I don't think that should be a blocker for incorporating. I'd be inclined
> > > > > > > > > > to do something like mark it as experimental or similar in the spec until
> > > > > > > > > > the reference impls are done.
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 23, 2024 at 10:32 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > I'm in favor of this, but wondering on the logistics. Do we have
> > > > > > > > > > > volunteers to implement it in Parquet-java + another OSS implementation
> > > > > > > > > > > or are we going to bypass this requirement for now?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Micah
> > > > > > > > > > >
> > > > > > > > > > > On Friday, August 23, 2024, Ryan Blue <b...@databricks.com.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Aug 23, 2024 at 12:30 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > +1
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Aug 23, 2024 at 8:51 AM Nong Li <non...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Aug 23, 2024 at 12:57 PM Jan Finis <jpfi...@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would also appreciate having native Variant support in Parquet.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, 23 Aug 2024 at 12:10, Fokko Driesprong <fo...@apache.org> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hey Gang,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for raising this. +1 from my end.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For context, as Gang mentioned, when proposing to add a Variant Type to
> > > > > > > > > > > > > > > > Iceberg <https://github.com/apache/iceberg/issues/10392>, one of the
> > > > > > > > > > > > > > > > future goals was to integrate more closely with Parquet, and having the
> > > > > > > > > > > > > > > > spec at Parquet will help to speed this up.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > > > Fokko
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, 23 Aug 2024 at 11:37, Gábor Szádovszky <ga...@apache.org> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Gang,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for bringing this up.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think that if Variant type would have come up earlier (before
> > > > > > > > > > > > > > > > > iceberg/arrow), its natural place would have been at the file format
> > > > > > > > > > > > > > > > > level as any other types. The communities started discussing where it
> > > > > > > > > > > > > > > > > should be placed because now we have different type systems at
> > > > > > > > > > > > > > > > > different places.
> > > > > > > > > > > > > > > > > Also, the current spec of Variant makes it more or less independent
> > > > > > > > > > > > > > > > > from the Parquet file format.
> > > > > > > > > > > > > > > > > However, even at Parquet level, we would need at least an additional
> > > > > > > > > > > > > > > > > Logical type to help handle Variant type by the systems
> > > > > > > > > > > > > > > > > reading/writing Parquet.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > To summarize my opinion, +1 for having the whole Variant spec in
> > > > > > > > > > > > > > > > > Parquet format.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > Gabor
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote
> > > > > > > > > > > > > > > > > (on Fri, 23 Aug 2024, 11:18):
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Apache Iceberg is adding variant type support [1][2] by adopting the
> > > > > > > > > > > > > > > > > > variant spec [3] from Apache Spark.
> > > > > > > > > > > > > > > > > > As the proposal is getting mature, both Iceberg [4] and Spark [5]
> > > > > > > > > > > > > > > > > > communities are discussing moving the variant type to Parquet repo
> > > > > > > > > > > > > > > > > > to avoid divergence. Moving it into Parquet makes the variant spec
> > > > > > > > > > > > > > > > > > engine and table format agnostic, which may encourage wider adoption.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > What do people from Parquet community think?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > [1] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > > > > > > > > > > > > > > > > [2] https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq
> > > > > > > > > > > > > > > > > > [3] https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md
> > > > > > > > > > > > > > > > > > [4] https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
> > > > > > > > > > > > > > > > > > [5] https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > Gang
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Ryan Blue
> > > > > > > > > > > > Databricks