I would agree that Parquet seems like a reasonable option in terms of fit and neutrality.
I'd love to get any feedback from others, but assuming there's general consensus, I feel like we need to engage with those communities and have an open conversation about the discussions we've had and why we feel this is important to address any governance/neutrality concerns. Others already mentioned this, but I agree there's added value that other projects could benefit from variant, so standardizing at the Parquet level makes this less opaque to the rest of the ecosystem.

-Dan

On Thu, Aug 15, 2024 at 11:31 AM Russell Spitzer <[email protected]> wrote:

> I support that whole-heartedly. Parquet would be a great neutral location
> for the spec.
>
> On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue <[email protected]>
> wrote:
>
>> I think it's a good idea to reach out to the Spark community and make
>> sure we are in agreement. Up until now I think we've been thinking more
>> abstractly about what makes sense but before we make any decision we should
>> definitely collaborate with the other communities.
>>
>> I'd also like to suggest an alternative for where this spec should be
>> maintained that would hopefully allow us to avoid copying and maintaining
>> multiple places. As we've already discussed, this is not an easy spec to
>> find a home for because there are alternative projects that are all
>> interested. Since this is a cross-engine type, Spark may not be ideal. At
>> the same time, Delta already supports the variant spec so there's a similar
>> problem maintaining this in Iceberg.
>>
>> I think that a reasonable and neutral option is to see if the Parquet
>> community would be willing to host the spec and library. That fits with the
>> spec because subcolumnarization is written assuming Parquet is the storage.
>> It would also be the best place for broad compatibility because anyone
>> using Parquet would have a strong motivation to standardize on the same
>> encoding.
>>
>> Initially, I pushed for Iceberg instead of Parquet because we may want to
>> have the same variant encoding in ORC, but what made me change my mind is
>> that every layer (file format, table format, engine) has that problem and
>> I've heard the concern about neutrality raised multiple times while
>> discussing this question internally.
>>
>> I think the Parquet community is the most neutral option available. Would
>> anyone else support asking the Spark and Parquet communities to maintain
>> the variant spec in Parquet?
>>
>> Ryan
>>
>> On Thu, Aug 15, 2024 at 8:34 AM Xuanwo <[email protected]> wrote:
>>
>>> From the iceberg-rust perspective, it could be extremely challenging to
>>> keep track of both the Spark and Iceberg specifications. Having a single
>>> source of truth would be much better. I believe this change will also
>>> benefit Delta Lake if they implement the same approach. Perhaps we can try
>>> contacting them to initiate such a project?
>>>
>>> On Thu, Aug 15, 2024, at 23:17, Gang Wu wrote:
>>>
>>> +1 on posting this discussion to dev@spark ML
>>>
>>> > I don't think there is anything that would stop us from moving to a joint project in the future
>>>
>>> My concern is that if we don't do this from day 1, we will never ever do
>>> this.
>>>
>>> Best,
>>> Gang
>>>
>>> On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer <[email protected]> wrote:
>>>
>>> That's fair @Micah, so far all the discussions have been direct and off
>>> the dev list. Would you like to make the request on the public Spark Dev
>>> list? I would be glad to co-sign, and I can also draft up a quick email if
>>> you don't have time.
>>>
>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>> I agree that it would be beneficial to make a sub-project, the main
>>> problem is political and not logistic. I've been asking for movement from
>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>>
>>>
>>> I just wanted to double check that these issues were brought directly to
>>> the Spark community (i.e. a discussion thread on the Spark developer
>>> mailing list) and not via backchannels.
>>>
>>> I'm not sure the outcome would be different and I don't think this
>>> should block forking the spec, but we should make sure that the decision is
>>> publicly documented within both communities.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <[email protected]> wrote:
>>>
>>> @Gang Wu
>>> I agree that it would be beneficial to make a sub-project, the main
>>> problem is political and not logistic. I've been asking for movement from
>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>> I don't think there is anything that would stop us from moving to a joint
>>> project in the future and if you know of some way of encouraging that
>>> movement from other relevant parties I would be glad to collaborate in
>>> doing that. One thing that I don't want to do is have the Iceberg project
>>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>>
>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <[email protected]> wrote:
>>>
>>> I’m on board with copying the spec into our repository. However, as
>>> we’ve talked about, it’s not just a straightforward copy; there are already
>>> some divergences. Some of them are under discussion. Iceberg is definitely
>>> the best place for these specs. Engines like Trino and Flink can then rely
>>> on the Iceberg specs as a solid foundation.
>>>
>>> Yufei
>>>
>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]> wrote:
>>>
>>> Sorry for chiming in late.
>>>
>>> From the discussion in
>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>>> don't quite understand why it is logistically complicated to create a
>>> sub-project to hold the variant spec and impl.
>>>
>>> IMHO, copying the variant type spec into Apache Iceberg has some
>>> deficiencies:
>>> - It is a burden to update two repos if there is a variant type spec
>>> change and will likely result in deviation if some changes do not reach
>>> agreement from both parties.
>>> - Implementers are required to keep an eye on both specs (considering
>>> proprietary engines where both Iceberg and Delta are supported).
>>> - Putting the spec and impl of variant type in Iceberg repo does lose
>>> the opportunity for better native support from file formats like Parquet
>>> and ORC.
>>>
>>> I'm not sure if it is possible to create a separate project (e.g.
>>> apache/variant-type) to make it a single point of truth. We can learn from
>>> the experience of Apache Arrow. In this fashion, different engines, table
>>> formats and file formats can follow the same spec and are free to depend on
>>> the reference implementations from apache/variant-type or implement their
>>> own.
>>>
>>> Best,
>>> Gang
>>>
>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <[email protected]> wrote:
>>>
>>> +1 for copying the spec into our repository, I think we need to own it
>>> fully as a part of the table spec, and we can build compatibility through
>>> tests.
>>>
>>> -Jack
>>>
>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <[email protected]> wrote:
>>>
>>> I'm not really in favor of linking and annotating as that just makes
>>> things more complicated and still is essentially forking just with more
>>> steps. If we just track our annotations / modifications to a single
>>> commit/version then we have the same issue again but now you have to go to
>>> multiple sources to get the actual Spec. *In addition, our very copy of
>>> the Spec is going to require new types which don't exist in the Spark Spec
>>> which necessarily means diverging.
*We will need to take up new
>>> primitive IDs (as noted in my first email)
>>>
>>> The other issue I have is I don't think the Spark Spec is really going
>>> through a thorough review process from all members of the Spark community,
>>> I believe it probably should have gone through the SPIP but instead seems
>>> to have been merged without broad community involvement.
>>>
>>> The only way to truly avoid diverging is to only have a single copy of
>>> the spec; in our previous discussions the vast majority of the Apache
>>> Iceberg community wanted it to exist here.
>>>
>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <[email protected]> wrote:
>>>
>>> I'm really excited about the introduction of variant type to Iceberg,
>>> but I want to raise concerns about forking the spec.
>>>
>>> I feel like preemptively forking would create the situation where we end
>>> up diverging because there's little reason to work with both communities to
>>> evolve in a way that benefits everyone.
>>>
>>> I would much rather point to a specific version of the spec and annotate
>>> any variance in Iceberg's handling. This would allow us to continue
>>> without dividing the communities.
>>>
>>> If at any point there are irreconcilable differences, I would support
>>> forking, but I don't feel like that should be the initial step.
>>>
>>> No one is excited about the possibility that the physical
>>> representations end up diverging, but it feels like we're setting
>>> ourselves up for that exact scenario.
>>>
>>> -Dan
>>>
>>>
>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <[email protected]>
>>> wrote:
>>>
>>> +1 to what's already being said here. It is good to copy the spec to
>>> Iceberg and add context that's specific to Iceberg, but at the same time,
>>> we should maintain compatibility.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <[email protected]> wrote:
>>>
>>> +1 to copy the spec into our repository.
I think the best way to keep
>>> compatibility is building integration tests.
>>>
>>> Thanks,
>>> Manu
>>>
>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <[email protected]>
>>> wrote:
>>>
>>> Thanks Russell and Aihua for pushing Variant support!
>>>
>>> Given the differences between the supported types and the lack of
>>> interest from the other project, I think it is reasonable to duplicate the
>>> specification to our repository.
>>> I would give very strong emphasis on sticking to the Spark spec as much
>>> as possible, to keep compatibility as much as possible. Maybe even revert
>>> to a shared specification if the situation changes.
>>>
>>> Thanks,
>>> Peter
>>>
>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <[email protected]> wrote:
>>>
>>> Thanks Russell for bringing this up.
>>>
>>> This is the main blocker to move forward with the Variant support in
>>> Iceberg and hopefully we can have a consensus. To me, it also makes
>>> more sense to move the spec into Iceberg rather than have the Spark engine
>>> own it while we try to keep it compatible with the Spark spec.
>>>
>>> Thanks,
>>> Aihua
>>>
>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <[email protected]> wrote:
>>>
>>> Hi Y’all,
>>>
>>> We’ve hit a bit of a roadblock with the Variant Proposal: while we were
>>> hoping to move the Variant and Shredding specifications from Spark into
>>> Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately,
>>> I think we have a number of issues with just linking to the Spark project
>>> directly from within Iceberg and *I believe we need to copy the
>>> specifications into our repository*.
>>>
>>> There are a few reasons why I think this is necessary:
>>>
>>> First, we have a divergence of types already.
The Spark Specification
>>> already includes types which Iceberg has no definition for (19, 20
>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>> - Interval Types) and Iceberg already has a type which is not included
>>> within the Spark Specification (Time) and will soon have more with
>>> TimestampNS and Geo.
>>>
>>> Second, we would like to make sure that Spark is not a hard dependency
>>> for other engines. We are working with several implementers of the Iceberg
>>> spec and it has previously been agreed that it would be best if the source
>>> of truth for Variant existed in an engine and file format neutral location.
>>> The Iceberg project has a good open model of governance and, as we have
>>> seen so far discussing Variant
>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>> open and active collaboration. This would also help as we can strictly
>>> version our changes in line with the rest of the Iceberg spec.
>>>
>>> Third, the Shredding spec is not quite finished and requires some group
>>> analysis and discussion before we commit it. I think again the Iceberg
>>> community is probably the right place for this to happen as we have already
>>> started discussions here on these topics.
>>>
>>> For these reasons I think we should go with a direct copy of the
>>> existing specification from the Spark Project and move ahead with our
>>> discussions and modifications within Iceberg. That said, *I do not want
>>> to diverge if possible from the Spark proposal*. For example, although
>>> we do not use the Interval types above, I think we should *not* reuse
>>> those type ids within our spec. Iceberg's Variant Spec types 19 and 20
>>> would remain unused along with any other types we think are not applicable.
>>> We should strive whenever possible to allow for compatibility.
>>>
>>> In the interest of moving forward with this proposal I am hoping to see
>>> if anyone in the community objects to this plan going forward or has a
>>> better alternative.
>>>
>>> As always I am thankful for your time and am eager to hear back from
>>> everyone,
>>> Russ
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
