Re: [DISCUSS] Variant Spec Location

Steve Loughran Wed, 28 Aug 2024 12:50:31 -0700

> I think Parquet is a better place for the variant spec than Arrow.
> Parquet is upstream of nearly every project (other than ORC)


log4j is that, but it doesn't mean that it is the right place.

What is key is: what does it mean for Parquet to have a variant type in
there? Does it actually make sense? And if so: how well do the proposals
actually match up with what suits that format?

Turning up to any project and saying "here is the format you have to add
for our reasons" doesn't seem right, even if there is a lot of overlap in
both committers and actual code.

That said, I do think it's the right place for it. Unless there is a large
body of log4j committers who have different opinions.

steve


On Wed, 28 Aug 2024 at 20:07, Aihua Xu <aihua...@snowflake.com.invalid>
wrote:

>  As the discussions in the Spark community (
> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj) and in
> the Parquet community (
> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z)
> continue to decide the spec location, I would like to discuss some of the
> implementation details for shredded subcolumns in Iceberg and that part
> shouldn't be affected by the ongoing discussion on the spec location.
>
> I'm capturing these in the doc Iceberg Variant Shredding#Iceberg-Impl
> <https://docs.google.com/document/d/1JeBt4NIju08jQ2AbludiK-U0M9ISIgysP7fUDWtv7rg/edit#heading=h.afmmnvloe5sz>.
> To summarize:
>
>    - When the subcolumns in a Variant are shredded, collect the subcolumn
>    statistics similar to regular columns.
>    - Add subcolumn stats fields in DataFile indexed by (columnId,
>    subColumnPath).
>    - Add an NDV (number of distinct values) count, which I don't see in
>    the regular column stats.
>    - Add the (columnId, subColumnPath) -> subcolumn_stats mapping, compared
>    to the multiple columnId -> column_stat maps used for regular columns.
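
The summary above can be pictured with a rough sketch. This is only an
illustration under stated assumptions, not the layout from the linked doc:
the class names, field names and the helper collect_subcolumn_stats are all
hypothetical, and NDV is computed exactly here, whereas a real writer would
more likely use an approximate sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical key for a shredded subcolumn: (columnId, subColumnPath).
SubColumnKey = Tuple[int, str]


@dataclass
class SubColumnStats:
    """Hypothetical per-subcolumn statistics; field names are illustrative."""
    value_count: int = 0
    null_count: int = 0
    lower_bound: Optional[int] = None
    upper_bound: Optional[int] = None
    ndv: Optional[int] = None  # number of distinct non-null values


@dataclass
class DataFileStats:
    """Hypothetical stand-in for the stats portion of a DataFile."""
    # Regular columns: one map per statistic, each keyed by columnId.
    value_counts: Dict[int, int] = field(default_factory=dict)
    null_value_counts: Dict[int, int] = field(default_factory=dict)
    # Shredded subcolumns: one map keyed by (columnId, subColumnPath).
    subcolumn_stats: Dict[SubColumnKey, SubColumnStats] = field(default_factory=dict)


def collect_subcolumn_stats(
    column_id: int, path: str, values: List[Optional[int]]
) -> Tuple[SubColumnKey, SubColumnStats]:
    """Compute stats for one shredded subcolumn from its (nullable) values."""
    non_null = [v for v in values if v is not None]
    stats = SubColumnStats(
        value_count=len(values),
        null_count=len(values) - len(non_null),
        lower_bound=min(non_null) if non_null else None,
        upper_bound=max(non_null) if non_null else None,
        ndv=len(set(non_null)),  # exact count; an assumption for this sketch
    )
    return (column_id, path), stats


# Usage: record stats for a made-up subcolumn "event.ts" of column 3.
data_file = DataFileStats()
key, stats = collect_subcolumn_stats(3, "event.ts", [5, None, 2, 5])
data_file.subcolumn_stats[key] = stats
```

Keying one map by the (columnId, subColumnPath) pair keeps all statistics
for a subcolumn together, while the regular-column maps stay untouched.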
>
>  Please help take a look. Also let me know if I should separate this to
> another thread instead.
>
> Thanks,
> Aihua
>
>
> On Fri, Aug 23, 2024 at 5:51 PM Julien Le Dem <jul...@apache.org> wrote:
>
>> Thank you Gang, that sounds like a good idea to me as well
>>
>> On Fri, Aug 23, 2024 at 8:47 AM Aihua Xu <aihua...@snowflake.com.invalid>
>> wrote:
>>
>>> Thanks Gang for initiating the discussion.
>>>
>>> On Fri, Aug 23, 2024 at 2:22 AM Gang Wu <ust...@gmail.com> wrote:
>>>
>>>> Thanks Aihua!
>>>>
>>>> I've started the discussion in dev@parquet:
>>>> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z
>>>>
>>>> Best,
>>>> Gang
>>>>
>>>> On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu <aihua...@snowflake.com>
>>>> wrote:
>>>>
>>>>> From this thread
>>>>> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj,
>>>>> it seems the Spark community is leaning toward moving it to Parquet.
>>>>>
>>>>> Gang, can you help start a discussion in the Parquet community on
>>>>> adopting and maintaining such a Variant spec?
>>>>>
>>>>> On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <c...@hagenlocher.org>
>>>>> wrote:
>>>>>
>>>>>> This seems to straddle that line, in that you can also view this as a
>>>>>> way to represent semi-structured data in a manner that allows for more
>>>>>> efficient querying and computation by breaking out some of its components
>>>>>> into a more structured form.
>>>>>>
>>>>>> (I also happen to want a canonical Arrow representation for variant
>>>>>> data, as this type occurs in many databases but doesn't have a great
>>>>>> representation today in ADBC results. That's why I filed [Format]
>>>>>> Consider adding an official variant type to Arrow · Issue #42069 ·
>>>>>> apache/arrow (github.com)
>>>>>> <https://github.com/apache/arrow/issues/42069>. Of course, there's
>>>>>> no specific reason why a canonical Arrow representation for variants must
>>>>>> align with Spark and/or Iceberg.)
>>>>>>
>>>>>> -Curt
>>>>>>
>>>>>> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <anto...@python.org>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Ah, thanks. I've tried to find a rationale and ended up on
>>>>>>> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 .
>>>>>>> Is it a good description of what you're after?
>>>>>>>
>>>>>>> If so, then I don't think Arrow is a good match. This seems mostly to be
>>>>>>> a marshalling format for semi-structured data (like Avro?). Arrow data
>>>>>>> types are meant to be in a representation ideal for querying and
>>>>>>> computation, rather than transport and storage.
>>>>>>>
>>>>>>> This could be developed separately and then be represented in Arrow
>>>>>>> using an extension type (perhaps a canonical one as in
>>>>>>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>>>>>>>
>>>>>>> What do other Arrow developers think?
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>>
>>>>>>>
>>>>>>> On 22/08/2024 at 10:45, Gang Wu wrote:
>>>>>>> > Sorry for the inconvenience.
>>>>>>> >
>>>>>>> > This is the permalink for the discussion:
>>>>>>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
>>>>>>> >
>>>>>>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> >>
>>>>>>> >> Hi Gang,
>>>>>>> >>
>>>>>>> >> Sorry, but can you give a pointer to the start of this discussion thread
>>>>>>> >> in a readable format (for example a mailing-list archive)? It appears
>>>>>>> >> that dev@arrow wasn't cc'ed from the start and that can make it
>>>>>>> >> difficult to understand what this is about.
>>>>>>> >>
>>>>>>> >> Regards
>>>>>>> >>
>>>>>>> >> Antoine.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On 22/08/2024 at 08:32, Gang Wu wrote:
>>>>>>> >>> It seems that we have reached a consensus to some extent that there
>>>>>>> >>> should be a new home for the variant spec. The pending question
>>>>>>> >>> is whether Parquet or Arrow is a better choice. As a committer in the
>>>>>>> >>> Arrow, Parquet and ORC communities, I am neutral between them and happy to
>>>>>>> >>> help with the move once a decision has been made.
>>>>>>> >>>
>>>>>>> >>> Should we start a vote to move forward?
>>>>>>> >>>
>>>>>>> >>> Best,
>>>>>>> >>> Gang
>>>>>>> >>>
>>>>>>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <
>>>>>>> emkornfi...@gmail.com>
>>>>>>> >>> wrote:
>>>>>>> >>>
>>>>>>> >>>>>
>>>>>>> >>>>> That being said, I think the most important consideration for now is where
>>>>>>> >>>>> are the current maintainers / contributors to the variant type. If most of
>>>>>>> >>>>> them are already PMC members / committers on a project, it becomes a bit
>>>>>>> >>>>> easier. Otherwise if there isn't much overlap with a project's existing
>>>>>>> >>>>> governance, I worry there could be a bit of friction. How many active
>>>>>>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> I think this is the key question. What are the requirements around
>>>>>>> >>>> governance?  I've seen some tangential messaging here but I'm not clear on
>>>>>>> >>>> what everyone expects.
>>>>>>> >>>>
>>>>>>> >>>> I think for a lot of the other concerns my view is that the exact project
>>>>>>> >>>> does not really matter (and choosing a project with mature cross-language
>>>>>>> >>>> testing infrastructure, or committing to building it, is critical). IIUC we
>>>>>>> >>>> are talking about the following artifacts:
>>>>>>> >>>>
>>>>>>> >>>> 1.  A stand-alone specification document (this can be hosted anyplace).
>>>>>>> >>>> 2.  A set of language bindings with minimal dependencies that can be
>>>>>>> >>>> consumed downstream (again, as long as dependencies are managed carefully
>>>>>>> >>>> any project can host these).
>>>>>>> >>>> 3.  Potential integration where appropriate into file format libraries to
>>>>>>> >>>> support shredding (but as of now this is being bypassed by using
>>>>>>> >>>> conventions anyway).  My impression is that at least for Parquet there has
>>>>>>> >>>> been a proliferation of vectorized readers across different projects, so
>>>>>>> >>>> I'm not clear how much standardization in parquet-java could help here.
>>>>>>> >>>>
>>>>>>> >>>> To respond to some other questions:
>>>>>>> >>>>
>>>>>>> >>>>> Arrow is not used as Spark's in-memory model, nor Trino and others so those
>>>>>>> >>>>> existing relationships aren't there. I also worry that differences in
>>>>>>> >>>>> approaches would make it difficult later on.
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> While Arrow is not in the core memory model, for Spark I believe it is
>>>>>>> >>>> still used for IPC for things like Java<->Python. Trino also consumes Arrow
>>>>>>> >>>> libraries today to support things like Snowflake/BigQuery federation. But I
>>>>>>> >>>> think this is minor because, as mentioned above, I think the functional
>>>>>>> >>>> libraries would be relatively stand-alone.
>>>>>>> >>>>
>>>>>>> >>>>> Do we think it could be introduced as a canonical extension arrow type?
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> I believe it can be. I think there are probably different layouts that can
>>>>>>> >>>> be supported:
>>>>>>> >>>>
>>>>>>> >>>> 1.  A struct with two variable-width bytes columns (metadata and value data
>>>>>>> >>>> are stored separately and each entry has a 1:1 relationship).
>>>>>>> >>>> 2.  Shredded (shredded according to the same convention as Parquet). I
>>>>>>> >>>> would need to double check, but I don't think Arrow would have problems
>>>>>>> >>>> here; REE (run-end encoding) would likely be required to make this
>>>>>>> >>>> efficient (i.e. sparse value support is important).
>>>>>>> >>>>
>>>>>>> >>>> In both cases the main complexity is providing the necessary functions for
>>>>>>> >>>> manipulation.
>>>>>>> >>>>
>>>>>>> >>>> Thanks,
>>>>>>> >>>> Micah
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <
>>>>>>> will.jones...@gmail.com>
>>>>>>> >>>> wrote:
>>>>>>> >>>>
>>>>>>> >>>>> In being more engine and format agnostic, I agree the Arrow project might
>>>>>>> >>>>> be a good host for such a specification. It seems like we want to move away
>>>>>>> >>>>> from hosting in Spark to make it engine agnostic. But moving into Iceberg
>>>>>>> >>>>> might make it less format agnostic, as I understand multiple formats might
>>>>>>> >>>>> want to implement this. I'm not intimately familiar with the state of this,
>>>>>>> >>>>> but I believe Delta Lake would like to be aligned with the same format as
>>>>>>> >>>>> Iceberg. In addition, the Lance format (which I work on) will eventually
>>>>>>> >>>>> be interesting as well. It seems as bad to me to attach this
>>>>>>> >>>>> specification to a particular table format as to a particular query
>>>>>>> >>>>> engine.
>>>>>>> >>>>>
>>>>>>> >>>>> That being said, I think the most important consideration for now is where
>>>>>>> >>>>> are the current maintainers / contributors to the variant type. If most of
>>>>>>> >>>>> them are already PMC members / committers on a project, it becomes a bit
>>>>>>> >>>>> easier. Otherwise if there isn't much overlap with a project's existing
>>>>>>> >>>>> governance, I worry there could be a bit of friction. How many active
>>>>>>> >>>>> contributors are there from Iceberg? And how about from Arrow?
>>>>>>> >>>>>
>>>>>>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension type for
>>>>>>> >>>>> the binary variant type. I've been experimenting with a DataFusion
>>>>>>> >>>>> extension that operates on this [1], and already have some ideas on how
>>>>>>> >>>>> such an extension type might be defined. I'm not yet caught up on the
>>>>>>> >>>>> shredded specification, but I think having just the binary format would be
>>>>>>> >>>>> beneficial for in-memory analytics, which are most relevant to Arrow. I'll
>>>>>>> >>>>> be creating a separate thread on the Arrow ML about this soon.
>>>>>>> >>>>>
>>>>>>> >>>>> Best,
>>>>>>> >>>>>
>>>>>>> >>>>> Will Jones
>>>>>>> >>>>>
>>>>>>> >>>>> [1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com>
>>>>>>> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>>> + dev@arrow
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thanks for all the valuable suggestions! I am inclined to Micah's idea that
>>>>>>> >>>>>> Arrow might be a better host compared to Parquet.
>>>>>>> >>>>>>
>>>>>>> >>>>>> To give more context, I am taking the initiative to add the geometry type
>>>>>>> >>>>>> to both Parquet and ORC. I'd like to do the same thing for variant type in
>>>>>>> >>>>>> that variant type is engine and file format agnostic. This does mean that
>>>>>>> >>>>>> Parquet might not be the neutral place to hold the variant spec.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Best,
>>>>>>> >>>>>> Gang
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <
>>>>>>> jingsongl...@gmail.com>
>>>>>>> >>>>>> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>>> Thanks all for your discussion.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> The Apache Paimon community is also considering support for this
>>>>>>> >>>>>>> Variant type. Without a doubt, we hope to maintain consistency with
>>>>>>> >>>>>>> Iceberg.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Not only the Paimon community, but also various computing engines need
>>>>>>> >>>>>>> to adapt to this type, such as Flink and StarRocks. We also hope to
>>>>>>> >>>>>>> encourage them to adapt to this type.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> It is worth noting that we also need to standardize many functions
>>>>>>> >>>>>>> related to it.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> A neutral place to maintain it is a great choice.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> - As Gang Wu said, a standalone project is good, just like RoaringBitmap
>>>>>>> >>>>>>> [1].
>>>>>>> >>>>>>> - As Ryan said, the Parquet community is a neutral option too.
>>>>>>> >>>>>>> - As Micah said, Arrow is also an option.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> [1] https://github.com/RoaringBitmap
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Best,
>>>>>>> >>>>>>> Jingsong
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <
>>>>>>> >>>> emkornfi...@gmail.com
>>>>>>> >>>>>>
>>>>>>> >>>>>>> wrote:
>>>>>>> >>>>>>>>>
>>>>>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct and off
>>>>>>> >>>>>>>>> the dev list. Would you like to make the request on the public Spark Dev
>>>>>>> >>>>>>>>> list? I would be glad to co-sign; I can also draft up a quick email if you
>>>>>>> >>>>>>>>> don't have time.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> I think once we come to consensus, if you have bandwidth, I think the
>>>>>>> >>>>>>>> message might be better coming from you, as you have more context on some
>>>>>>> >>>>>>>> of the non-public conversations, the requirements from an Iceberg
>>>>>>> >>>>>>>> perspective on governance and the blockers that were encountered.  If
>>>>>>> >>>>>>>> details on the conversations can't be shared, (i.e. we are starting from
>>>>>>> >>>>>>>> scratch) it seems like suggesting a new project via SPIP might be the way
>>>>>>> >>>>>>>> forward.  I'm happy to help with that if it is useful but I would guess
>>>>>>> >>>>>>>> Aihua or Tyler might be in a better place to start as it seems they have
>>>>>>> >>>>>>>> done more serious thinking here.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm happy to help
>>>>>>> >>>>>>>> support the effort in those communities.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Thanks,
>>>>>>> >>>>>>>> Micah
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <
>>>>>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>> >>>>>>>>>
>>>>>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct and off
>>>>>>> >>>>>>>>> the dev list. Would you like to make the request on the public Spark Dev
>>>>>>> >>>>>>>>> list? I would be glad to co-sign; I can also draft up a quick email if you
>>>>>>> >>>>>>>>> don't have time.
>>>>>>> >>>>>>>>>
>>>>>>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <
>>>>>>> >>>>>> emkornfi...@gmail.com>
>>>>>>> >>>>>>> wrote:
>>>>>>> >>>>>>>>>>>
>>>>>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the main
>>>>>>> >>>>>>>>>>> problem is political and not logistic. I've been asking for movement from
>>>>>>> >>>>>>>>>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>>>>>> >>>>>>>>>>
>>>>>>> >>>>>>>>>>
>>>>>>> >>>>>>>>>> I just wanted to double check that these issues were brought directly
>>>>>>> >>>>>>>>>> to the Spark community (i.e. a discussion thread on the Spark developer
>>>>>>> >>>>>>>>>> mailing list) and not via backchannels.
>>>>>>> >>>>>>>>>>
>>>>>>> >>>>>>>>>> I'm not sure the outcome would be different and I don't think this
>>>>>>> >>>>>>>>>> should block forking the spec, but we should make sure that the decision is
>>>>>>> >>>>>>>>>> publicly documented within both communities.
>>>>>>> >>>>>>>>>>
>>>>>>> >>>>>>>>>> Thanks,
>>>>>>> >>>>>>>>>> Micah
>>>>>>> >>>>>>>>>>
>>>>>>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>>>>>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>> >>>>>>>>>>>
>>>>>>> >>>>>>>>>>> @Gang Wu
>>>>>>> >>>>>>>>>>>
>>>>>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the main
>>>>>>> >>>>>>>>>>> problem is political and not logistic. I've been asking for movement from
>>>>>>> >>>>>>>>>>> other relevant projects for a month and we simply haven't gotten anywhere.
>>>>>>> >>>>>>>>>>> I don't think there is anything that would stop us from moving to a joint
>>>>>>> >>>>>>>>>>> project in the future and if you know of some way of encouraging that
>>>>>>> >>>>>>>>>>> movement from other relevant parties I would be glad to collaborate in
>>>>>>> >>>>>>>>>>> doing that. One thing that I don't want to do is have the Iceberg project
>>>>>>> >>>>>>>>>>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>>>>>> >>>>>>>>>>>
>>>>>>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <
>>>>>>> flyrain...@gmail.com
>>>>>>> >>>>>
>>>>>>> >>>>>>> wrote:
>>>>>>> >>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>> I’m on board with copying the spec into our repository. However, as
>>>>>>> >>>>>>>>>>>> we’ve talked about, it’s not just a straightforward copy; there are already
>>>>>>> >>>>>>>>>>>> some divergences. Some of them are under discussion. Iceberg is definitely
>>>>>>> >>>>>>>>>>>> the best place for these specs. Engines like Trino and Flink can then rely
>>>>>>> >>>>>>>>>>>> on the Iceberg specs as a solid foundation.
>>>>>>> >>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>> Yufei
>>>>>>> >>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <
>>>>>>> ust...@gmail.com>
>>>>>>> >>>>> wrote:
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>> Sorry for chiming in late.
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>> From the discussion in
>>>>>>> >>>>>>>>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't
>>>>>>> >>>>>>>>>>>>> quite understand why it is logistically complicated to create a sub-project
>>>>>>> >>>>>>>>>>>>> to hold the variant spec and impl.
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some
>>>>>>> >>>>>>>>>>>>> deficiencies:
>>>>>>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant type
>>>>>>> >>>>>>>>>>>>> spec change and will likely result in deviation if some changes do not
>>>>>>> >>>>>>>>>>>>> reach agreement from both parties.
>>>>>>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs
>>>>>>> >>>>>>>>>>>>> (considering proprietary engines where both Iceberg and Delta are
>>>>>>> >>>>>>>>>>>>> supported).
>>>>>>> >>>>>>>>>>>>> - Putting the spec and impl of variant type in the Iceberg repo does
>>>>>>> >>>>>>>>>>>>> lose the opportunity for better native support from file formats like
>>>>>>> >>>>>>>>>>>>> Parquet and ORC.
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate project (e.g.
>>>>>>> >>>>>>>>>>>>> apache/variant-type) to make it a single point of truth. We can learn from
>>>>>>> >>>>>>>>>>>>> the experience of Apache Arrow. In this fashion, different engines, table
>>>>>>> >>>>>>>>>>>>> formats and file formats can follow the same spec and are free to depend on
>>>>>>> >>>>>>>>>>>>> the reference implementations from apache/variant-type or implement their
>>>>>>> >>>>>>>>>>>>> own.
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>> Best,
>>>>>>> >>>>>>>>>>>>> Gang
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <
>>>>>>> yezhao...@gmail.com
>>>>>>> >>>>>
>>>>>>> >>>>>>> wrote:
>>>>>>> >>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we need to
>>>>>>> >>>>>>>>>>>>>> own it fully as a part of the table spec, and we can build compatibility
>>>>>>> >>>>>>>>>>>>>> through tests.
>>>>>>> >>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>> -Jack
>>>>>>> >>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>>>>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>> >>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as that just
>>>>>>> >>>>>>>>>>>>>>> makes things more complicated and still is essentially forking just with
>>>>>>> >>>>>>>>>>>>>>> more steps. If we just track our annotations / modifications to a single
>>>>>>> >>>>>>>>>>>>>>> commit/version then we have the same issue again but now you have to go to
>>>>>>> >>>>>>>>>>>>>>> multiple sources to get the actual Spec. In addition, our very copy of the
>>>>>>> >>>>>>>>>>>>>>> Spec is going to require new types which don't exist in the Spark Spec
>>>>>>> >>>>>>>>>>>>>>> which necessarily means diverging. We will need to take up new primitive
>>>>>>> >>>>>>>>>>>>>>> IDs (as noted in my first email).
>>>>>>> >>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is really
>>>>>>> >>>>>>>>>>>>>>> going through a thorough review process from all members of the Spark
>>>>>>> >>>>>>>>>>>>>>> community. I believe it probably should have gone through the SPIP but
>>>>>>> >>>>>>>>>>>>>>> instead seems to have been merged without broad community involvement.
>>>>>>> >>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a single
>>>>>>> >>>>>>>>>>>>>>> copy of the spec. In our previous discussions the vast majority of the
>>>>>>> >>>>>>>>>>>>>>> Apache Iceberg community wanted it to exist here.
>>>>>>> >>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <
>>>>>>> >>>>> dwe...@apache.org
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> wrote:
>>>>>>> >>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant type to
>>>>>>> >>>>>>>>>>>>>>>> Iceberg, but I want to raise concerns about forking the spec.
>>>>>>> >>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the situation
>>>>>>> >>>>>>>>>>>>>>>> where we end up diverging because there's little reason to work with both
>>>>>>> >>>>>>>>>>>>>>>> communities to evolve in a way that benefits everyone.
>>>>>>> >>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the spec and
>>>>>>> >>>>>>>>>>>>>>>> annotate any variance in Iceberg's handling.  This would allow us to
>>>>>>> >>>>>>>>>>>>>>>> continue without dividing the communities.
>>>>>>> >>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I would
>>>>>>> >>>>>>>>>>>>>>>> support forking, but I don't feel like that should be the initial step.
>>>>>>> >>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the physical
>>>>>>> >>>>>>>>>>>>>>>> representations end up diverging, but it feels like we're setting ourselves
>>>>>>> >>>>>>>>>>>>>>>> up for that exact scenario.
>>>>>>> >>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>> -Dan
>>>>>>> >>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <
>>>>>>> >>>>>>> fo...@apache.org> wrote:
>>>>>>> >>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy the
>>>>>>> >>>>>>>>>>>>>>>>> spec to Iceberg and add context that's specific to Iceberg, but at the same
>>>>>>> >>>>>>>>>>>>>>>>> time, we should maintain compatibility.
>>>>>>> >>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>> Kind regards,
>>>>>>> >>>>>>>>>>>>>>>>> Fokko
>>>>>>> >>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>> On Wed, 14 Aug 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>> >>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way
>>>>>>> >>>>>>>>>>>>>>>>>> to keep compatibility is building integration tests.
>>>>>>> >>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>> Thanks,
>>>>>>> >>>>>>>>>>>>>>>>>> Manu
>>>>>>> >>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>>>>>>> >>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>> >>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>> >>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types and the
>>>>>>> >>>>>>>>>>>>>>>>>>> lack of interest from the other project, I think it is reasonable to
>>>>>>> >>>>>>>>>>>>>>>>>>> duplicate the specification to our repository.
>>>>>>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark
>>>>>>> >>>>>>>>>>>>>>>>>>> spec as much as possible, to keep compatibility as much as possible. Maybe
>>>>>>> >>>>>>>>>>>>>>>>>>> even revert to a shared specification if the situation changes.
>>>>>>> >>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>> >>>>>>>>>>>>>>>>>>> Peter
>>>>>>> >>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>> On Tue, 13 Aug 2024 at 19:52, Aihua Xu <aihu...@gmail.com> wrote:
>>>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>>>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the Variant
>>>>>>> >>>>>>>>>>>>>>>>>>>> support in Iceberg and hopefully we can have a consensus. To me, it
>>>>>>> >>>>>>>>>>>>>>>>>>>> also makes more sense to move the spec into Iceberg rather than have the
>>>>>>> >>>>>>>>>>>>>>>>>>>> Spark engine own it while we try to keep it compatible with the Spark spec.
>>>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>> >>>>>>>>>>>>>>>>>>>> Aihua
>>>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer
>>>>>>> <
>>>>>>> >>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal:
>>>>>>> >>>>>>>>>>>>>>>>>>>>> while we were hoping to move the Variant and Shredding specifications from
>>>>>>> >>>>>>>>>>>>>>>>>>>>> Spark into Iceberg, there doesn’t seem to be a lot of interest in that.
>>>>>>> >>>>>>>>>>>>>>>>>>>>> Unfortunately, I think we have a number of issues with just linking to the
>>>>>>> >>>>>>>>>>>>>>>>>>>>> Spark project directly from within Iceberg and I believe we need to copy
>>>>>>> >>>>>>>>>>>>>>>>>>>>> the specifications into our repository.
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark
>>>>>>> >>>>>>>>>>>>>>>>>>>>> Specification already includes types which Iceberg has no definition for
>>>>>>> >>>>>>>>>>>>>>>>>>>>> (19, 20 - Interval Types) and Iceberg already has a type which is not
>>>>>>> >>>>>>>>>>>>>>>>>>>>> included within the Spark Specification (Time) and will soon have more with
>>>>>>> >>>>>>>>>>>>>>>>>>>>> TimestampNS and Geo.
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard
>>>>>>> >>>>>>>>>>>>>>>>>>>>> dependency for other engines. We are working with several implementers of
>>>>>>> >>>>>>>>>>>>>>>>>>>>> the Iceberg spec and it has previously been agreed that it would be best if
>>>>>>> >>>>>>>>>>>>>>>>>>>>> the source of truth for Variant existed in an engine and file format
>>>>>>> >>>>>>>>>>>>>>>>>>>>> neutral location. The Iceberg project has a good open model of governance
>>>>>>> >>>>>>>>>>>>>>>>>>>>> and, as we have seen so far discussing Variant, open and active
>>>>>>> >>>>>>>>>>>>>>>>>>>>> collaboration. This would also help as we can strictly version our changes
>>>>>>> >>>>>>>>>>>>>>>>>>>>> in-line with the rest of the Iceberg spec.
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and
>>>>>>> >>>>>>>>>>>>>>>>>>>>> requires some group analysis and discussion before we commit it. I think
>>>>>>> >>>>>>>>>>>>>>>>>>>>> again the Iceberg community is probably the right place for this to happen
>>>>>>> >>>>>>>>>>>>>>>>>>>>> as we have already started discussions here on these topics.
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy
>>>>>>> >>>>>>>>>>>>>>>>>>>>> of the existing specification from the Spark Project and move ahead with
>>>>>>> >>>>>>>>>>>>>>>>>>>>> our discussions and modifications within Iceberg. That said, I do not want
>>>>>>> >>>>>>>>>>>>>>>>>>>>> to diverge if possible from the Spark proposal. For example, although we do
>>>>>>> >>>>>>>>>>>>>>>>>>>>> not use the Interval types above, I think we should not reuse those type
>>>>>>> >>>>>>>>>>>>>>>>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain
>>>>>>> >>>>>>>>>>>>>>>>>>>>> unused along with any other types we think are not applicable. We should
>>>>>>> >>>>>>>>>>>>>>>>>>>>> strive whenever possible to allow for compatibility.
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am
>>>>>>> >>>>>>>>>>>>>>>>>>>>> hoping to see if anyone in the community objects to this plan going forward
>>>>>>> >>>>>>>>>>>>>>>>>>>>> or has a better alternative.
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear
>>>>>>> >>>>>>>>>>>>>>>>>>>>> back from everyone,
>>>>>>> >>>>>>>>>>>>>>>>>>>>> Russ
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>>>> >>>>>>>>>>>>>>>>>>>>>
