Ah, thanks. I've tried to find a rationale and ended up on
https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it
a good description of what you're after?
If so, then I don't think Arrow is a good match. This seems mostly to be
a marshalling format for semi-structured data (like Avro?). Arrow data
types are meant to be in a representation ideal for querying and
computation, rather than transport and storage.
This could be developed separately and then be represented in Arrow
using an extension type (perhaps a canonical one as in
https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
What do other Arrow developers think?
Regards
Antoine.
On 22/08/2024 at 10:45, Gang Wu wrote:
Sorry for the inconvenience.
This is the permalink for the discussion:
https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org> wrote:
Hi Gang,
Sorry, but can you give a pointer to the start of this discussion thread
in a readable format (for example a mailing-list archive)? It appears
that dev@arrow wasn't cc'ed from the start and that can make it
difficult to understand what this is about.
Regards
Antoine.
On 22/08/2024 at 08:32, Gang Wu wrote:
It seems that we have reached a consensus to some extent that there
should be a new home for the variant spec. The pending question
is whether Parquet or Arrow is a better choice. As a committer in the
Arrow, Parquet and ORC communities, I have no preference between them
and am happy to help with the move once a decision has been made.
Should we start a vote to move forward?
Best,
Gang
On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:
That being said, I think the most important consideration for now is
where are the current maintainers / contributors to the variant type. If
most of them are already PMC members / committers on a project, it
becomes a bit easier. Otherwise if there isn't much overlap with a
project's existing governance, I worry there could be a bit of friction.
How many active contributors are there from Iceberg? And how about from
Arrow?
I think this is the key question. What are the requirements around
governance? I've seen some tangential messaging here but I'm not clear
on what everyone expects.
I think for a lot of the other concerns my view is that the exact
project does not really matter (and choosing a project with mature
cross-language testing infrastructure, or committing to building it, is
critical). IIUC we are talking about the following artifacts:
1. A stand-alone specification document (this can be hosted anyplace)
2. A set of language bindings with minimal dependencies that can be
consumed downstream (again, as long as dependencies are managed
carefully any project can host these)
3. Potential integration where appropriate into file format libraries to
support shredding (but as of now this is being bypassed by using
conventions anyways). My impression is that at least for Parquet there
has been a proliferation of vectorized readers across different
projects, so I'm not clear how much standardization in parquet-java
could help here.
To respond to some other questions:
Arrow is not used as Spark's in-memory model, nor Trino and others so
those existing relationships aren't there. I also worry that differences
in approaches would make it difficult later on.
While Arrow is not in the core memory model, for Spark I believe it is
still used for IPC for things like Java<->Python. Trino also consumes
Arrow libraries today to support things like Snowflake/BigQuery
federation. But I think this is minor because, as mentioned above, I
think the functional libraries would be relatively stand-alone.
Do we think it could be introduced as a canonical extension arrow type?
I believe it can be. I think there are probably different layouts that
can be supported:
1. A struct with two variable-width bytes columns (metadata and value
data are stored separately and each entry has a 1:1 relationship).
2. Shredded (shredded according to the same convention as Parquet). I
would need to double check, but I don't think Arrow would have problems
here; REE (run-end encoding) would likely be required to make this
efficient (i.e. sparse value support is important).
In both cases the main complexity is providing the necessary functions
for manipulation.
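
To make the two layouts above a bit more concrete, here is a rough
sketch in plain Python. It does not use any Arrow library; the class and
function names (BinaryColumn, VariantColumn, run_end_encode) are made up
for illustration, and the byte strings are opaque placeholders rather
than real variant encodings. It only models the shape of the data: an
offsets-plus-data buffer pair per variable-width column, two such
columns kept 1:1, and run-end encoding of a mostly-null shredded column.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class BinaryColumn:
    """Variable-width binary column modeled as offsets plus one data buffer."""

    offsets: List[int]  # always has (number of rows + 1) entries
    data: bytes

    @classmethod
    def from_values(cls, values: Sequence[bytes]) -> "BinaryColumn":
        offsets = [0]
        buf = bytearray()
        for v in values:
            buf.extend(v)
            offsets.append(len(buf))  # end offset of this row
        return cls(offsets, bytes(buf))

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> bytes:
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.data[self.offsets[i]:self.offsets[i + 1]]


@dataclass
class VariantColumn:
    """Layout 1: a struct of two binary children with one entry per row each."""

    metadata: BinaryColumn
    value: BinaryColumn

    def __post_init__(self) -> None:
        # The 1:1 relationship: both children must describe the same rows.
        if len(self.metadata) != len(self.value):
            raise ValueError("metadata and value must have the same length")

    def row(self, i: int) -> Tuple[bytes, bytes]:
        return self.metadata[i], self.value[i]


def run_end_encode(
    values: Sequence[Optional[int]],
) -> Tuple[List[int], List[Optional[int]]]:
    """Collapse consecutive equal values into (run_ends, run_values).

    run_ends holds the cumulative row index at which each run stops, so a
    long stretch of nulls in a sparse shredded column costs a single run.
    """
    run_ends: List[int] = []
    run_values: List[Optional[int]] = []
    for i, v in enumerate(values):
        if run_values and run_values[-1] == v:
            run_ends[-1] = i + 1  # extend the current run
        else:
            run_values.append(v)
            run_ends.append(i + 1)
    return run_ends, run_values


# Two rows of placeholder bytes, then a sparse shredded column.
col = VariantColumn(
    BinaryColumn.from_values([b"m0", b"m1"]),
    BinaryColumn.from_values([b"\x01", b"\x02\x03"]),
)
print(col.row(1))
print(run_end_encode([None, None, None, 7, 7, None]))
```

None of this addresses the hard part mentioned above, namely the
functions needed to manipulate such values; it is only meant to show
what the storage would roughly look like.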
Thanks,
Micah
On Fri, Aug 16, 2024 at 3:58 PM Will Jones <will.jones...@gmail.com>
wrote:
In being more engine and format agnostic, I agree the Arrow project
might be a good host for such a specification. It seems like we want to
move away from hosting in Spark to make it engine agnostic. But moving
into Iceberg might make it less format agnostic, as I understand
multiple formats might want to implement this. I'm not intimately
familiar with the state of this, but I believe Delta Lake would like to
be aligned with the same format as Iceberg. In addition, the Lance
format (which I work on) will eventually be interesting as well. It
seems equally bad to me to attach this specification to a particular
table format as to a particular query engine.
That being said, I think the most important consideration for now is
where are the current maintainers / contributors to the variant type. If
most of them are already PMC members / committers on a project, it
becomes a bit easier. Otherwise if there isn't much overlap with a
project's existing governance, I worry there could be a bit of friction.
How many active contributors are there from Iceberg? And how about from
Arrow?
BTW, I'd add I'm interested in helping develop an Arrow extension type
for the binary variant type. I've been experimenting with a DataFusion
extension that operates on this [1], and already have some ideas on how
such an extension type might be defined. I'm not yet caught up on the
shredded specification, but I think having just the binary format would
be beneficial for in-memory analytics, which are most relevant to Arrow.
I'll be creating a separate thread on the Arrow ML about this soon.
Best,
Will Jones
[1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote:
+ dev@arrow
Thanks for all the valuable suggestions! I am inclined toward Micah's
idea that Arrow might be a better host compared to Parquet.
To give more context, I am taking the initiative to add the geometry
type to both Parquet and ORC. I'd like to do the same thing for the
variant type, in that the variant type is engine and file format
agnostic. This does mean that Parquet might not be the neutral place to
hold the variant spec.
Best,
Gang
On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <jingsongl...@gmail.com>
wrote:
Thanks all for your discussion.
The Apache Paimon community is also considering support for this
Variant type. Without a doubt, we hope to maintain consistency with
Iceberg.
Not only the Paimon community, but also various computing engines need
to adapt to this type, such as Flink and StarRocks. We also hope to
encourage them to adapt to this type.
It is worth noting that we also need to standardize many functions
related to it.
A neutral place to maintain it is a great choice.
- As Gang Wu said, a standalone project is good, just like
RoaringBitmap [1].
- As Ryan said, the Parquet community is a neutral option too.
- As Micah said, Arrow is an option too.
[1] https://github.com/RoaringBitmap
Best,
Jingsong
On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:
That's fair @Micah, so far all the discussions have been direct and off
the dev list. Would you like to make the request on the public Spark Dev
list? I would be glad to co-sign, I can also draft up a quick email if
you don't have time.
I think once we come to consensus, if you have bandwidth, I think the
message might be better coming from you, as you have more context on
some of the non-public conversations, the requirements from an Iceberg
perspective on governance and the blockers that were encountered. If
details on the conversations can't be shared (i.e. we are starting from
scratch), it seems like suggesting a new project via SPIP might be the
way forward. I'm happy to help with that if it is useful but I would
guess Aihua or Tyler might be in a better place to start as it seems
they have done more serious thinking here.
If we decide to try to standardize on Parquet or Arrow I'm happy to
help support the effort in those communities.
Thanks,
Micah
On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer
<russell.spit...@gmail.com> wrote:
That's fair @Micah, so far all the discussions have been direct and off
the dev list. Would you like to make the request on the public Spark Dev
list? I would be glad to co-sign, I can also draft up a quick email if
you don't have time.
On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:
I agree that it would be beneficial to make a sub-project, the main
problem is political and not logistic. I've been asking for movement
from other relevant projects for a month and we simply haven't gotten
anywhere.
I just wanted to double check that these issues were brought directly
to the Spark community (i.e. a discussion thread on the Spark developer
mailing list) and not via backchannels.
I'm not sure the outcome would be different and I don't think this
should block forking the spec, but we should make sure that the decision
is publicly documented within both communities.
Thanks,
Micah
On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer
<russell.spit...@gmail.com> wrote:
@Gang Wu
I agree that it would be beneficial to make a sub-project, the main
problem is political and not logistic. I've been asking for movement
from other relevant projects for a month and we simply haven't gotten
anywhere. I don't think there is anything that would stop us from moving
to a joint project in the future and if you know of some way of
encouraging that movement from other relevant parties I would be glad to
collaborate in doing that. One thing that I don't want to do is have the
Iceberg project stay in a holding pattern without any clear roadmap as
to how to proceed.
On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
I’m on board with copying the spec into our repository. However, as
we’ve talked about, it’s not just a straightforward copy: there are
already some divergences. Some of them are under discussion. Iceberg is
definitely the best place for these specs. Engines like Trino and Flink
can then rely on the Iceberg specs as a solid foundation.
Yufei
On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
Sorry for chiming in late.
From the discussion in
https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
don't quite understand why it is logistically complicated to create a
sub-project to hold the variant spec and impl.
IMHO, copying the variant type spec into Apache Iceberg has some
deficiencies:
- It is a burden to update two repos if there is a variant type spec
change, and it will likely result in deviation if some changes do not
reach agreement from both parties.
- Implementers are required to keep an eye on both specs (considering
proprietary engines where both Iceberg and Delta are supported).
- Putting the spec and impl of the variant type in the Iceberg repo
loses the opportunity for better native support from file formats like
Parquet and ORC.
I'm not sure if it is possible to create a separate project (e.g.
apache/variant-type) to make it a single point of truth. We can learn
from the experience of Apache Arrow. In this fashion, different engines,
table formats and file formats can follow the same spec and are free to
depend on the reference implementations from apache/variant-type or
implement their own.
Best,
Gang
On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
+1 for copying the spec into our repository, I think we need to own it
fully as a part of the table spec, and we can build compatibility
through tests.
-Jack
On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer
<russell.spit...@gmail.com> wrote:
I'm not really in favor of linking and annotating as that just makes
things more complicated and still is essentially forking just with more
steps. If we just track our annotations / modifications to a single
commit/version then we have the same issue again but now you have to go
to multiple sources to get the actual Spec. In addition, our very copy
of the Spec is going to require new types which don't exist in the Spark
Spec, which necessarily means diverging. We will need to take up new
primitive ids (as noted in my first email).
The other issue I have is I don't think the Spark Spec is really going
through a thorough review process from all members of the Spark
community. I believe it probably should have gone through the SPIP
process but instead seems to have been merged without broad community
involvement.
The only way to truly avoid diverging is to only have a single copy of
the spec. In our previous discussions the vast majority of the Apache
Iceberg community want it to exist here.
On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
I'm really excited about the introduction of variant type to Iceberg,
but I want to raise concerns about forking the spec.
I feel like preemptively forking would create the situation where we end
up diverging because there's little reason to work with both communities
to evolve in a way that benefits everyone.
I would much rather point to a specific version of the spec and annotate
any variance in Iceberg's handling. This would allow us to continue
without dividing the communities.
If at any point there are irreconcilable differences, I would support
forking, but I don't feel like that should be the initial step.
No one is excited about the possibility that the physical
representations end up diverging, but it feels like we're setting
ourselves up for that exact scenario.
-Dan
On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org>
wrote:
+1 to what's already being said here. It is good to copy the spec to
Iceberg and add context that's specific to Iceberg, but at the same
time, we should maintain compatibility.
Kind regards,
Fokko
On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com>
wrote:
+1 to copy the spec into our repository. I think the best way to keep
compatibility is building integration tests.
Thanks,
Manu
On Wed, Aug 14, 2024 at 8:27 PM Péter Váry
<peter.vary.apa...@gmail.com> wrote:
Thanks Russell and Aihua for pushing Variant support!
Given the differences between the supported types and the lack of
interest from the other project, I think it is reasonable to duplicate
the specification to our repository.
I would give very strong emphasis on sticking to the Spark spec as much
as possible, to keep compatibility as much as possible. Maybe even
revert to a shared specification if the situation changes.
Thanks,
Peter
Aihua Xu <aihu...@gmail.com> wrote (on Tue, Aug 13, 2024, 19:52):
Thanks Russell for bringing this up.
This is the main blocker to move forward with the Variant support in
Iceberg and hopefully we can have a consensus. To me, it also makes more
sense to move the spec into Iceberg rather than have the Spark engine
own it while we try to stay compatible with the Spark spec.
Thanks,
Aihua
On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer
<russell.spit...@gmail.com> wrote:
Hi Y’all,
We’ve hit a bit of a roadblock with the Variant Proposal. While we were
hoping to move the Variant and Shredding specifications from Spark into
Iceberg, there doesn’t seem to be a lot of interest in that.
Unfortunately, I think we have a number of issues with just linking to
the Spark project directly from within Iceberg and I believe we need to
copy the specifications into our repository.
There are a few reasons why I think this is necessary:
First, we have a divergence of types already. The Spark Specification
already includes types which Iceberg has no definition for (19, 20 -
Interval Types) and Iceberg already has a type which is not included
within the Spark Specification (Time) and will soon have more with
TimestampNS, and Geo.
Second, we would like to make sure that Spark is not a hard dependency
for other engines. We are working with several implementers of the
Iceberg spec and it has previously been agreed that it would be best if
the source of truth for Variant existed in an engine and file format
neutral location. The Iceberg project has a good open model of
governance and, as we have seen so far discussing Variant, open and
active collaboration. This would also help as we can strictly version
our changes in-line with the rest of the Iceberg spec.
Third, the Shredding spec is not quite finished and requires some group
analysis and discussion before we commit it. I think again the Iceberg
community is probably the right place for this to happen as we have
already started discussions here on these topics.
For these reasons I think we should go with a direct copy of the
existing specification from the Spark Project and move ahead with our
discussions and modifications within Iceberg. That said, I do not want
to diverge if possible from the Spark proposal. For example, although we
do not use the Interval types above, I think we should not reuse those
type ids within our spec. Iceberg's Variant Spec types 19 and 20 would
remain unused along with any other types we think are not applicable. We
should strive whenever possible to allow for compatibility.
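
As a rough illustration of the "do not reuse those type ids" rule, here
is a small Python sketch. Only the ids 19 and 20 come from the
discussion above; the TypeIdRegistry class, its methods, the id 18 and
the type names are hypothetical and exist purely to show the idea of
reserving upstream-only ids so that a copied spec can never hand them
out for something else.

```python
# Ids this thread says the Spark spec uses for interval types; a copied
# Iceberg spec would leave them unused instead of reassigning them.
UPSTREAM_ONLY_IDS = frozenset({19, 20})


class TypeIdRegistry:
    """Hypothetical registry that refuses to reuse upstream-only ids."""

    def __init__(self, reserved=UPSTREAM_ONLY_IDS):
        self._reserved = frozenset(reserved)
        self._assigned = {}  # type id -> type name

    def register(self, type_id, name):
        if type_id in self._reserved:
            raise ValueError(f"id {type_id} is reserved for an upstream-only type")
        if type_id in self._assigned:
            raise ValueError(
                f"id {type_id} is already assigned to {self._assigned[type_id]}"
            )
        self._assigned[type_id] = name

    def next_free_id(self, start=0):
        # Skip ids that are either reserved upstream or already taken here.
        candidate = start
        while candidate in self._reserved or candidate in self._assigned:
            candidate += 1
        return candidate


registry = TypeIdRegistry()
registry.register(18, "placeholder_type")  # 18 is an arbitrary example id
new_id = registry.next_free_id(start=18)   # skips 18, 19 and 20
registry.register(new_id, "another_placeholder_type")
print(new_id)
```

With this kind of bookkeeping, a value written with id 19 or 20 by one
implementation can never be misread as a different type by another.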
In the interest of moving forward with this proposal I am hoping to see
if anyone in the community objects to this plan going forward or has a
better alternative.
As always I am thankful for your time and am eager to hear back from
everyone,
Russ