https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88

"optionally typeIds provides an indirection between the child offset
and the type id for each child typeIds[offset] is the id used in the
type vector"

On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman <ben.kietz...@rstudio.com> wrote:
>
> Somewhat related:
>
> Could we clarify the expected content of the type_ids buffer of union
> arrays? Layout.rst
> <https://github.com/apache/arrow/blob/dede1e6/docs/source/format/Layout.rst#dense-union-type>
> seems to indicate these should be indices of the corresponding child array,
> but the C++ implementation allows them to be any positive int8 and
> maintains a mapping to child indices. (for example see what is generated
> for integration testing
> <https://github.com/apache/arrow/blob/dede1e6/cpp/src/arrow/ipc/test-common.cc#L410>:
> has two child arrays and type_ids 5, 10)
>
> On Mon, Jun 17, 2019 at 11:35 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Sounds good.  Sorry I got distracted with some other stuff but should be
> > getting back to this soonish
> >
> > On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > I'd already moved the Union issues to 1.0.0 so we are all good there
> > >
> > > On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > > >
> > > > I'm also +1 for generalized unions as we currently have specified. The
> > > > objections from the Java users seems to be mostly on the basis of
> > > > performance in the union-of-primitives case -- that's an
> > > > implementation specific issue, so if Java needs to have a
> > > > "GeneralizedDenseUnionVector" or something to handle the
> > > > union-of-anything case, then that seems reasonable. The important
> > > > thing is that the binary protocol itself and serialized metadata is
> > > > something that we are happy with and won't need to change going
> > > > forward.
> > > >
> > > > It seems we're getting a bit long in the tooth to get this into 0.14.0
> > > > so I'm going to move the Union-related issues to the 1.0.0 milestone
> > > > so we can get this resolved for the 1.0.0 release
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura <
> > ravin...@dremio.com>
> > > wrote:
> > > > >
> > > > > On Sat, May 25, 2019 at 12:29 PM Micah Kornfield <
> > > emkornfi...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks for the responses, I've clipped the questions and provided
> > > responses
> > > > > > inline.
> > > > > >
> > > > > > is the proposal that both cpp & java will support only option 2 ?
> > > > > > > I guess 1 is a subset of 2 anyway.
> > > > > >
> > > > > > CPP already supports option 2.  I would like to make CPP and java
> > > > > > compatible, in a way that this acceptable for Java maintainers.
> > > Yes, 1 is
> > > > > > a subset of 2.
> > > > > >
> > > > > > The metadata on java side uses the minor type id as the type
> > > identifier in
> > > > > > > the union (and the field name is expected to the the same as the
> > > type
> > > > > > name
> > > > > > > in a union). If you were to support a generalized union, this
> > > wouldn't
> > > > > > > work. How will the type identifiers be generated ?
> > > > > > > I'm trying to see if we can make the change backward compatible,
> > > with
> > > > > > > existing unions in java.
> > > > > >
> > > > > >
> > > > > > Looking at the code, I don't think the existing Union class is
> > > > > > generalizable because of this assumption (it caches a single type
> > of
> > > each
> > > > > > locally) and based on the Javadoc this seems to be for performance
> > > reasons,
> > > > > > so I would like to try to avoid touching it if possible.
> > > > > >
> > > > >
> > > > > @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I
> > > meant to
> > > > > ask if the format would be backward compatible, which I think it will
> > > be
> > > > > (since 1 is a subset of 2, and your proposal isn't making a change in
> > > the
> > > > > wire format).
> > > > >
> > > > > I'm fine if the APIs are not backward compatible. Or, once we have 2,
> > > we
> > > > > can add wrappers for 1, if required.
> > > > >
> > > > >
> > > > > >
> > > > > > My original thinking was to try to factor out a common base class
> > > from
> > > > > > UnionVector, then create a new GeneralizedUnionVector class that
> > has
> > > > > > slightly different method signatures (its possible I will need
> > > additional
> > > > > > supporting classes like a new GeneralizedUnionWriter, but i haven't
> > > gotten
> > > > > > that far yet).  The main challenge I see is a way to let users
> > switch
> > > > > > between the two implementations.  Some thoughts off the top of my
> > > head
> > > > > > (none of them seem good):
> > > > > > 1.  Create a flag like:
> > > > > >
> > > > > >
> > https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e
> > > f465d331fc/java/memory/src/main/java/org/apache/arrow/
> > > memory/BoundsChecking.java
> > > > > > so
> > > > > > it is statically decided before hand, and have the new class
> > > implement the
> > > > > > same signatures as UnionVector to but throw an exception if a
> > method
> > > that
> > > > > > isn't compatible is called.
> > > > > > 2. Possibly try to augment ArrowType to pass through information
> > > about its
> > > > > > children vectors when reading vectors, but use the flag in option 1
> > > if it
> > > > > > can't be determined.
> > > > > >
> > > > > > I'm open to suggestions.  I'm also happy to try to prototype
> > > something and
> > > > > > get feedback once there is concrete code to evaluate.
> > > > > >
> > > > > > I don't understand the limitation to different types, so +1 for
> > > > > > > generalized unions.  That said, I don't think it's high-priority
> > > either.
> > > > > >
> > > > > >
> > > > > > Antoine, the fact that it isn't high-priority probably  is why it
> > > has taken
> > > > > > so long to resolve.  I'm excited to get to more interesting higher
> > > priority
> > > > > > work, but I would like to see some of the basics finished off
> > > first.  BTW,
> > > > > > if you have suggestions on priorities, I'd be happy to hear them.
> > > After
> > > > > > this, handling empty record batches, and the UBSan work I'm doing,
> > I
> > > was
> > > > > > thinking of either trying to get Avro support in, or work on fit
> > and
> > > finish
> > > > > > items for the C++/Python CSV reader.
> > > > > >
> > > > > > Thanks,
> > > > > > Micah
> > > > > >
> > > > > > [1]
> > https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > > > > > [2]
> > > > > >
> > > > > >
> > https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967
> > > 5e488caf11/java/vector/src/main/java/org/apache/arrow/
> > > vector/types/pojo/Field.java#L104
> > > > > >
> > > > > > On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org
> > >
> > > wrote:
> > > > > >
> > > > > > >
> > > > > > > I don't understand the limitation to different types, so +1 for
> > > > > > > generalized unions.  That said, I don't think it's high-priority
> > > either.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > >
> > > > > > > Le 24/05/2019 à 04:17, Micah Kornfield a écrit :
> > > > > > > > I'd like to bump this thread, to see if anyone has any
> > > comments.  If
> > > > > > > nobody
> > > > > > > > objects I will try to start implementing the changes next week.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Micah
> > > > > > > >
> > > > > > > > On Mon, May 20, 2019 at 9:37 PM Micah Kornfield <
> > > emkornfi...@gmail.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> In the past [1] there hasn't been agreement on the final
> > > requirements
> > > > > > > for
> > > > > > > >> union types.
> > > > > > > >>
> > > > > > > >> Briefly the two approaches that are currently advocated:
> > > > > > > >> 1.  Limit unions to only contain one field of each individual
> > > type
> > > > > > (e.g.
> > > > > > > >> you can't have two separate int32 fields).  Java takes this
> > > approach.
> > > > > > > >> 2.  Generalized unions (unions can have any number of fields
> > > with the
> > > > > > > same
> > > > > > > >> type).  C++ takes this approach.
> > > > > > > >>
> > > > > > > >> There was a prior PR [2] that stalled in trying to take this
> > > approach
> > > > > > > with
> > > > > > > >> Java.  For writing vectors it seemed to be slower on a
> > > benchmark.
> > > > > > > >>
> > > > > > > >> My proposal:  We should pursue option 2 (the general
> > > approach).  There
> > > > > > > are
> > > > > > > >> already data interchange formats that support it and it would
> > > be nice
> > > > > > > to a
> > > > > > > >> data-model that lets us make the translation between Arrow
> > > schemas
> > > > > > easy:
> > > > > > > >> 1.  Avro Seems to support it [3] (with the exception of
> > complex
> > > types)
> > > > > > > >> 2.  Protobufs loosely support it [4] via one-of.
> > > > > > > >>
> > > > > > > >> In order to address issues in [2], I propose the following
> > > making the
> > > > > > > >> changes/additions to the Java implementation:
> > > > > > > >> 1.  Keep the default write-path untouched with the existing
> > > class.
> > > > > > > >> 2.  Add in a new sparse union class that implements the same
> > > interface
> > > > > > > >> that can be used on the read path, and if a client opts in
> > (via
> > > direct
> > > > > > > >> construction).
> > > > > > > >> 3.  Add in a dense union class (I don't believe Java has one).
> > > > > > > >>
> > > > > > > >> I'm still ramping up the Java code base, so I'd like other
> > Java
> > > > > > > >> contributors to chime in to see if this plan sounds feasible
> > and
> > > > > > > acceptable.
> > > > > > > >>
> > > > > > > >> Any other thoughts on Unions?
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Micah
> > > > > > > >>
> > > > > > > >> [1]
> > > > > > > >>
> > > > > > >
> > > > > >
> > https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee
> > > 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
> > > > > > > >> [2] https://github.com/apache/arrow/pull/987#issuecomment-
> > > 493231493
> > > > > > > >> [3] https://github.com/apache/arrow/pull/987#issuecomment-
> > > 493231493
> > > > > > > >> [4]
> > https://developers.google.com/protocol-buffers/docs/proto#
> > > oneof
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks and regards,
> > > > > Ravindra.
> > >
> >

Reply via email to