On Sat, May 25, 2019 at 12:29 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Thanks for the responses, I've clipped the questions and provided responses
> inline.
>
> is the proposal that both cpp & java will support only option 2 ?
> > I guess 1 is a subset of 2 anyway.
>
> CPP already supports option 2.  I would like to make CPP and java
> compatible, in a way that this acceptable for Java maintainers.  Yes, 1 is
> a subset of 2.
>
> The metadata on java side uses the minor type id as the type identifier in
> > the union (and the field name is expected to the the same as the type
> name
> > in a union). If you were to support a generalized union, this wouldn't
> > work. How will the type identifiers be generated ?
> > I'm trying to see if we can make the change backward compatible, with
> > existing unions in java.
>
>
> Looking at the code, I don't think the existing Union class is
> generalizable because of this assumption (it caches a single type of each
> locally) and based on the Javadoc this seems to be for performance reasons,
> so I would like to try to avoid touching it if possible.
>

@Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I meant to
ask if the format would be backward compatible, which I think it will be
(since 1 is a subset of 2, and your proposal isn't making a change in the
wire format).

I'm fine if the APIs are not backward compatible. Or, once we have 2, we
can add wrappers for 1, if required.


>
> My original thinking was to try to factor out a common base class from
> UnionVector, then create a new GeneralizedUnionVector class that has
> slightly different method signatures (its possible I will need additional
> supporting classes like a new GeneralizedUnionWriter, but i haven't gotten
> that far yet).  The main challenge I see is a way to let users switch
> between the two implementations.  Some thoughts off the top of my head
> (none of them seem good):
> 1.  Create a flag like:
>
> https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2ef465d331fc/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> so
> it is statically decided before hand, and have the new class implement the
> same signatures as UnionVector to but throw an exception if a method that
> isn't compatible is called.
> 2. Possibly try to augment ArrowType to pass through information about its
> children vectors when reading vectors, but use the flag in option 1 if it
> can't be determined.
>
> I'm open to suggestions.  I'm also happy to try to prototype something and
> get feedback once there is concrete code to evaluate.
>
> I don't understand the limitation to different types, so +1 for
> > generalized unions.  That said, I don't think it's high-priority either.
>
>
> Antoine, the fact that it isn't high-priority probably  is why it has taken
> so long to resolve.  I'm excited to get to more interesting higher priority
> work, but I would like to see some of the basics finished off first.  BTW,
> if you have suggestions on priorities, I'd be happy to hear them.  After
> this, handling empty record batches, and the UBSan work I'm doing, I was
> thinking of either trying to get Avro support in, or work on fit and finish
> items for the C++/Python CSV reader.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> [2]
>
> https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a0349675e488caf11/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L104
>
> On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > I don't understand the limitation to different types, so +1 for
> > generalized unions.  That said, I don't think it's high-priority either.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 24/05/2019 à 04:17, Micah Kornfield a écrit :
> > > I'd like to bump this thread, to see if anyone has any comments.  If
> > nobody
> > > objects I will try to start implementing the changes next week.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Mon, May 20, 2019 at 9:37 PM Micah Kornfield <emkornfi...@gmail.com
> >
> > > wrote:
> > >
> > >> In the past [1] there hasn't been agreement on the final requirements
> > for
> > >> union types.
> > >>
> > >> Briefly the two approaches that are currently advocated:
> > >> 1.  Limit unions to only contain one field of each individual type
> (e.g.
> > >> you can't have two separate int32 fields).  Java takes this approach.
> > >> 2.  Generalized unions (unions can have any number of fields with the
> > same
> > >> type).  C++ takes this approach.
> > >>
> > >> There was a prior PR [2] that stalled in trying to take this approach
> > with
> > >> Java.  For writing vectors it seemed to be slower on a benchmark.
> > >>
> > >> My proposal:  We should pursue option 2 (the general approach).  There
> > are
> > >> already data interchange formats that support it and it would be nice
> > to a
> > >> data-model that lets us make the translation between Arrow schemas
> easy:
> > >> 1.  Avro Seems to support it [3] (with the exception of complex types)
> > >> 2.  Protobufs loosely support it [4] via one-of.
> > >>
> > >> In order to address issues in [2], I propose the following making the
> > >> changes/additions to the Java implementation:
> > >> 1.  Keep the default write-path untouched with the existing class.
> > >> 2.  Add in a new sparse union class that implements the same interface
> > >> that can be used on the read path, and if a client opts in (via direct
> > >> construction).
> > >> 3.  Add in a dense union class (I don't believe Java has one).
> > >>
> > >> I'm still ramping up the Java code base, so I'd like other Java
> > >> contributors to chime in to see if this plan sounds feasible and
> > acceptable.
> > >>
> > >> Any other thoughts on Unions?
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1]
> > >>
> >
> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
> > >> [2] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > >> [3] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > >> [4] https://developers.google.com/protocol-buffers/docs/proto#oneof
> > >>
> > >
> >
>


-- 
Thanks and regards,
Ravindra.

Reply via email to