Hi Wes,

Thanks for your clarification.
I agree with you that the problem should be considered in the
implementation level.

Best,
Liya Fan

On Mon, Nov 25, 2019 at 10:34 AM Wes McKinney <wesmck...@gmail.com> wrote:

> On Sun, Nov 24, 2019 at 8:07 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > I agree with you that this is a data representation issue.
> >
> > My point is that, data representation and data operation are closely
> > related.
> > As far as this issue is concerned, if we allow several values in the
> union
> > vector to be mapped to the same value in the underlying vector, it is
> > possible that when we modify one value in the union vector, the other
> value
> > is also modified, which is unexpected.
>
> Right, but Arrow columnar data is immutable, so any mutation
> operations are application/implementation-level concerns and should
> not influence the specification documents. Implementations need to be
> aware of the implications of the specification, of course.
>
> > This is a problem with our current specification, because our
> > vectors/arrays provide set/write APIs.
> > So we may need a "coherency protocol" to define the behavior (e.g. copy
> on
> > write) when trying to modify a shared value, IMO.
>
> It's an application/implementation-level concern so I think it would
> need to be addressed separately from clarifying the specification.
>
> >
> > Best,
> > Liya Fan
> >
> > On Sat, Nov 23, 2019 at 3:31 AM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> > > hi Liya,
> > >
> > > I don't understand your point -- we are strictly discussing data
> > > representation here I believe. From a data representation perspective,
> > > there is no conflict with repeated or non-monotonic offset values.
> > >
> > > On Fri, Nov 22, 2019 at 1:49 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > > >
> > > > This is an interesting question.
> > > > IMO, to support repeated values, we also need to design a "coherency
> > > > protocol", to avoid the scenario where once a value is witten, the
> change
> > > > is propagated to another slot unexpectedly.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hmm, I also thought the intention was monotonically increasing. I
> can't
> > > > > think of a strong reason one way or another. If the argument about
> > > code to
> > > > > do random access is the same in all cases, is there any benefit to
> > > forcing
> > > > > any order at all?  Memory prefetching?
> > > > >
> > > > > On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <wesmck...@gmail.com
> >
> > > wrote:
> > > > >
> > > > > > hi Antoine,
> > > > > >
> > > > > > It's a good question.
> > > > > >
> > > > > > The intent when we wrote the specification was to be strictly
> > > > > > monotonic, but there seems nothing especially harmful about
> relaxing
> > > > > > the constraint to allow for repeated values or even
> non-monotonicity
> > > > > > (strict or otherwise). For example, if we had the union
> > > > > >
> > > > > > ['a', 'a', 'a', 0, 1, 'b', 'b']
> > > > > >
> > > > > > then this could be represented as
> > > > > >
> > > > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > > > offsets: [0, 0, 0, 0, 1, 1, 1]
> > > > > > child[0]: ['a', 'b']
> > > > > > child[1]: [0, 1]
> > > > > >
> > > > > > or
> > > > > >
> > > > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > > > offsets: [1, 1, 1, 0, 1, 0, 0]
> > > > > > child[0]: ['b', 'a']
> > > > > > child[1]: [0, 1]
> > > > > >
> > > > > > What do others think? Either way some clarification in the
> > > > > > specification would be useful. Because the code used to do random
> > > > > > access is the same in all cases, I feel weakly supportive of
> removing
> > > > > > constraints on the offsets.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <
> anto...@python.org>
> > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I'd like some clarification on the spec and intent for dense
> > > arrays.
> > > > > > >
> > > > > > > Currently, it is specified that offsets of a dense union are
> "in
> > > order
> > > > > /
> > > > > > > increasing" (*).  However, it is not obvious whether repeated
> > > values
> > > > > are
> > > > > > > allowed or not.
> > > > > > >
> > > > > > > I suspect the intent is to avoid having people exploit unions
> as
> > > some
> > > > > > > kind of poor man's dictionaries.  Also, perhaps some
> optimizations
> > > are
> > > > > > > possible if monotonic or strictly monotonic indices are
> assumed?
> > > But I
> > > > > > > don't know the history behind the union type.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > >
> > > > > > > (*)
> https://arrow.apache.org/docs/format/Columnar.html#dense-union
> > > > > >
> > > > >
> > >
>

Reply via email to