Hi Wes,

Thanks for your clarification. I agree with you that the problem should be considered at the implementation level.
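To make the implementation-level concern concrete, here is a minimal sketch using plain Python lists, not any Arrow API. The helper names (`read_all`, `naive_set`, `cow_set`, `make_union`) are hypothetical, and it assumes each type id is simply the index of its child. It shows how a write through a shared offset leaks into other slots, and how a copy-on-write style update could avoid that.

```python
def read_all(type_ids, offsets, children):
    """Materialize every logical slot of the (toy) dense union."""
    return [children[t][o] for t, o in zip(type_ids, offsets)]


def naive_set(type_ids, offsets, children, i, value):
    """Write through the child slot; every slot sharing that offset changes too."""
    children[type_ids[i]][offsets[i]] = value


def cow_set(type_ids, offsets, children, i, value):
    """Copy-on-write style: append a fresh child value and repoint only slot i."""
    child = children[type_ids[i]]
    child.append(value)
    offsets[i] = len(child) - 1


def make_union():
    # The union ['a', 'a', 'a', 0, 1, 'b', 'b'] with repeated offsets.
    return [0, 0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1, 1], [["a", "b"], [0, 1]]


t, o, c = make_union()
naive_set(t, o, c, 0, "x")
after_naive = read_all(t, o, c)  # slots 1 and 2 change as well

t, o, c = make_union()
cow_set(t, o, c, 0, "x")
after_cow = read_all(t, o, c)    # only slot 0 changes
```

Note that the copy-on-write variant leaves the offsets non-monotonic, so it only works if the specification permits that.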
Best,
Liya Fan

On Mon, Nov 25, 2019 at 10:34 AM Wes McKinney <wesmck...@gmail.com> wrote:

> On Sun, Nov 24, 2019 at 8:07 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > I agree with you that this is a data representation issue.
> >
> > My point is that data representation and data operation are closely
> > related.
> > As far as this issue is concerned, if we allow several values in the union
> > vector to be mapped to the same value in the underlying vector, it is
> > possible that when we modify one value in the union vector, the other value
> > is also modified, which is unexpected.
>
> Right, but Arrow columnar data is immutable, so any mutation
> operations are application/implementation-level concerns and should
> not influence the specification documents. Implementations need to be
> aware of the implications of the specification, of course.
>
> > This is a problem with our current specification, because our
> > vectors/arrays provide set/write APIs.
> > So we may need a "coherency protocol" to define the behavior (e.g. copy on
> > write) when trying to modify a shared value, IMO.
>
> It's an application/implementation-level concern so I think it would
> need to be addressed separately from clarifying the specification.
>
> >
> > Best,
> > Liya Fan
> >
> > On Sat, Nov 23, 2019 at 3:31 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi Liya,
> > >
> > > I don't understand your point -- we are strictly discussing data
> > > representation here I believe. From a data representation perspective,
> > > there is no conflict with repeated or non-monotonic offset values.
> > >
> > > On Fri, Nov 22, 2019 at 1:49 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > >
> > > > This is an interesting question.
> > > > IMO, to support repeated values, we also need to design a "coherency
> > > > protocol", to avoid the scenario where once a value is written, the change
> > > > is propagated to another slot unexpectedly.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield <emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hmm, I also thought the intention was monotonically increasing. I can't
> > > > > think of a strong reason one way or another. If the argument about code to
> > > > > do random access is the same in all cases, is there any benefit to forcing
> > > > > any order at all? Memory prefetching?
> > > > >
> > > > > On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <wesmck...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > hi Antoine,
> > > > > >
> > > > > > It's a good question.
> > > > > >
> > > > > > The intent when we wrote the specification was to be strictly
> > > > > > monotonic, but there seems nothing especially harmful about relaxing
> > > > > > the constraint to allow for repeated values or even non-monotonicity
> > > > > > (strict or otherwise). For example, if we had the union
> > > > > >
> > > > > > ['a', 'a', 'a', 0, 1, 'b', 'b']
> > > > > >
> > > > > > then this could be represented as
> > > > > >
> > > > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > > > offsets: [0, 0, 0, 0, 1, 1, 1]
> > > > > > child[0]: ['a', 'b']
> > > > > > child[1]: [0, 1]
> > > > > >
> > > > > > or
> > > > > >
> > > > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > > > offsets: [1, 1, 1, 0, 1, 0, 0]
> > > > > > child[0]: ['b', 'a']
> > > > > > child[1]: [0, 1]
> > > > > >
> > > > > > What do others think? Either way some clarification in the
> > > > > > specification would be useful. Because the code used to do random
> > > > > > access is the same in all cases, I feel weakly supportive of removing
> > > > > > constraints on the offsets.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <anto...@python.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I'd like some clarification on the spec and intent for dense arrays.
> > > > > > >
> > > > > > > Currently, it is specified that offsets of a dense union are "in order /
> > > > > > > increasing" (*). However, it is not obvious whether repeated values are
> > > > > > > allowed or not.
> > > > > > >
> > > > > > > I suspect the intent is to avoid having people exploit unions as some
> > > > > > > kind of poor man's dictionaries. Also, perhaps some optimizations are
> > > > > > > possible if monotonic or strictly monotonic indices are assumed? But I
> > > > > > > don't know the history behind the union type.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
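
For reference, the claim in the quoted example that random access uses the same code regardless of offset ordering can be checked with a short sketch. This uses plain Python lists rather than an Arrow library; `union_get` and `union_to_list` are hypothetical names, and it assumes each type id is the index of its child array.

```python
def union_get(type_ids, offsets, children, i):
    """Random access into a (toy) dense union: pick the child, then the offset."""
    return children[type_ids[i]][offsets[i]]


def union_to_list(type_ids, offsets, children):
    """Materialize all logical slots using the same access path for each one."""
    return [union_get(type_ids, offsets, children, i) for i in range(len(type_ids))]


# First encoding from the thread: repeated, non-decreasing offsets.
repeated = union_to_list(
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [["a", "b"], [0, 1]],
)

# Second encoding from the thread: non-monotonic offsets, child[0] reordered.
non_monotonic = union_to_list(
    [0, 0, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0, 0],
    [["b", "a"], [0, 1]],
)
```

Both encodings decode to the same logical values, and the lookup never depends on the offsets being ordered.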