Re: [Discuss] Extremely dubious Python equality semantics

Wes McKinney Thu, 02 Jul 2020 08:02:14 -0700

On Wed, Jul 1, 2020 at 9:52 AM Joris Van den Bossche
<jorisvandenboss...@gmail.com> wrote:
>
> I am personally fine with removing the compute dunder methods again (i.e.
> Array.__richcmp__), if that resolves the ambiguity. Although they *are*
> convenient IMO, even for developers (question might also come up if we want
> to add __add__, __sub__ etc, though). So it could also be an option to say
> that for "data structure equality", one should use the ".equals(..)"
> method, and not rely on __eq__.


Yeah, I think this is a slippery slope. My vote is to avoid the
conflict by removing the dunder methods and provide analytics
exclusively through named / non-dunder instance methods and normal
functions (e.g. stuff in pyarrow.compute).

> If we use __eq__ for data structure equality, there is still the question
> of what "null == null" should return: True or False? Although somewhat
> counter-intuitive, it should probably then return True? (given that
> "pa.array([null]) == pa.array([null])" would give True, and I suppose the
> C++ Equals method will also return True).

Right, comparing two null scalars of the same type returns true, same
with two all-null arrays of the same type

> On Wed, 1 Jul 2020 at 16:30, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > I think we need to have a hard separation between "data structure equality"
> > (do these objects contain equivalent data) and "analytical/semantic
> > equality". The latter is more the domain of pyarrow.compute and I am not
> > sure we should be overloading dunder methods with compute functions. I
> > might recommend actually that we remove the compute functions from
> > Array.__richcmp__ also
> >
> > Keep in mind that the pyarrow data structures **are not for end users**.
> > They are intended for developer use and so analytical conveniences should
> > not outweigh consistency for use by library developers.
> >
> > On Wed, Jul 1, 2020, 9:03 AM Maarten Breddels <maartenbredd...@gmail.com>
> > wrote:
> >
> > > I think that if __eq__ does not return True/False exclusively, __bool__
> > > should raise an exception to avoid these unexpected truthies. Python
> > users
> > > are used to that due to Numpy.
> > >
> > >
> > > Op wo 1 jul. 2020 om 15:40 schreef Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com>:
> > >
> > > > On Wed, 1 Jul 2020 at 09:46, Antoine Pitrou <anto...@python.org>
> > wrote:
> > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > Recent changes to PyArrow seem to have taken the stance that
> > comparing
> > > > > null values should return null.
> > > >
> > > >
> > > > Small note that it is not a *very* recent change (
> > > > https://github.com/apache/arrow/pull/5330, ARROW-6488
> > > > <https://issues.apache.org/jira/browse/ARROW-6488>), at least for
> > > scalars
> > > > (I am supposing you are talking about scalars in specific? Or also
> > about
> > > > array equality?).
> > > >
> > > >
> > > > > The problem is that it breaks the
> > > > > expectation that comparisons should return booleans, and perculates
> > > into
> > > > > crazy behaviour in other places.
> > > >
> > > >
> > > > It's certainly true that not returning a boolean from __eq__ gives a
> > > whole
> > > > set of surprising/strange behaviour, but doing so would also introduce
> > an
> > > > inconsistency with array equality (where nulls propagate in
> > comparisons).
> > > > I think this might boil down to different "expectations" about what
> > > __eq__
> > > > should do or is used for:
> > > >
> > > > 1) Be the scalar equivalent of element-wise array equality
> > > (Array.__eq__).
> > > > Since right now nulls propagate in comparisons, this would lead to
> > > > Scalar.__eq__ also to return null (and not True/False) if one of the
> > > > operands is null. The null propagation of array comparisons can also be
> > > > discussed of course.
> > > > 2) Be a general equality check of the two objects (matching the
> > > > ".equals(..)" method). For example, Array.equals considers nulls at the
> > > > same location in the array as equal.
> > > > (however, for this second case it's actually not fully clear if nulls
> > > > evaluate to True or False ..)
> > > >
> > > >  Some other inline comments about the the examples:
> > > >
> > > > Here is an example of such
> > > > > misbehaviour in the scalar refactor PR:
> > > > >
> > > > > >>> import pyarrow as pa
> > > > > >>> na = pa.scalar(None)
> > > > > >>> na == na
> > > > > <pyarrow.NullScalar: None>
> > > > >
> > > > > >>> na == 5
> > > > > <pyarrow.NullScalar: None>
> > > > >
> > > > > >>> bool(na == 5)
> > > > > True
> > > > >
> > > >
> > > > This could also be changed to raise an error instead. And that way, the
> > > > next example would also raise (requiring the user to be specific on how
> > > to
> > > > handle a scalar null: is it truthy or falsey?)
> > > >
> > > >
> > > > > >>> if na == 5: print("yo!")
> > > > > yo!
> > > > >
> > > > > >>> na in [5]
> > > > > True
> > > > >
> > > > > But you can see it also with arrays containing null values:
> > > > >
> > > > > >>> pa.array([1, None]) in [pa.scalar(42)]
> > > > > True
> > > > >
> > > > > Note that this one doesn't follow from the scalar behaviour (the same
> > > can
> > > > be seen on master), but from the Array equality returning a boolean
> > array
> > > > instead of a single boolean value (I changed this recently in
> > > > https://github.com/apache/arrow/pull/7273, also up for discussion of
> > > > course).
> > > > This is because we let any array evaluate to True:
> > > >
> > > > >>> bool(pa.array([False]))
> > > > True
> > > >
> > > > which we might want to change to raising an error similarly as numpy
> > does
> > > > if we keep the element-wise __eq__.
> > > >
> > > > I think that Python equality operators should behave in a
> > > > > Python-sensible way (return True or False).  Have people call another
> > > > > method if they like the fancy (or noxious, depending on the POV)
> > > > > semantics of returning null when comparing null with anything.
> > > > >
> > > >
> > > > It would certainly be an option to reserve __eq__ to general object
> > > > equality (always True/False), and have users call an explicit compute
> > > > function for one of the element-wise comparison operations (eg
> > > > pyarrow.compute.equals).
> > > >
> > > >
> > > > >
> > > > > (note that Numpy doesn't have null scalars, so it can be less
> > > > > conservative in its customization of equality methods)
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > >
> > >
> >

Re: [Discuss] Extremely dubious Python equality semantics

Reply via email to