Yimche commented on issue #43985:
URL: https://github.com/apache/arrow/issues/43985#issuecomment-2424838714

   > I would expect the result to be the same whether the Python object is the 
same or if the object is a copy
   
   I think I see where you're coming from, and I think you're correct. If my 
understanding of the equality function is correct, this is what is happening:
   ```Python
       def equals(self, Table other, bint check_metadata=False):
           self._assert_cpu()
           if other is None:
               return False
   
           cdef:
               CTable* this_table = self.table
               CTable* other_table = other.table
               c_bool result
   
           with nogil:
               result = this_table.Equals(deref(other_table), check_metadata)
   
           return result
   ```
   At the C++ level, since the table is compared against itself, and by 
extension against the same "NaN" data, it comes out as equal. The copy 
mentioned previously wasn't a copy of the pointer but a copy by value, so the 
two tables hold separate NaN values. As far as I know, a NaN never compares 
equal to another NaN under IEEE 754, hence the failed comparison. If we 
instead copy by reference (i.e. make table_2 point to the same memory), the 
comparison passes (but note this is just restating an equals to self).
   ```Python
   >>> import pyarrow as pa
   >>> table_1 = pa.Table.from_pydict({"foo": [float("nan")]})
   >>> table_2 = pa.Table.from_pydict({"foo": [float("nan")]})
   >>> table_1 == table_2
   False
   >>> table_2 = table_1
   >>> table_1 == table_2
   True
   ```
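   The same split between "same object" and "equal value" can be reproduced 
in plain Python, without pyarrow. This is only an analogy, not a description 
of Arrow's internals: Python's list equality checks identity before falling 
back to `==`, which gives the same pattern as above. The variable names are 
made up for illustration.

```python
nan = float("nan")

# IEEE 754: NaN is unordered, so it never compares equal, not even to itself.
self_equal = (nan == nan)

# List equality tries identity first, so two lists holding the *same* NaN
# object compare equal.
same_object = ([nan] == [nan])

# Two separately created NaN objects fail the identity check and fall
# through to ==, which is False for NaN.
distinct_objects = ([float("nan")] == [float("nan")])

print(self_equal, same_object, distinct_objects)  # False True False
```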
   I will note that when looking for similar situations I noticed some 
inconsistencies in how other libraries deal with such an edge case with NaNs:
   ```Python
   >>> import numpy as np
   >>> a1 = np.array([float("nan")])
   >>> a2 = np.array([float("nan")])
   >>> np.array_equal(a1, a2)
   False
   >>> np.array_equal(a1, a1)
   False
   ```
   and
   ```Python
   >>> import pandas as pd
   >>> table1 = pd.DataFrame([[float("nan")]])
   >>> table2 = pd.DataFrame([[float("nan")]])
   >>> table1.equals(table2)
   True
   >>> table1.equals(table1)
   True
   ```
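   To make the two behaviours concrete, here is a minimal pure-Python sketch 
of an element-wise comparison with a switch for NaN handling. 
`columns_equal` and its `nan_equal` flag are hypothetical names for 
illustration, not an existing pyarrow, NumPy, or pandas API. NumPy exposes a 
similar switch as the `equal_nan` argument of `np.array_equal`.

```python
import math


def columns_equal(left, right, nan_equal=False):
    """Compare two sequences of floats element by element (hypothetical helper)."""
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if nan_equal and math.isnan(a) and math.isnan(b):
            continue  # treat NaN in the same position as a match
        if a != b:
            return False
    return True


col_1 = [1.0, float("nan")]
col_2 = [1.0, float("nan")]

strict = columns_equal(col_1, col_2)                   # NumPy-like default
lenient = columns_equal(col_1, col_2, nan_equal=True)  # pandas-like
reflexive_strict = columns_equal(col_1, col_1)         # no identity shortcut

print(strict, lenient, reflexive_strict)  # False True False
```

   With the strict setting, a column does not even equal itself, which is the 
NumPy result above. With `nan_equal=True`, the result matches what pandas 
reports.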
   So this feels like something a pyarrow maintainer or the wider community 
needs to decide: what the "correct" behaviour for the equality comparison 
should be.


