Re: [I] [C++][Python] Pyarrow.Table.join() breaks on large tables v.18.0.0.dev486 [arrow]

via GitHub Thu, 19 Dec 2024 00:12:00 -0800


zanmato1984 commented on issue #44513:
URL: https://github.com/apache/arrow/issues/44513#issuecomment-2553027209


   > 1. apply filter ID_DEV_STYLECOLOR_SIZE = 88506230299 and ID_DEPARTMENT = 
16556030299. It should return 2 in PL_VALUE column.
   
   Correct:
   ```
   >>> cond = pc.and_(pc.equal(large['ID_DEV_STYLECOLOR_SIZE'], 88506230299), pc.equal(large['ID_DEPARTMENT'], 16556030299))
   >>> filtered = large.filter(cond)
   >>> print(filtered)
   pyarrow.Table
   ID_DEV_STYLECOLOR_SIZE: int64
   ID_DEPARTMENT: int64
   ID_COLLECTION: int64
   PL_VALUE: int64
   ----
   ID_DEV_STYLECOLOR_SIZE: [[88506230299]]
   ID_DEPARTMENT: [[16556030299]]
   ID_COLLECTION: [[11240299]]
   PL_VALUE: [[2]]
   >
   ```
   > 2. Apply sum(PL_VALUE) and it should return 58360744
   
   No:
   ```
   >>> sum = pc.sum(large['PL_VALUE'])
   >>> print(sum)
   461379027
   ```
   
   > That's just to eliminate 'false positive'. I mentioned that I tested on 
different versions and it sometimes caused a silent wrong answer even though 
there were no seg.fault.
   
   Hmm, I think we should focus only on v18.1.0. As I mentioned, there have been 
a lot of fixes since those earlier versions, so the behavior in prior versions 
will certainly vary, and I think most of the issues (if not all) are already addressed.
   
   > If all above is correct, might the segfault error be caused by any 
system/os settings?
   
   I also tested on my Intel MBP (I just realized that we have an 
x86-specialized SIMD code path for hash join, so I wanted to see whether the issue 
was there), but I was still unable to reproduce it. And nothing in your setup 
seems to be particularly related to this issue.
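
Since the hash join has an x86-specific SIMD path, it can help to compare the CPU and SIMD details of the two environments side by side. The following is a minimal sketch of how that information could be gathered, not an official diagnostic script. The helper name `collect_env_info` is hypothetical, and the `pyarrow.runtime_info()` call is guarded because it is an assumption that every build exposes it.

```python
import platform
import sys


def collect_env_info():
    """Gather basic details that help compare two environments."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),      # e.g. x86_64 or arm64
        "processor": platform.processor(),  # may be empty on some systems
        "pyarrow": None,
        "simd_level": None,
    }
    try:
        import pyarrow as pa
    except ImportError:
        return info  # pyarrow is not installed; report what we have

    info["pyarrow"] = pa.__version__
    # runtime_info() reports the SIMD level Arrow uses at runtime.
    # Guarded with getattr in case a given build does not expose it.
    runtime_info = getattr(pa, "runtime_info", None)
    if runtime_info is not None:
        info["simd_level"] = runtime_info().simd_level
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

Posting this output from both machines would make it easier to see whether the two runs go through the same code path.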
   
   To proceed with the debugging:
   1. Did you run my Python script in your environment to see whether it segfaults? 
(If it doesn't, would you kindly help adjust it so that the segfault 
happens?) I think this is essential, because we need to agree on a minimal 
reproducible case (in at least one of our environments). Then I can ask other 
people to help verify it on a broader range of environments.
   2. Would you help confirm the difference between `sum(PL_VALUE)` in my run 
(`461379027`) and yours (`58360744`)?
   3. What is your CPU model?
   4. In your original run that segfaulted (again, on v18.1.0), is the segfault 
always reproducible, or does it happen only occasionally?
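
One way to answer the last question is to run the reproduction script several times in fresh processes and count how many runs die with a segfault. This is a minimal sketch under stated assumptions: `repro.py` is a hypothetical file name for the script, `count_crashes` is an illustrative helper, and the return-code check applies to POSIX systems only (Windows reports access violations differently).

```python
import signal
import subprocess
import sys


def count_crashes(cmd, runs=10):
    """Run cmd repeatedly and return (segfault_count, return_codes)."""
    return_codes = []
    for _ in range(runs):
        # Use a fresh process per run so a crash cannot take down this driver.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return_codes.append(result.returncode)
    # On POSIX, a process killed by a signal has a negative return code.
    segv = -int(signal.SIGSEGV)
    crashes = sum(1 for code in return_codes if code == segv)
    return crashes, return_codes


if __name__ == "__main__":
    # "repro.py" is a hypothetical name for the reproduction script.
    crashes, codes = count_crashes([sys.executable, "repro.py"], runs=10)
    print(f"segfaults: {crashes} of {len(codes)} runs; return codes: {codes}")
```

If every run crashes, the problem is deterministic. If only some runs crash, that points toward something timing- or memory-layout-dependent.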
   
   Debugging this kind of issue is tricky and takes time and communication. I 
really appreciate your patience, @kolfild26. Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.