zanmato1984 commented on issue #44513: URL: https://github.com/apache/arrow/issues/44513#issuecomment-2553027209
> 1. apply filter ID_DEV_STYLECOLOR_SIZE = 88506230299 and ID_DEPARTMENT = 16556030299. It should return 2 in PL_VALUE column.

Correct:
```
>>> cond = pc.and_(pc.equal(large['ID_DEV_STYLECOLOR_SIZE'], 88506230299), pc.equal(large['ID_DEPARTMENT'], 16556030299))
>>> filtered = large.filter(cond)
>>> print(filtered)
pyarrow.Table
ID_DEV_STYLECOLOR_SIZE: int64
ID_DEPARTMENT: int64
ID_COLLECTION: int64
PL_VALUE: int64
----
ID_DEV_STYLECOLOR_SIZE: [[88506230299]]
ID_DEPARTMENT: [[16556030299]]
ID_COLLECTION: [[11240299]]
PL_VALUE: [[2]]
```

> 2. Apply sum(PL_VALUE) and it should return 58360744

No:
```
>>> sum = pc.sum(large['PL_VALUE'])
>>> print(sum)
461379027
```

> That's just to eliminate 'false positive'. I mentioned that I tested on different versions and it sometimes caused a silent wrong answer even though there were no seg.fault.

Hmm, I think we should only focus on v18.1.0. As I mentioned, there are a lot of fixes ever since, so the behavior in prior versions will vary for sure, and I think most of the issues (if not all) are already addressed.

> If all above is correct, might the segfault error be caused by any system/os settings?

I also verified on my Intel MBP (I just realized that we have x86-specialized SIMD code path for hash join so I wanted to see if the issue was there), but still unable to reproduce. And your setup doesn't seem to have any particular thing to do with this issue.

To proceed with the debugging:

1. Did you run my python script on your env to see if it runs into segfault? (And in case it doesn't, would you kindly help to fix it to make the segfault happen?) I think this is quite essential, because we need to agree on a minimal reproducible case (at least on either env of us). Then I can ask some other people to help verifying on broader environments.
2. Would you help to confirm the difference of `sum(PL_VALUE)` in my run (`461379027`) against yours (`58360744`)?
3. What is your CPU model?
4. In your original run of segfault (again, on v18.1.0), is it always reproducible or by chance?

Debugging this kind of issue is tricky and takes time and communication. I really appreciate your patience @kolfild26 , thank you!
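
For the CPU model and version questions above, a small script along these lines could collect the relevant details in one place. This is only a sketch using the Python standard library: `collect_env_report` is a hypothetical helper name, not part of pyarrow, and `platform.processor()` may return an empty or generic string on some systems (on Linux, `/proc/cpuinfo` is usually more informative).

```python
import platform
import sys
from importlib import metadata


def collect_env_report(package="pyarrow"):
    """Return a dict describing the interpreter, OS, CPU, and package version.

    Hypothetical helper for bug reports; not part of pyarrow.
    """
    try:
        pkg_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        pkg_version = "not installed"
    return {
        "python": sys.version.split()[0],
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        # May be empty on some platforms, so fall back to a placeholder.
        "processor": platform.processor() or "unknown",
        f"{package}_version": pkg_version,
    }


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Pasting the printed output into the issue would make it easier to compare environments across the runs that do and do not reproduce the segfault.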
