raulcd commented on issue #46777:
URL: https://github.com/apache/arrow/issues/46777#issuecomment-2966035378

   Sorry, I missed the `write.py` script in the first subfolder. I was able 
to reproduce the issue locally by comparing Arrow 17 and Arrow 18. After 
bisecting, I tracked the regression down to this specific commit 
https://github.com/apache/arrow/commit/44b72d5c2518b7dc70b67b588432fb06ea3896c7 
which is related to the `is_in` functionality. The PR:
   - https://github.com/apache/arrow/pull/43761
   
   The previous commit does not have the issue: 
https://github.com/apache/arrow/commit/a87a8e0efe1650b01ac85f7a7331ccfcffc088a2
   
   pyarrow built locally at commit 44b72d5c2518b7dc70b67b588432fb06ea3896c7 
has the issue:
   ```
   [((44b72d5c25...))]$ python read.py 
   === PYARROW VERSION 17 ===
   Retrieved 10,000,000 rows in 115.71 seconds.
   ```
   pyarrow built locally at commit a87a8e0efe1650b01ac85f7a7331ccfcffc088a2 
does not have the issue:
   ```
   [((a87a8e0efe...))]$ python read.py 
   === PYARROW VERSION 17 ===
   Retrieved 10,000,000 rows in 3.22 seconds.
   ```
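
For reference, a quick check of what the two timings above imply. The only inputs are the numbers printed by `read.py`; the variable names are purely illustrative:

```python
# Timings copied from the two read.py runs above.
slow_seconds = 115.71  # build at 44b72d5c25 (has the issue)
fast_seconds = 3.22    # build at a87a8e0efe (does not have the issue)
rows = 10_000_000

# Ratio of the two wall-clock times, and rows retrieved per second for each build.
slowdown = slow_seconds / fast_seconds
fast_rate = rows / fast_seconds
slow_rate = rows / slow_seconds

print(f"slowdown: {slowdown:.1f}x")
print(f"throughput: {fast_rate:,.0f} rows/s vs {slow_rate:,.0f} rows/s")
```

These are single runs on one machine, so the ratio is only an approximate indication of the size of the regression.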
   
   @bkietz @mapleFU @pitrou thoughts? This is roughly a 36x slowdown 
(115.71 s vs. 3.22 s) on this filter expression:
   ```
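   from time import perf_counter

   import pyarrow.dataset as ds

   # `dataset` is assumed to be the pyarrow dataset opened elsewhere in read.py (not shown).
   population = list(range(1, 10_000_001))
   filter_expr = ds.field("unit_id").isin(population)

   start = perf_counter()
   table = dataset.to_table(filter=filter_expr)
   end = perf_counter()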
   ```

