raulcd commented on issue #46777: URL: https://github.com/apache/arrow/issues/46777#issuecomment-2966035378
Sorry, I missed the `write.py` script in the first subfolder. I've been able to reproduce the issue locally by comparing Arrow 17 and Arrow 18. After bisecting, I tracked the regression down to this specific commit, which is related to the `is_in` functionality:
- https://github.com/apache/arrow/commit/44b72d5c2518b7dc70b67b588432fb06ea3896c7

The PR:
- https://github.com/apache/arrow/pull/43761

The previous commit doesn't have the issue:
- https://github.com/apache/arrow/commit/a87a8e0efe1650b01ac85f7a7331ccfcffc088a2

pyarrow built locally at commit 44b72d5c2518b7dc70b67b588432fb06ea3896c7 has the issue:
```
[((44b72d5c25...))]$ python read.py
=== PYARROW VERSION 17 ===
Retrieved 10,000,000 rows in 115.71 seconds.
```

pyarrow built locally at commit a87a8e0efe1650b01ac85f7a7331ccfcffc088a2 does not have the issue:
```
[((a87a8e0efe...))]$ python read.py
=== PYARROW VERSION 17 ===
Retrieved 10,000,000 rows in 3.22 seconds.
```

@bkietz @mapleFU @pitrou thoughts? This is roughly a 36x slowdown (115.71 s vs. 3.22 s) on this filter expression:
```
population = [i for i in range(1, 10_000_001)]
filter_expr = ds.field("unit_id").isin(population)
start = perf_counter()
table = dataset.to_table(filter=filter_expr)
end = perf_counter()
```
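For anyone who wants to repeat the bisect, here is a minimal sketch of a timing predicate that a `git bisect run` script could use. It is not the script I ran: the helper names (`time_call`, `bisect_exit_code`, `slowdown`) and the 30-second threshold are hypothetical, picked only so that the cut-off falls between the two timings reported above.

```python
from time import perf_counter
from typing import Callable

# Hypothetical cut-off in seconds, chosen to sit between the 3.22 s (good)
# and 115.71 s (bad) timings reported above; adjust for your machine.
THRESHOLD_SECONDS = 30.0


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = perf_counter()
    fn()
    return perf_counter() - start


def bisect_exit_code(elapsed: float, threshold: float = THRESHOLD_SECONDS) -> int:
    """Map a timing to a `git bisect run` exit code: 0 = good, 1 = bad."""
    return 0 if elapsed < threshold else 1


def slowdown(bad_seconds: float, good_seconds: float) -> float:
    """Ratio of the regressed timing to the baseline timing."""
    return bad_seconds / good_seconds


# Example usage in a bisect script (the workload here is a placeholder):
#   elapsed = time_call(lambda: dataset.to_table(filter=filter_expr))
#   sys.exit(bisect_exit_code(elapsed))
```

With the timings above, `slowdown(115.71, 3.22)` comes out at about 35.9. Each bisect step would also need pyarrow rebuilt at the checked-out commit before timing, which this sketch leaves out.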