GitHub user thisisnic added a comment to the discussion: how to debug
arrow/dplyr to consider a bug report?
One useful thing to try at this point is working out whether the discrepancy
lives in the R bindings to the Arrow C++ library or in the Arrow C++ library
itself. In the case of the former, I'll dig into it more myself, but in the
case of the latter, I might choose to ask someone more familiar with it to
help. One way to work this out is to run the equivalent PyArrow code. Both R
and Python provide bindings to the same C++ library, so if they give different
results, the issue is most likely in the R bindings, and if they agree, it is
most likely in the C++ library.
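As a rough sketch of that reasoning, here it is as a tiny helper. The function name and the returned labels are my own invention for illustration, not anything from Arrow, and the outcome is only a hint about where to look first:

```python
def locate_discrepancy(r_result, pyarrow_result, expected):
    """Guess which layer to inspect first (hypothetical helper, made-up labels)."""
    if r_result != pyarrow_result:
        # Same C++ library underneath, different answers: suspect the bindings.
        return "r-bindings"
    if r_result != expected:
        # Both bindings agree with each other but not with the expectation:
        # the shared C++ layer (or the data itself) is the next place to look.
        return "cpp-or-data"
    return "no-discrepancy"
```

Here `expected` is whatever count you believe is correct from an independent source, for example a count done in another tool.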
I asked ChatGPT for the Python equivalent of this R snippet:
```r
library(arrow)
library(dplyr)

full_papers <- open_dataset(
  'data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet',
  format = 'parquet'
)

full_papers |>
  filter(published_year < 1990) |>
  collect() |>
  nrow()
```
and got this:
```py
import pyarrow.dataset as ds

# Load dataset
full_papers = ds.dataset(
    'data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet',
    format='parquet',
)

# Filter and count rows
full_papers.to_table(filter=ds.field("published_year") < 1990).num_rows
```
which gave me the result:
```
0
```
And just to check things looked the same, I also tried the following Python:
```py
full_papers.to_table(filter=ds.field("published_year") >= 1990).num_rows
```
which returned
```
62421
```
Given that this matches what you found in R, it looks like this is happening at
the C++ level.
GitHub link:
https://github.com/apache/arrow/discussions/46383#discussioncomment-13119345