GitHub user jameshowison closed a discussion: how to debug arrow/dplyr to consider a bug report?
We are seeing unexpected behavior when using dplyr `filter` with arrow. The problem seems to center on a less-than filter that works on in-memory data but does not work on a dataset opened with `open_dataset`.

We asked about this on Stack Overflow: https://stackoverflow.com/questions/79607580/how-to-properly-use-less-than-in-a-dplyr-filter-of-a-sharded-arrow-dataset#comment140408196_79607580

I've also created a test dataset and code at https://github.com/softcite/softcite-extractions-parquet-analysis; the queries are in https://github.com/softcite/softcite-extractions-parquet-analysis/blob/main/analysis/queries_on_parquet.qmd.

I have no idea whether this points to a bug, so I don't want to open an issue yet. I didn't think the Posit forums would help, since I believe the arrow/parquet versions of the dplyr verbs are implemented here. But I also don't know how to debug this further, so any guidance would be appreciated. If I can debug it further and it does look like a bug, I'll try to create a smaller dataset that shows the behavior (there is one in the GitHub repo above, but it is too large).

Thanks!
James

GitHub link: https://github.com/apache/arrow/discussions/46383
