iabhi4 opened a new issue, #46589:
URL: https://github.com/apache/arrow/issues/46589
### Describe the bug, including details regarding any error messages, version, and platform.
### Description
The `utf8_is_digit` kernel in `pyarrow.compute` does not fully replicate
Python's `str.isdigit()` behavior, especially with certain Unicode digit
characters.
For example, the character `'³'` (U+00B3 SUPERSCRIPT THREE) returns `True`
with Python’s `str.isdigit()` but returns `False` when passed to
`pyarrow.compute.utf8_is_digit`.
This divergence leads to downstream inconsistencies, particularly in pandas
when using `StringDtype(storage="pyarrow")`.
---
### Reproduction
```python
import pyarrow as pa
import pyarrow.compute as pc
arr = pa.array(['3', '٣', '५', '123', '³'])
print(pc.utf8_is_digit(arr).to_pylist())
```
**Output:**
```
[True, True, True, True, False] # <-- '³' incorrectly returns False
```
**Expected Output (matches `str.isdigit()`):**
```
[True, True, True, True, True]
```
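The expected list can be derived from plain Python, without PyArrow, by applying `str.isdigit()` to the same sample strings. This is only a minimal sketch of the reference behavior; the names `samples` and `expected` are illustrative and not part of any API.

```python
# Reference behavior: apply Python's built-in str.isdigit() to the same inputs
# used in the reproduction above.
samples = ['3', '٣', '५', '123', '³']
expected = [s.isdigit() for s in samples]
print(expected)
```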
---
### Notes
- The issue seems to stem from the implementation of
`IsDigitUnicode::PredicateCharacterAll` not including characters in the Unicode
"No" (Number, Other) category, such as superscript digits (`³`, `²`, etc.).
- Python's behavior can be verified as:
```python
print("³".isdigit()) # True
```
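To see how the sample characters are classified on the Python side, the standard `unicodedata` module can be used. The sketch below makes no claim about Arrow's internals. The helper `describe` is hypothetical and exists only for this illustration, and `'½'` is an extra character added here for contrast.

```python
import unicodedata


def describe(ch: str) -> tuple[str, bool, bool]:
    """Return (general category, str.isdecimal(), str.isdigit()) for one character."""
    return unicodedata.category(ch), ch.isdecimal(), ch.isdigit()


# '½' is not in the original reproduction; it is included as a contrast case.
for ch in ["3", "٣", "५", "³", "½"]:
    category, is_decimal, is_digit = describe(ch)
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"category={category}, isdecimal={is_decimal}, isdigit={is_digit}"
    )
```

In CPython, `'³'` and `'½'` both report category `No`, but only `'³'` satisfies `str.isdigit()`. Accepting the whole "No" category would therefore over-match; matching Python would likely require checking the Unicode `Numeric_Type` property (Digit or Decimal) rather than the general category alone.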
---
### Impact
This affects pandas string operations like `.str.isdigit()` when using
`pyarrow` storage. With Python-backed string storage the result matches
`str.isdigit()`, but with pyarrow-backed storage characters like `'³'`
return `False`.
---
### System Info
Tested with:
- PyArrow 20.0.0 (pip-installed)
- PyArrow `main` 0.1.dev17578+g218c886
- Python 3.12
- Debian-based Linux (Ubuntu)
### Component(s)
Python