pearu commented on code in PR #50146:
URL: https://github.com/apache/arrow/pull/50146#discussion_r3454410904
##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1164,6 +1164,65 @@ def test_dates(self):
'b': [datetime(1970, 1, 2), datetime(1971, 1, 2)],
}
+ def test_dates_with_timestamp_parsers(self):
+ # Custom timestamp parsers are used as a fallback for explicitly-typed
+ # date columns (GH-28303, GH-41488)
+ rows = b"a,b\n15/10/2015,1970-01-02\n18/06/1990,15/10/2015\n"
+ opts = ConvertOptions()
+ opts.column_types = {'a': pa.date32(), 'b': pa.date64()}
+ with pytest.raises(pa.ArrowInvalid,
+ match="CSV conversion error to date"):
+ self.read_bytes(rows, convert_options=opts)
+
+ opts.timestamp_parsers = ['%d/%m/%Y']
+ table = self.read_bytes(rows, convert_options=opts)
+ schema = pa.schema([('a', pa.date32()),
+ ('b', pa.date64())])
+ assert table.schema == schema
+ # ISO values keep working alongside custom-parsed ones
+ assert table.to_pydict() == {
+ 'a': [date(2015, 10, 15), date(1990, 6, 18)],
+ 'b': [date(1970, 1, 2), date(2015, 10, 15)],
+ }
+
+ # Month names are parsed case-insensitively
+ rows = b"a\n15-OCT-15\n18-Jun-90\n"
+ opts = ConvertOptions(column_types={'a': pa.date32()},
+ timestamp_parsers=['%d-%b-%y'])
+ table = self.read_bytes(rows, convert_options=opts)
+ assert table.to_pydict() == {
+ 'a': [date(2015, 10, 15), date(1990, 6, 18)],
+ }
Review Comment:
Thanks — I checked this and the tests are deterministic as written, so I've
left them unchanged.
Neither the test binary nor pyarrow adopts the environment's `LC_TIME`:
there is no `setlocale(LC_ALL/LC_TIME, "")` anywhere in Arrow's C++ (outside
vendored code) or in pyarrow, and CPython coerces only `LC_CTYPE` at startup,
never `LC_TIME`. So a process started with a non-English `LC_TIME` in the
environment still runs `strptime` in the `C` locale. On glibc there is a second
reason: `strptime`'s `%b`/`%B` keeps the C-locale (English) month names as a
fallback even under a non-English locale, so English abbreviations parse
regardless (`setlocale(LC_TIME, "fr_FR.UTF-8")` followed by parsing
`"15-JUL-15"` still succeeds).
<details>
<summary>Minimal check: CSV <code>%b</code> parsing is independent of the
environment locale</summary>
Each child process below is started with a non-English `LC_ALL`/`LC_TIME` in
its environment — exactly the scenario in the comment — and still parses the
English month abbreviation correctly:
```python
import os
import subprocess
import sys
CHILD = r"""
import os
import pyarrow as pa
from pyarrow import csv
from datetime import date
data = b"a\n15-JUL-15\n" # English abbreviated month name "JUL"
opts = csv.ConvertOptions(column_types={"a": pa.date32()},
timestamp_parsers=["%d-%b-%y"])
got = csv.read_csv(pa.py_buffer(data), convert_options=opts).to_pydict()
assert got == {"a": [date(2015, 7, 15)]}, got
print("OK with LC_ALL=%-14r ->" % os.environ.get("LC_ALL"), got)
"""
for lc in ("C", "fr_FR.UTF-8", "de_DE.UTF-8"):
env = dict(os.environ, LC_ALL=lc, LC_TIME=lc)
subprocess.run([sys.executable, "-c", CHILD], env=env, check=True)
```
Output (identical whether or not those locales are actually installed):
```
OK with LC_ALL='C' -> {'a': [datetime.date(2015, 7, 15)]}
OK with LC_ALL='fr_FR.UTF-8' -> {'a': [datetime.date(2015, 7, 15)]}
OK with LC_ALL='de_DE.UTF-8' -> {'a': [datetime.date(2015, 7, 15)]}
```
</details>
A genuinely locale-dependent case does remain — an application that itself
calls `setlocale(LC_ALL, "")` under a non-English locale, on a libc whose
`strptime` lacks that English fallback — but that is pre-existing (it affects
`timestamp` columns too) and out of scope here. Happy to file a follow-up issue
if that is worth tracking.
##########
cpp/src/arrow/csv/converter.cc:
##########
@@ -440,6 +441,13 @@ struct SingleParserTimestampValueDecoder : public
ValueDecoder {
const TimestampParser& parser_;
};
+std::vector<const TimestampParser*> GetTimestampParsers(const ConvertOptions&
options) {
+ std::vector<const TimestampParser*>
parsers(options.timestamp_parsers.size());
+ std::ranges::transform(options.timestamp_parsers, parsers.begin(),
+ [](const auto& parser) { return parser.get(); });
+ return parsers;
+}
Review Comment:
Good catch — reverted to a plain index loop (matching the sibling
`MultipleParsersTimestampValueDecoder`), since the codebase guards
`std::ranges` use behind `__cpp_lib_ranges` for partial-C++20 toolchains. Also
dropped the now-unused `<algorithm>` include.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]