pearu commented on code in PR #50146:
URL: https://github.com/apache/arrow/pull/50146#discussion_r3454410904


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1164,6 +1164,65 @@ def test_dates(self):
             'b': [datetime(1970, 1, 2), datetime(1971, 1, 2)],
         }
 
+    def test_dates_with_timestamp_parsers(self):
+        # Custom timestamp parsers are used as a fallback for explicitly-typed
+        # date columns (GH-28303, GH-41488)
+        rows = b"a,b\n15/10/2015,1970-01-02\n18/06/1990,15/10/2015\n"
+        opts = ConvertOptions()
+        opts.column_types = {'a': pa.date32(), 'b': pa.date64()}
+        with pytest.raises(pa.ArrowInvalid,
+                           match="CSV conversion error to date"):
+            self.read_bytes(rows, convert_options=opts)
+
+        opts.timestamp_parsers = ['%d/%m/%Y']
+        table = self.read_bytes(rows, convert_options=opts)
+        schema = pa.schema([('a', pa.date32()),
+                            ('b', pa.date64())])
+        assert table.schema == schema
+        # ISO values keep working alongside custom-parsed ones
+        assert table.to_pydict() == {
+            'a': [date(2015, 10, 15), date(1990, 6, 18)],
+            'b': [date(1970, 1, 2), date(2015, 10, 15)],
+        }
+
+        # Month names are parsed case-insensitively
+        rows = b"a\n15-OCT-15\n18-Jun-90\n"
+        opts = ConvertOptions(column_types={'a': pa.date32()},
+                              timestamp_parsers=['%d-%b-%y'])
+        table = self.read_bytes(rows, convert_options=opts)
+        assert table.to_pydict() == {
+            'a': [date(2015, 10, 15), date(1990, 6, 18)],
+        }

Review Comment:
   Thanks — I checked this and the tests are deterministic as written, so I've 
left them unchanged.
   
   Neither the test binary nor pyarrow adopts the environment's `LC_TIME`: 
there is no `setlocale(LC_ALL/LC_TIME, "")` anywhere in Arrow's C++ (outside 
vendored code) or in pyarrow, and CPython coerces only `LC_CTYPE` at startup, 
never `LC_TIME`. So a process started with a non-English `LC_TIME` in the 
environment still runs `strptime` in the `C` locale. On glibc there is a second 
reason: `strptime`'s `%b`/`%B` keeps the C-locale (English) month names as a 
fallback even under a non-English locale, so English abbreviations parse 
regardless (`setlocale(LC_TIME, "fr_FR.UTF-8")` followed by parsing 
`"15-JUL-15"` still succeeds).
   
   <details>
   <summary>Minimal check: CSV <code>%b</code> parsing is independent of the 
environment locale</summary>
   
   Each child process below is started with a non-English `LC_ALL`/`LC_TIME` in 
its environment — exactly the scenario in the comment — and still parses the 
English month abbreviation correctly:
   
   ```python
   import os
   import subprocess
   import sys
   
   CHILD = r"""
   import os
   import pyarrow as pa
   from pyarrow import csv
   from datetime import date
   
   data = b"a\n15-JUL-15\n"          # English abbreviated month name "JUL"
   opts = csv.ConvertOptions(column_types={"a": pa.date32()},
                             timestamp_parsers=["%d-%b-%y"])
   got = csv.read_csv(pa.py_buffer(data), convert_options=opts).to_pydict()
   assert got == {"a": [date(2015, 7, 15)]}, got
   print("OK with LC_ALL=%-14r ->" % os.environ.get("LC_ALL"), got)
   """
   
   for lc in ("C", "fr_FR.UTF-8", "de_DE.UTF-8"):
       env = dict(os.environ, LC_ALL=lc, LC_TIME=lc)
       subprocess.run([sys.executable, "-c", CHILD], env=env, check=True)
   ```
   
   Output (identical whether or not those locales are actually installed):
   
   ```
   OK with LC_ALL='C'            -> {'a': [datetime.date(2015, 7, 15)]}
   OK with LC_ALL='fr_FR.UTF-8'  -> {'a': [datetime.date(2015, 7, 15)]}
   OK with LC_ALL='de_DE.UTF-8'  -> {'a': [datetime.date(2015, 7, 15)]}
   ```
   
   </details>
   
   A genuinely locale-dependent case does remain — an application that itself 
calls `setlocale(LC_ALL, "")` under a non-English locale, on a libc whose 
`strptime` lacks that English fallback — but that is pre-existing (it affects 
`timestamp` columns too) and out of scope here. Happy to file a follow-up issue 
if that is worth tracking.



##########
cpp/src/arrow/csv/converter.cc:
##########
@@ -440,6 +441,13 @@ struct SingleParserTimestampValueDecoder : public 
ValueDecoder {
   const TimestampParser& parser_;
 };
 
+std::vector<const TimestampParser*> GetTimestampParsers(const ConvertOptions& 
options) {
+  std::vector<const TimestampParser*> 
parsers(options.timestamp_parsers.size());
+  std::ranges::transform(options.timestamp_parsers, parsers.begin(),
+                         [](const auto& parser) { return parser.get(); });
+  return parsers;
+}

Review Comment:
   Good catch — reverted to a plain index loop (matching the sibling 
`MultipleParsersTimestampValueDecoder`), since the codebase guards 
`std::ranges` use behind `__cpp_lib_ranges` for partial-C++20 toolchains. Also 
dropped the now-unused `<algorithm>` include.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to