davlee1972 opened a new issue, #45936:
URL: https://github.com/apache/arrow/issues/45936
### Describe the enhancement requested
This affects pyarrow.compute.cast() and reading pyarrow datasets with
explicit schemas for text and JSON files.
It also affects SQL result sets that return string values for date/datetime
columns (tested using ADBC).
Can we add string to date32 and date64 conversions that strip the
time portion from YYYY-MM-DD HH:MM:SS.ffff?
For JSON, YYYY-MM-DD can only be converted to timestamp[s]. This is
inconsistent with the CSV reader, which by default converts YYYY-MM-DD into
date32.
```
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.dataset as ds
>>>
>>> # This works
>>> today = pa.scalar('2025-03-24')
>>> pc.cast(today, "date32")
<pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
>>> pc.cast(today, "timestamp[s]").cast("date32")
<pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
>>>
>>> # A string with a time portion fails to cast directly; it only works if you cast to timestamp first
>>> today = pa.scalar('2025-03-24 00:00:00')
>>> pc.cast(today, "date32")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u1/leed/miniconda3/lib/python3.9/site-packages/pyarrow/compute.py", line 405, in cast
    return call_function("cast", [arr], options, memory_pool)
  File "pyarrow/_compute.pyx", line 598, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed to parse string: '2025-03-24 00:00:00' as a scalar of type date32[day]
>>> pc.cast(today, "timestamp[s]").cast("date32")
<pyarrow.Date32Scalar: datetime.date(2025, 3, 24)>
>>>
>>> # The CSV reader also can't parse strings with a 00:00:00 time portion as date32
>>> with open("test.csv", "w") as f:
...     f.write("today\n")
...     f.write("2025-03-24 00:00:00\n")
...     f.write("2025-03-24 00:00:00\n")
...     f.write("2025-03-24 00:00:00\n")
...
6
20
20
20
>>> text_dataset = ds.dataset("test.csv", format="csv",
...     schema=pa.schema([pa.field("today", "date32")]))
>>> text_dataset.schema
today: date32[day]
>>> text_dataset.head(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 730, in pyarrow._dataset.Dataset.head
  File "pyarrow/_dataset.pyx", line 3911, in pyarrow._dataset.Scanner.head
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open CSV input source 'test.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to date32[day]: invalid value '2025-03-24 00:00:00'
>>>
>>> # The JSON reader fails with a date32 schema; without a schema it infers timestamp[s]
>>> with open("test.json", "w") as f:
...     f.write('{"today": "2025-03-24"}\n')
...     f.write('{"today": "2025-03-24"}\n')
...     f.write('{"today": "2025-03-24"}\n')
...
24
24
24
>>> json_dataset = ds.dataset("test.json", format="json",
...     schema=pa.schema([pa.field("today", "date32")]))
>>> json_dataset.schema
today: date32[day]
>>> json_dataset.head(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 730, in pyarrow._dataset.Dataset.head
  File "pyarrow/_dataset.pyx", line 3911, in pyarrow._dataset.Scanner.head
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open JSON input source 'test.json': Invalid: JSON parse error: Column(/today) changed from number to string in row 0
>>>
>>> json_dataset = ds.dataset("test.json", format="json")
>>> json_dataset.schema
today: timestamp[s]
>>> json_dataset.head(10)
pyarrow.Table
today: timestamp[s]
----
today: [[2025-03-24 00:00:00,2025-03-24 00:00:00,2025-03-24 00:00:00]]
```
### Component(s)
C++, Python