jeffreyssmith2nd commented on issue #7925:
URL: https://github.com/apache/datafusion/issues/7925#issuecomment-2130106971
The case we're running into in InfluxDB when enabling timezones is slightly
different. It is a parquet file with Timestamp without a timezone and then
querying with either a timezone or in UTC. The odd part is that we have
manually set the schema to be 'UTC', even though the backing parquet file is
not.
> the issue originates when the schema provided to the ParquetExec and used
for planning doesn't match that of the underlying parquet file.
I have not been able to reproduce the issue using the datafusion-cli, as
there is no way (that I can see) to change the schema. It does behave as I
would expect when the schema matches the files.
This is the script I'm using to generate some parquet files.
```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame({
'time': [
pd.Timestamp(year=2024,month=1,day=1,second=0),
pd.Timestamp(year=2024,month=1,day=1,second=1),
pd.Timestamp(year=2024,month=1,day=1,second=2),
pd.Timestamp(year=2024,month=1,day=1,second=3),
],
})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example_no_tz.parquet')
df = pd.DataFrame({
'time': [
pd.Timestamp(year=2024,month=1,day=1,second=0,tz='UTC'),
pd.Timestamp(year=2024,month=1,day=1,second=1,tz='UTC'),
pd.Timestamp(year=2024,month=1,day=1,second=2,tz='UTC'),
pd.Timestamp(year=2024,month=1,day=1,second=3,tz='UTC'),
],
})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example_with_tz_utc.parquet')
```
Some sample queries:
```
> SET datafusion.execution.parquet.pushdown_filters=true;
0 row(s) fetched.
Elapsed 0.003 seconds.
> select * from './example_no_tz.parquet';
+---------------------+
| time |
+---------------------+
| 2024-01-01T00:00:00 |
| 2024-01-01T00:00:01 |
| 2024-01-01T00:00:02 |
| 2024-01-01T00:00:03 |
+---------------------+
4 row(s) fetched.
Elapsed 0.003 seconds.
> select * from './example_no_tz.parquet' where time >=
'2024-01-01T00:00:02.000';
+---------------------+
| time |
+---------------------+
| 2024-01-01T00:00:02 |
| 2024-01-01T00:00:03 |
+---------------------+
2 row(s) fetched.
Elapsed 0.009 seconds.
> select * from './example_no_tz.parquet' where time >=
('2024-01-01T00:00:02.000' at time zone 'Europe/Brussels');
+---------------------+
| time |
+---------------------+
| 2024-01-01T00:00:00 |
| 2024-01-01T00:00:01 |
| 2024-01-01T00:00:02 |
| 2024-01-01T00:00:03 |
+---------------------+
4 row(s) fetched.
Elapsed 0.006 seconds.
```
If you perform any queries against the example with timezone, where the
predicate is not in the same timezone, you get a type coercion error, which I
believe makes sense.
```
> select * from './example_with_tz_utc.parquet';
+----------------------+
| time |
+----------------------+
| 2024-01-01T00:00:00Z |
| 2024-01-01T00:00:01Z |
| 2024-01-01T00:00:02Z |
| 2024-01-01T00:00:03Z |
+----------------------+
4 row(s) fetched.
Elapsed 0.004 seconds.
> select * from './example_with_tz_utc.parquet' where time >=
'2024-01-01T00:00:02.000';
+----------------------+
| time |
+----------------------+
| 2024-01-01T00:00:02Z |
| 2024-01-01T00:00:03Z |
+----------------------+
2 row(s) fetched.
Elapsed 0.007 seconds.
> select * from './example_with_tz_utc.parquet' where time >=
('2024-01-01T00:00:02.000' at time zone 'Europe/Brussels');
Error during planning: Cannot infer common argument type for comparison
operation Timestamp(Nanosecond, Some("UTC")) >= Timestamp(Nanosecond,
Some("Europe/Brussels"))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]