bigluck opened a new issue, #7736:
URL: https://github.com/apache/iceberg/issues/7736
### Apache Iceberg version
main (development)
### Query engine
Other
### Please describe the bug 🐞
Hi @Fokko, I'm not sure if it's a bug, but I'm encountering strange behavior
when trying to scan a partitioned table.
Dataset: taxi (full dataset)
Data catalog: glue
Table partitions: `request_datetime`, transform=`month`
This is my snippet:
```python
from datetime import timedelta, datetime, timezone

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual, LessThanOrEqual, And

catalog = load_catalog('default', type='glue')
table = catalog.load_table(('biglake', 'taxi_dremio_by_month'))

from_date = datetime(2021, 1, 1, 0, 0, 0, 0, tzinfo=timezone.utc)
to_date = from_date + timedelta(days=7)

scan = table.scan(
    row_filter=And(
        GreaterThanOrEqual('request_datetime',
                           from_date.strftime('%Y-%m-%dT00:00:00.000+00:00')),
        LessThanOrEqual('request_datetime',
                        to_date.strftime('%Y-%m-%dT00:00:00.000+00:00')),
    ),
    selected_fields=('request_datetime',),
)
files = [plan.file.file_path for plan in scan.plan_files()]
```
`scan.metadata.partitions_spec[0]` contains `{'name':
'request_datetime_month', 'transform': 'month', 'source-id': 4, 'field-id':
1000}` (it's the only partition), and this is the entire content of the scan
object:
<img width="817" alt="Screenshot 2023-05-30 at 11 45 20"
src="https://github.com/apache/iceberg/assets/1511095/f787af1f-5f2f-40a6-bc7f-6a01a0bae4ba">
The final value of the scan.row_filter variable is:
```python
And(left=GreaterThanOrEqual(term=Reference(name='request_datetime'),
literal=literal('2021-01-01T00:00:00.000+00:00')),
right=LessThanOrEqual(term=Reference(name='request_datetime'),
literal=literal('2021-01-08T00:00:00.000+00:00')))
```
Once the code reaches the next statement (`files = ...`), it crashes with this
error:
```
Traceback (most recent call last):
  File "/Users/bigluck/Desktop/duckbanch/run_pyiceberg2.py", line 121, in <module>
    res = run(
  File "/Users/bigluck/Desktop/duckbanch/run_pyiceberg2.py", line 96, in run
    files = [plan.file.file_path for plan in scan.plan_files()]
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 394, in plan_files
    *pool.starmap(
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 332, in _open_manifest
    return [FileScanTask(file) for file in matching_partition_data_files if metrics_evaluator(file)]
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 332, in <listcomp>
    return [FileScanTask(file) for file in matching_partition_data_files if metrics_evaluator(file)]
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 367, in <lambda>
    return lambda data_file: evaluator(data_file.partition)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 468, in eval
    return visit(self.bound, self)
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 177, in _
    left_result: T = visit(obj.left, visitor=visitor)
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 191, in _
    return visitor.visit_bound_predicate(predicate=obj)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 347, in visit_bound_predicate
    return visit_bound_predicate(predicate, self)
  File "/Users/bigluck/.pyenv/versions/3.10.11/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 398, in _
    return visitor.visit_greater_than_or_equal(term=expr.term, literal=expr.literal)
  File "/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py", line 497, in visit_greater_than_or_equal
    return term.eval(self.struct) >= literal.value
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
```
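The last frame compares `term.eval(self.struct)` with `literal.value`, and the error says the left side is `None`. Here is a minimal sketch in plain Python, independent of PyIceberg, of how a null partition value produces this same error. The names `partition_value` and `null_safe_gte` are hypothetical and used only for illustration, and the guard is just one possible way to treat nulls, not necessarily how PyIceberg should handle it:

```python
from typing import Optional


def null_safe_gte(partition_value: Optional[int], literal_value: int) -> bool:
    """Hypothetical guard: a null partition value never satisfies '>='."""
    if partition_value is None:
        return False
    return partition_value >= literal_value


# Assumption: the partition struct of some data file holds a null month value.
partition_value = None
literal_value = 612  # the literal from the bound predicate

try:
    partition_value >= literal_value  # same comparison as in the last frame
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(null_safe_gte(partition_value, literal_value))
```

If this matches what happens inside the evaluator, at least one data file in the manifests would have a null value for `request_datetime_month`.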
I've added a print at the `File
"/Users/bigluck/Desktop/duckbanch/.venv/lib/python3.10/site-packages/pyiceberg/expressions/visitors.py",
line 347, in visit_bound_predicate` frame, and this is the content of the
`predicate` var:
```
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundLessThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
BoundGreaterThanOrEqual(term=BoundReference(field=NestedField(field_id=1000,
name='request_datetime_month', field_type=IntegerType(), required=False),
accessor=Accessor(position=0,inner=None)), literal=LongLiteral(612))
```
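Both bounds print as `LongLiteral(612)`. That would be consistent with the `month` transform mapping a timestamp to the number of whole months since 1970-01, so January 2021 becomes (2021 - 1970) * 12 + 0 = 612. Here is a small sketch of that arithmetic in plain Python; `months_since_epoch` is a hypothetical helper written for this example, not a PyIceberg function:

```python
from datetime import datetime, timedelta, timezone


def months_since_epoch(ts: datetime) -> int:
    """Hypothetical helper: whole months between 1970-01 and the month of ts."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


from_date = datetime(2021, 1, 1, tzinfo=timezone.utc)
to_date = from_date + timedelta(days=7)

lower = months_since_epoch(from_date)
upper = months_since_epoch(to_date)
print(lower, upper)
```

Since both dates fall in January 2021, the lower and upper bounds on `request_datetime_month` end up being the same value.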
It's unclear to me if it's a bug, a problem with the table itself or if I'm
passing invalid values to the `row_filter` argument, but this SQL query (done
using Athena) works:
```sql
SELECT DATE_TRUNC('day', "request_datetime"), COUNT(*) FROM
"taxi_dremio_by_month"
WHERE "request_datetime" >= CAST('2021-01-01' AS DATE) AND
"request_datetime" <= CAST('2021-01-08' AS DATE)
GROUP BY 1
ORDER BY 1
```
Can you help me? Thanks so much.