Cabeda opened a new issue, #1804:
URL: https://github.com/apache/iceberg-python/issues/1804
### Apache Iceberg version
0.9.0 (latest release)
### Please describe the bug 🐞
Hi,
I'm not sure whether this is a bug, but at worst it may be a useful
reference for others in the future.
I've created a table as follows using pyiceberg:
```python
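from pyiceberg.schema import Schema
from pyiceberg.types import (
    BooleanType,
    DoubleType,
    NestedField,
    StringType,
    TimestampType,
)

schema = Schema(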
NestedField(field_id=1, name="bk_id",
field_type=StringType(), required=False),
NestedField(field_id=2, name="possnr",
field_type=StringType(), required=False),
NestedField(field_id=3, name="posg",
field_type=StringType(), required=False),
NestedField(field_id=4, name="posp",
field_type=StringType(), required=False),
NestedField(field_id=5, name="aendnr_hinzu",
field_type=StringType(), required=False),
NestedField(field_id=6, name="bk_prod",
field_type=StringType(), required=False),
NestedField(field_id=7, name="bend",
field_type=StringType(), required=False),
NestedField(field_id=8, name="prod_ort",
field_type=StringType(), required=False),
NestedField(field_id=9, name="erzvgr",
field_type=StringType(), required=False),
NestedField(field_id=10, name="vkgr_art",
field_type=StringType(), required=False),
NestedField(field_id=11, name="sachnummernart",
field_type=StringType(), required=False),
NestedField(field_id=19, name="bza_log",
field_type=StringType(), required=False),
NestedField(field_id=20, name="eins_key",
field_type=StringType(), required=False),
NestedField(field_id=21, name="akkz",
field_type=StringType(), required=False),
NestedField(field_id=22, name="werk_zust",
field_type=StringType(), required=False),
NestedField(field_id=23, name="aendnr_entf",
field_type=StringType(), required=False),
NestedField(field_id=24, name="bearb_kz",
field_type=StringType(), required=False),
NestedField(field_id=25, name="pruefkz",
field_type=StringType(), required=False),
NestedField(field_id=26, name="drukz",
field_type=StringType(), required=False),
NestedField(field_id=27, name="kswkz",
field_type=StringType(), required=False),
NestedField(field_id=28, name="erskz",
field_type=StringType(), required=False),
NestedField(field_id=29, name="wdat",
field_type=TimestampType(), required=False),
NestedField(field_id=30, name="aedat",
field_type=TimestampType(), required=False),
NestedField(field_id=31, name="aeuser",
field_type=StringType(), required=False),
NestedField(field_id=32, name="pgm_name",
field_type=StringType(), required=False),
NestedField(field_id=33, name="kast_kz",
field_type=StringType(), required=False),
NestedField(field_id=34, name="rnum",
field_type=StringType(), required=False),
NestedField(field_id=35, name="mbma",
field_type=StringType(), required=False),
NestedField(field_id=36, name="mbma_confidence",
field_type=DoubleType(), required=False),
NestedField(field_id=37, name="kost",
field_type=StringType(), required=False),
NestedField(field_id=38, name="kost_confidence",
field_type=DoubleType(), required=False),
NestedField(field_id=39, name="stand",
field_type=StringType(), required=False),
NestedField(field_id=40, name="stand_confidence",
field_type=DoubleType(), required=False),
NestedField(field_id=41, name="inference_date",
field_type=TimestampType(), required=False),
NestedField(field_id=42, name="verified",
field_type=BooleanType(), required=False),
NestedField(field_id=43, name="id", field_type=StringType(),
required=True),
)
```
I've been able to do multiple appends to the table using pyiceberg with no
issues.
Now, to run some tests and prepare to use the new upsert operation, I
decided to append a row with id = 'dummy_id' and then run a scan filtering on
it. When I run the scan through AWS Athena I see the row. However, when I run
the scan with `dummy = table.scan(row_filter=EqualTo("id", 'dummy_id'))` I get
`list index out of range`. This seems to be because pyiceberg isn't able to
retrieve the row.
Here is the code I have set up to reproduce the issue:
```python
import pandas as pd
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

df = pa.Table.from_pydict(
{
"bk_id": ["BK123456"],
"possnr": ["POS789"],
"posp": ["777"],
"posg": ["888"],
"bk_prod": ["PROD456"],
"bend": ["B789"],
"prod_ort": ["C"],
"erzvgr": ["E001"],
"vkgr_art": ["VA123"],
"sachnummernart": ["S456"],
"bza_log": ["BL789"],
"akkz": ["AK123"],
"werk_zust": ["WZ456"],
"aendnr_entf": ["AE789"],
"bearb_kz": ["BK001"],
"pruefkz": ["PK002"],
"drukz": ["D003"],
"kswkz": ["K004"],
"eins_key": [1],
"erskz": ["E005"],
"aeuser": ["user123"],
"pgm_name": ["program456"],
"kast_kz": ["KK789"],
"rnum": ["R001"],
"mbma": ["M123"],
"mbma_confidence": [0.85],
"kost": ["K456"],
"kost_confidence": [0.92],
"stand": ["S789"],
"stand_confidence": [0.78],
"inference_date": [pd.Timestamp.now()],
"verified": [False],
"id": ["dummy_id"],
}
)
catalog = load_catalog(
"glue",
**{
"type": "glue",
"warehouse": warehouse_path,
"downcast-ns-timestamp-to-us-on-write": True,
},
)
table_identifier = "database_name.table_name"
table = catalog.load_table(table_identifier)
table.append(df)
dummy = table.scan(row_filter=EqualTo("id", 'dummy_id'))
dummy.to_arrow()
```
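As a quick sanity check, the columns in the dict above can be compared against the schema. The following is a minimal, stdlib-only sketch: it stands in plain Python types for the Iceberg types (`str` for `StringType`, `datetime` for `TimestampType`, `float` for `DoubleType`, `bool` for `BooleanType`), and the names `expected`, `row`, and `compare` are illustrative only, not part of pyiceberg or pyarrow.

```python
from datetime import datetime

# Column names copied from the schema above, grouped by declared type.
STRING_COLS = """bk_id possnr posg posp aendnr_hinzu bk_prod bend prod_ort erzvgr
vkgr_art sachnummernart bza_log eins_key akkz werk_zust aendnr_entf bearb_kz
pruefkz drukz kswkz erskz aeuser pgm_name kast_kz rnum mbma kost stand id""".split()

expected = {name: str for name in STRING_COLS}
expected.update({n: datetime for n in ("wdat", "aedat", "inference_date")})
expected.update(
    {n: float for n in ("mbma_confidence", "kost_confidence", "stand_confidence")}
)
expected["verified"] = bool

# One value per column, copied from the repro dict
# (datetime.now() stands in for pd.Timestamp.now()).
row = {
    "bk_id": "BK123456", "possnr": "POS789", "posp": "777", "posg": "888",
    "bk_prod": "PROD456", "bend": "B789", "prod_ort": "C", "erzvgr": "E001",
    "vkgr_art": "VA123", "sachnummernart": "S456", "bza_log": "BL789",
    "akkz": "AK123", "werk_zust": "WZ456", "aendnr_entf": "AE789",
    "bearb_kz": "BK001", "pruefkz": "PK002", "drukz": "D003", "kswkz": "K004",
    "eins_key": 1, "erskz": "E005", "aeuser": "user123",
    "pgm_name": "program456", "kast_kz": "KK789", "rnum": "R001",
    "mbma": "M123", "mbma_confidence": 0.85, "kost": "K456",
    "kost_confidence": 0.92, "stand": "S789", "stand_confidence": 0.78,
    "inference_date": datetime.now(), "verified": False, "id": "dummy_id",
}


def compare(row, expected):
    """Return (missing columns, unexpected columns, {column: actual type name})."""
    missing = sorted(set(expected) - set(row))
    extra = sorted(set(row) - set(expected))
    mismatched = {
        name: type(value).__name__
        for name, value in row.items()
        if name in expected and not isinstance(value, expected[name])
    }
    return missing, extra, mismatched


missing, extra, mismatched = compare(row, expected)
print("missing from row:", missing)
print("not in schema:", extra)
print("type mismatches:", mismatched)
```

The check reports that `aendnr_hinzu`, `wdat`, and `aedat` are absent from the appended row and that `eins_key` holds an integer although the schema declares it as a string. Whether either difference is related to the failing scan is not established here.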
Is there something I'm doing wrong?
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]