ssandona commented on issue #9923:
URL: https://github.com/apache/iceberg/issues/9923#issuecomment-1987928914
Here is a quick snippet to reproduce the error (PySpark):
```python
from pyspark.sql.functions import col
ICEBERG_DB_NAME="mydb"
ICEBERG_TABLE_NAME_MOR="my_mor_table"
# Define the number of columns
num_columns = 1010
# Create column names
column_names = [f"col{i}" for i in range(1, num_columns + 1)]
# Create 5 rows: col1 (the partition column) is always 1, the remaining columns hold the row number
data = [tuple([1] + [v] * (num_columns - 1)) for v in range(1, 6)]
# Create a DataFrame with the specified rows
df_with_row = spark.createDataFrame(data, column_names)
df_with_row.createOrReplaceTempView("table_input")
spark.sql(f"""
CREATE TABLE {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}
USING iceberg
PARTITIONED BY (col1)
TBLPROPERTIES (
'format-version'='2',
'write.delete.mode'='merge-on-read',
'write.update.mode'='merge-on-read',
'write.merge.mode'='merge-on-read',
'write.distribution-mode'='hash',
'write.delete.distribution-mode'='hash',
'write.update.distribution-mode'='hash',
'write.merge.distribution-mode'='hash'
)
AS SELECT * FROM table_input
"""
)
spark.sql(f"""
MERGE INTO {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR} t
USING (SELECT * FROM table_input WHERE col2 = 1) s
ON t.col2 = s.col2
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
)
spark.sql(f"""
MERGE INTO {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR} t
USING (SELECT * FROM table_input WHERE col2 = 2) s
ON t.col2 = s.col2
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
)
```
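One detail that seems relevant: both failures complain about key `1000`, and with the 1-based naming above the 1000th data column is exactly `col1000`, the name that appears in the duplicate-key errors. A quick sanity check of the naming scheme (plain Python, no Spark needed):

```python
# Reconstruct the column naming used in the repro above.
num_columns = 1010
column_names = [f"col{i}" for i in range(1, num_columns + 1)]

# The 1000th column (0-based index 999) is "col1000", matching the
# name that shows up in the duplicate-key errors below.
print(column_names[999])  # → col1000
```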
This fails:
```python
spark.sql(f"""
CALL system.rewrite_position_delete_files(
  table => '{ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}',
  options => map('rewrite-all', 'true')
)
"""
)
```
Error:
```
An error was encountered:
Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
```
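For what it's worth, the exception text reads like a duplicate-key failure while building an immutable field-ID-to-name map (the wording matches Guava's `ImmutableMap` duplicate-key message). A minimal Python sketch of that failure mode, with hypothetical names and IDs; this is only an illustration of the error shape, not Iceberg's actual code:

```python
def build_id_to_name_map(entries):
    """Build a field-ID -> name map, rejecting duplicate IDs
    (mimicking Guava ImmutableMap's duplicate-key behavior)."""
    mapping = {}
    for field_id, name in entries:
        if field_id in mapping:
            raise ValueError(
                f"Multiple entries with same key: "
                f"{field_id}={mapping[field_id]} and {field_id}={name}"
            )
        mapping[field_id] = name
    return mapping

# Hypothetical scenario: a metadata field (partition.col1) is assigned
# ID 1000 while the data columns occupy IDs 1..1010, so the 1000th
# data column collides with it.
entries = [(1000, "partition.col1")]
entries += [(i, f"row.col{i}") for i in range(1, 1011)]
try:
    build_id_to_name_map(entries)
except ValueError as e:
    print(e)  # → Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
```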
Also this fails:
```python
spark.sql(f"""
UPDATE {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}
SET col2 = 6
WHERE col2 = 1
"""
)
```
Error:
```
An error was encountered:
Multiple entries with same key: 1000=_partition.col1 and 1000=col1000
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: Multiple entries with same key: 1000=_partition.col1 and 1000=col1000
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]