ssandona commented on issue #9923:
URL: https://github.com/apache/iceberg/issues/9923#issuecomment-1987928914
Here is a quick snippet to reproduce the error (PySpark):
```python
from pyspark.sql.functions import col
ICEBERG_DB_NAME="mydb"
ICEBERG_TABLE_NAME_MOR="my_mor_table"
# Define the number of columns
num_columns = 1010
# Create column names
column_names = [f"col{i}" for i in range(1, num_columns + 1)]
# Create 5 rows: col1 (the partition column) is always 1, the remaining columns hold the row number
data = [tuple([1] + [v] * (num_columns - 1)) for v in range(1, 6)]
# Create a DataFrame with the specified rows
df_with_row = spark.createDataFrame(data, column_names)
df_with_row.createOrReplaceTempView("table_input")
spark.sql(f"""
CREATE TABLE {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}
USING iceberg
PARTITIONED BY (col1)
TBLPROPERTIES (
'format-version'='2',
'write.delete.mode'='merge-on-read',
'write.update.mode'='merge-on-read',
'write.merge.mode'='merge-on-read',
'write.distribution-mode'='hash',
'write.delete.distribution-mode'='hash',
'write.update.distribution-mode'='hash',
'write.merge.distribution-mode'='hash'
)
AS SELECT * FROM table_input
"""
)
spark.sql(f"""
MERGE INTO {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR} t
USING (SELECT * FROM table_input WHERE col2 = 1) s
ON t.col2 = s.col2
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
)
spark.sql(f"""
MERGE INTO {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR} t
USING (SELECT * FROM table_input WHERE col2 = 2) s
ON t.col2 = s.col2
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
)
```
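One detail that seems relevant: both failures complain about key `1000`, and with the 1-based naming above the 1000th data column is exactly `col1000`, the name that appears in the duplicate-key errors. A quick sanity check of the naming scheme (plain Python, no Spark needed):

```python
# Reconstruct the column naming used in the repro above.
num_columns = 1010
column_names = [f"col{i}" for i in range(1, num_columns + 1)]

# The 1000th column (0-based index 999) is "col1000", matching the
# name that shows up in the duplicate-key errors below.
print(column_names[999])  # → col1000
```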
This fails:
```python
spark.sql(f"""
CALL system.rewrite_position_delete_files(
  table => '{ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}',
  options => map('rewrite-all', 'true')
)
"""
)
```
Error:
```
An error was encountered:
Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
```
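For what it's worth, the exception text reads like a duplicate-key failure while building an immutable field-ID-to-name map (the wording matches Guava's `ImmutableMap` duplicate-key message). A minimal Python sketch of that failure mode, with hypothetical names and IDs; this is only an illustration of the error shape, not Iceberg's actual code:

```python
def build_id_to_name_map(entries):
    """Build a field-ID -> name map, rejecting duplicate IDs
    (mimicking Guava ImmutableMap's duplicate-key behavior)."""
    mapping = {}
    for field_id, name in entries:
        if field_id in mapping:
            raise ValueError(
                f"Multiple entries with same key: "
                f"{field_id}={mapping[field_id]} and {field_id}={name}"
            )
        mapping[field_id] = name
    return mapping

# Hypothetical scenario: a metadata field (partition.col1) is assigned
# ID 1000 while the data columns occupy IDs 1..1010, so the 1000th
# data column collides with it.
entries = [(1000, "partition.col1")]
entries += [(i, f"row.col{i}") for i in range(1, 1011)]
try:
    build_id_to_name_map(entries)
except ValueError as e:
    print(e)  # → Multiple entries with same key: 1000=partition.col1 and 1000=row.col1000
```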
Also this fails:
```python
spark.sql(f"""
UPDATE {ICEBERG_DB_NAME}.{ICEBERG_TABLE_NAME_MOR}
SET col2 = 6
WHERE col2 = 1
"""
)
```
Error:
```
An error was encountered:
Multiple entries with same key: 1000=_partition.col1 and 1000=col1000
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: Multiple entries with same key: 1000=_partition.col1 and 1000=col1000
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]