bk-mz opened a new issue, #9923:
URL: https://github.com/apache/iceberg/issues/9923
### Apache Iceberg version
1.4.3 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
When calling the maintenance procedure `rewrite_position_delete_files`:
```sql
CALL glue.system.rewrite_position_delete_files(
  table => 'table',
  where => "data_load_ts between TIMESTAMP '2024-02-07 13:51:58.729' and TIMESTAMP '2024-03-08 12:51:58.729'",
  options => map(
    'partial-progress.enabled', 'true',
    'min-file-size-bytes', '26843545',
    'max-file-size-bytes', '134217728',
    'min-input-files', '5',
    'max-concurrent-file-group-rewrites', '50'))
```
We observe the following exception:
```
java.lang.IllegalArgumentException: Multiple entries with same key: 1000=row.struct.substruct.substruct_field and 1000=partition.data_load_ts_hour
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:378) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:372) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:246) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:133) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:95) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:572) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:600) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:587) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.types.IndexByName.byId(IndexByName.java:81) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.types.TypeUtil.indexNameById(TypeUtil.java:172) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.lazyIdToName(Schema.java:183) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.<init>(Schema.java:112) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.<init>(Schema.java:91) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.<init>(Schema.java:87) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.<init>(Schema.java:160) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.PositionDeletesTable.calculateSchema(PositionDeletesTable.java:129) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.PositionDeletesTable.<init>(PositionDeletesTable.java:62) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.MetadataTableUtils.createMetadataTableInstance(MetadataTableUtils.java:81) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
```
## Steps to reproduce
The problem is reproducible locally:
```scala
import org.apache.iceberg._
import org.apache.iceberg.aws.glue._
import org.apache.iceberg.catalog._
import org.apache.iceberg.types._

val glue = new GlueCatalog()
glue.initialize("glue", new java.util.HashMap[String, String]())

val table = glue.loadTable(TableIdentifier.parse("db.table"))

// partition struct; its field ids start at 1000 and can collide with schema field ids
val partitionType = Partitioning.partitionType(table)

// mirrors PositionDeletesTable.calculateSchema(); throws when the table schema
// already contains field id 1000
val result =
  new Schema(
    Types.NestedField.optional(
      MetadataColumns.DELETE_FILE_ROW_FIELD_ID,
      MetadataColumns.DELETE_FILE_ROW_FIELD_NAME,
      table.schema().asStruct(),
      MetadataColumns.DELETE_FILE_ROW_DOC),
    Types.NestedField.required(
      MetadataColumns.PARTITION_COLUMN_ID,
      PositionDeletesTable.PARTITION,
      partitionType,
      "Partition that position delete row belongs to"))
```
## Triage
Triaging the bug, we see that calling the `partitionType` function on a table creates a struct whose field is assigned id 1000:
```
scala> val partitionType = Partitioning.partitionType(table);
partitionType: org.apache.iceberg.types.Types.StructType = struct<1000: data_load_ts_hour: optional int>
```
Further down, when creating the schema for `PositionDeletesTable`, the exception is thrown while evaluating:
```java
this.highestFieldId = lazyIdToName().keySet().stream().mapToInt(i -> i).max().orElse(0);
```
`lazyIdToName()` ends up in `IndexByName.byId()`:
```java
public Map<Integer, String> byId() {
  ImmutableMap.Builder<Integer, String> builder = ImmutableMap.builder();
  nameToId.forEach((key, value) -> builder.put(value, key)); // <-- the map is immutable; duplicate ids throw here
  return builder.build();
}
```
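For illustration, here is a minimal standalone sketch (hypothetical demo code, not from the Iceberg codebase) of how Guava's `ImmutableMap.Builder` reacts to the duplicate key, assuming the relocated Guava behaves like upstream Guava 31+ where `buildOrThrow()` exists:

```java
import com.google.common.collect.ImmutableMap;

public class DuplicateKeyDemo {
  public static void main(String[] args) {
    ImmutableMap.Builder<Integer, String> builder = ImmutableMap.builder();
    // the table schema already uses field id 1000 ...
    builder.put(1000, "row.struct.substruct.substruct_field");
    // ... and the partition struct reuses the same id
    builder.put(1000, "partition.data_load_ts_hour");
    // throws java.lang.IllegalArgumentException:
    //   Multiple entries with same key: 1000=... and 1000=...
    builder.buildOrThrow();
  }
}
```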
Triaging further: `PositionDeletesTable` builds a new schema by joining the existing table schema with the partition spec rendered as a struct type. `PartitionSpec` assigns that struct's first field a constant id of 1000:
```java
private int nextFieldId() {
  return lastAssignedFieldId.incrementAndGet();
}
// where
private final AtomicInteger lastAssignedFieldId =
    new AtomicInteger(unpartitionedLastAssignedId());
// where
private static int unpartitionedLastAssignedId() {
  return PARTITION_DATA_ID_START - 1;
}
```
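Since `PARTITION_DATA_ID_START` is 1000, the very first call to `nextFieldId()` returns 1000 no matter how many field ids the table schema already uses. A minimal standalone sketch of that id assignment (hypothetical demo code mirroring the logic above):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionFieldIdDemo {
  // mirrors PartitionSpec.PARTITION_DATA_ID_START
  static final int PARTITION_DATA_ID_START = 1000;

  // mirrors PartitionSpec's counter: starts at PARTITION_DATA_ID_START - 1
  static final AtomicInteger lastAssignedFieldId =
      new AtomicInteger(PARTITION_DATA_ID_START - 1);

  static int nextFieldId() {
    return lastAssignedFieldId.incrementAndGet();
  }

  public static void main(String[] args) {
    // the first partition field always gets id 1000, which collides with any
    // table schema that already contains field id 1000
    System.out.println(nextFieldId()); // 1000
    System.out.println(nextFieldId()); // 1001
  }
}
```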
In our case, the `hour` partitioning invokes `nextFieldId()`:
```java
PartitionField field =
    new PartitionField(sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.hour());
```
Then, when a table schema that already uses field id 1000 (i.e., more than 1000 field ids, counting nested fields) is joined with a partition struct whose field also has id 1000, putting the duplicate key into the immutable map fails.
## Possible solutions
Change the `Partitioning.partitionType(table)` method so that it always returns structs whose field ids follow the table's highest field id:
```java
public static StructType partitionType(Table table) {
  Collection<PartitionSpec> specs = table.specs().values();
  int highestFieldId = table.schema().highestFieldId();
  List<NestedField> sortedStructFields =
      buildPartitionNestedFields("table partition", specs, allFieldIds(specs));
  return StructType.of(
      sortedStructFields.stream()
          .map(f -> NestedField.optional(highestFieldId + f.fieldId(), f.name(), f.type()))
          .collect(Collectors.toList()));
}
```
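With this remapping (assuming the first partition field keeps its original id 1000, as shown in the triage above), a schema whose highest field id is, say, 1024 would assign `data_load_ts_hour` the id `1024 + 1000 = 2024`, which can no longer collide with any existing schema field id.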