bk-mz opened a new issue, #9923:
URL: https://github.com/apache/iceberg/issues/9923
### Apache Iceberg version
1.4.3 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
When calling the maintenance procedure `rewrite_position_delete_files`:
```sql
CALL glue.system.rewrite_position_delete_files(
  table => 'table',
  where => "data_load_ts between TIMESTAMP '2024-02-07 13:51:58.729' and TIMESTAMP '2024-03-08 12:51:58.729'",
  options => map(
    'partial-progress.enabled', 'true',
    'min-file-size-bytes', '26843545',
    'max-file-size-bytes', '134217728',
    'min-input-files', '5',
    'max-concurrent-file-group-rewrites', '50'))
```
We observe the following exception:
```
java.lang.IllegalArgumentException: Multiple entries with same key: 1000=row.struct.substruct.substruct_field and 1000=partition.data_load_ts_hour
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:378) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:372) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:246) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:133) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:95) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:572) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:600) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:587) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.types.IndexByName.byId(IndexByName.java:81) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.types.TypeUtil.indexNameById(TypeUtil.java:172) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.lazyIdToName(Schema.java:183) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.<init>(Schema.java:112) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.<init>(Schema.java:91) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.<init>(Schema.java:87) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.Schema.<init>(Schema.java:160) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.PositionDeletesTable.calculateSchema(PositionDeletesTable.java:129) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.PositionDeletesTable.<init>(PositionDeletesTable.java:62) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
    at org.apache.iceberg.MetadataTableUtils.createMetadataTableInstance(MetadataTableUtils.java:81) ~[org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.4.3.jar:?]
```
## Steps to reproduce
The problem is reproducible locally:
```scala
import org.apache.iceberg._
import org.apache.iceberg.aws.glue._
import org.apache.iceberg.catalog._
import org.apache.iceberg.types._

val glue = new GlueCatalog()
glue.initialize("glue", new java.util.HashMap[String, String]())

val table = glue.loadTable(TableIdentifier.parse("db.table"))

// partition struct; its field ids start at 1000 and can collide with schema field ids
val partitionType = Partitioning.partitionType(table)

// mirrors PositionDeletesTable.calculateSchema(); throws when the table schema
// already contains field id 1000
val result =
  new Schema(
    Types.NestedField.optional(
      MetadataColumns.DELETE_FILE_ROW_FIELD_ID,
      MetadataColumns.DELETE_FILE_ROW_FIELD_NAME,
      table.schema().asStruct(),
      MetadataColumns.DELETE_FILE_ROW_DOC),
    Types.NestedField.required(
      MetadataColumns.PARTITION_COLUMN_ID,
      PositionDeletesTable.PARTITION,
      partitionType,
      "Partition that position delete row belongs to"))
```
## Triage
Triaging the bug, we see that calling the `partitionType` function on a table creates a struct whose field is assigned id 1000:
```
scala> val partitionType = Partitioning.partitionType(table);
partitionType: org.apache.iceberg.types.Types.StructType = struct<1000: data_load_ts_hour: optional int>
```
Further down, when creating the schema for `PositionDeletesTable`, the exception is thrown while evaluating:
```java
this.highestFieldId = lazyIdToName().keySet().stream().mapToInt(i -> i).max().orElse(0);
```
`lazyIdToName()` ends up in `IndexByName.byId()`:
```java
public Map<Integer, String> byId() {
  ImmutableMap.Builder<Integer, String> builder = ImmutableMap.builder();
  nameToId.forEach((key, value) -> builder.put(value, key)); // <-- the map is immutable; duplicate ids throw here
  return builder.build();
}
```
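For illustration, here is a minimal standalone sketch (hypothetical demo code, not from the Iceberg codebase) of how Guava's `ImmutableMap.Builder` reacts to the duplicate key, assuming the relocated Guava behaves like upstream Guava 31+ where `buildOrThrow()` exists:

```java
import com.google.common.collect.ImmutableMap;

public class DuplicateKeyDemo {
  public static void main(String[] args) {
    ImmutableMap.Builder<Integer, String> builder = ImmutableMap.builder();
    // the table schema already uses field id 1000 ...
    builder.put(1000, "row.struct.substruct.substruct_field");
    // ... and the partition struct reuses the same id
    builder.put(1000, "partition.data_load_ts_hour");
    // throws java.lang.IllegalArgumentException:
    //   Multiple entries with same key: 1000=... and 1000=...
    builder.buildOrThrow();
  }
}
```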
Triaging further: `PositionDeletesTable` builds a new schema by joining the existing table schema with the partition spec rendered as a struct type. `PartitionSpec` assigns that struct's first field a constant id of 1000:
```java
private int nextFieldId() {
  return lastAssignedFieldId.incrementAndGet();
}
// where
private final AtomicInteger lastAssignedFieldId =
    new AtomicInteger(unpartitionedLastAssignedId());
// where
private static int unpartitionedLastAssignedId() {
  return PARTITION_DATA_ID_START - 1;
}
```
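Since `PARTITION_DATA_ID_START` is 1000, the very first call to `nextFieldId()` returns 1000 no matter how many field ids the table schema already uses. A minimal standalone sketch of that id assignment (hypothetical demo code mirroring the logic above):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionFieldIdDemo {
  // mirrors PartitionSpec.PARTITION_DATA_ID_START
  static final int PARTITION_DATA_ID_START = 1000;

  // mirrors PartitionSpec's counter: starts at PARTITION_DATA_ID_START - 1
  static final AtomicInteger lastAssignedFieldId =
      new AtomicInteger(PARTITION_DATA_ID_START - 1);

  static int nextFieldId() {
    return lastAssignedFieldId.incrementAndGet();
  }

  public static void main(String[] args) {
    // the first partition field always gets id 1000, which collides with any
    // table schema that already contains field id 1000
    System.out.println(nextFieldId()); // 1000
    System.out.println(nextFieldId()); // 1001
  }
}
```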
In our case, the `hour` partitioning invokes `nextFieldId()`:
```java
PartitionField field =
    new PartitionField(sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.hour());
```
Then, when a table schema that already uses field id 1000 (i.e., more than 1000 field ids, counting nested fields) is joined with a partition struct whose field also has id 1000, putting the duplicate key into the immutable map fails.
## Possible solutions
Change the `Partitioning.partitionType(table)` method so that it always returns structs whose field ids follow the table's highest field id:
```java
public static StructType partitionType(Table table) {
  Collection<PartitionSpec> specs = table.specs().values();
  int highestFieldId = table.schema().highestFieldId();
  List<NestedField> sortedStructFields =
      buildPartitionNestedFields("table partition", specs, allFieldIds(specs));
  return StructType.of(
      sortedStructFields.stream()
          .map(f -> NestedField.optional(highestFieldId + f.fieldId(), f.name(), f.type()))
          .collect(Collectors.toList()));
}
```
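With this remapping (assuming the first partition field keeps its original id 1000, as shown in the triage above), a schema whose highest field id is, say, 1024 would assign `data_load_ts_hour` the id `1024 + 1000 = 2024`, which can no longer collide with any existing schema field id.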