ahmedabu98 opened a new issue, #11900:
URL: https://github.com/apache/iceberg/issues/11900
### Apache Iceberg version
1.7.1 (latest release)
### Query engine
None
### Please describe the bug 🐞
Part of our workflow in Apache Beam's Iceberg connector requires recreating
DataFiles, but this fails when the table is partitioned by month or hour.
The following code reproduces the problem:
```java
// Two timestamp columns, one partitioned by month and one by hour.
org.apache.iceberg.Schema schema =
    new org.apache.iceberg.Schema(
        Types.NestedField.required(1, "month", Types.TimestampType.withoutZone()),
        Types.NestedField.required(2, "hour", Types.TimestampType.withoutZone()));
PartitionSpec spec =
    PartitionSpec.builderFor(schema).month("month").hour("hour").build();
Table table = catalog.createTable(TableIdentifier.parse("db.table"), schema, spec);

LocalDateTime val = LocalDateTime.parse("2024-10-08T13:18:20.053");
Record rec =
    GenericRecord.create(schema)
        .copy(
            ImmutableMap.of(
                "month", val,
                "hour", val));

// getPartitionableRecord is a helper that is not shown in this snippet.
Record partitionableRec = getPartitionableRecord(rec, spec, schema);
PartitionKey pk = new PartitionKey(spec, schema);
pk.partition(partitionableRec);

// Write a single record to a new Parquet data file in that partition.
DataWriter<Record> writer =
    Parquet.writeData(
            table
                .io()
                .newOutputFile(table.locationProvider().newDataLocation(spec, pk, "test_file")))
        .createWriterFunc(GenericParquetWriter::buildWriter)
        .schema(table.schema())
        .withSpec(table.spec())
        .withPartition(pk)
        .overwrite()
        .build();
writer.write(rec);
writer.close();
DataFile file = writer.toDataFile();

// recreate data file using the original file
DataFiles.builder(spec)
    .withPath(file.path().toString())
    .withFormat(file.format())
    .withPartition(file.partition())
    .withFileSizeInBytes(file.fileSizeInBytes())
    .withRecordCount(file.recordCount())
    .withPartitionPath(spec.partitionToPath(file.partition())) // throws NumberFormatException
    .build();
```
The final `withPartitionPath` call fails with the following error:
```
java.lang.NumberFormatException: For input string: "2024-10"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:652)
    at java.base/java.lang.Integer.valueOf(Integer.java:983)
    at org.apache.iceberg.types.Conversions.fromPartitionString(Conversions.java:51)
    at org.apache.iceberg.DataFiles.fillFromPath(DataFiles.java:86)
    at org.apache.iceberg.DataFiles$Builder.withPartitionPath(DataFiles.java:266)
```
I would expect that the result of `spec.partitionToPath(file.partition())`
could be passed directly to `withPartitionPath` when recreating the DataFile,
but the [logic
here](https://github.com/apache/iceberg/blob/e3f50e5c62d01f3f31239d197ef281fc36cf31fa/core/src/main/java/org/apache/iceberg/DataFiles.java#L78-L87)
does not handle this case: per the stack trace, it tries to parse the
human-readable month value `2024-10` as an integer.
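
To make the mismatch concrete, here is a minimal sketch that does not depend on Iceberg. It assumes, following the partition transform definitions in the Iceberg table spec, that month and hour transforms store integer ordinals counted from 1970-01-01, and that the human-readable path forms are `yyyy-MM` and `yyyy-MM-dd-HH`. The class and method names are hypothetical and only mimic that behavior; they are not Iceberg APIs.

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

public class PartitionStringMismatch {
  static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

  // Assumed human-readable form of a month ordinal: months since 1970-01 -> "yyyy-MM".
  static String humanMonth(int monthOrdinal) {
    return YearMonth.of(1970, 1).plusMonths(monthOrdinal).toString();
  }

  // Assumed human-readable form of an hour ordinal: hours since epoch -> "yyyy-MM-dd-HH".
  static String humanHour(int hourOrdinal) {
    LocalDateTime t = EPOCH.plusHours(hourOrdinal);
    return String.format(
        "%04d-%02d-%02d-%02d", t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour());
  }

  // Mirrors the Integer.valueOf call that appears in the stack trace.
  static boolean parsesAsInt(String value) {
    try {
      Integer.valueOf(value);
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    LocalDateTime val = LocalDateTime.parse("2024-10-08T13:18:20.053");
    int month = (int) ChronoUnit.MONTHS.between(EPOCH, val);
    int hour = (int) ChronoUnit.HOURS.between(EPOCH, val);

    // The ordinals parse as integers; their human-readable forms do not.
    System.out.println(month + " parses as int: " + parsesAsInt(String.valueOf(month)));
    System.out.println(humanMonth(month) + " parses as int: " + parsesAsInt(humanMonth(month)));
    System.out.println(humanHour(hour) + " parses as int: " + parsesAsInt(humanHour(hour)));
  }
}
```

Under these assumptions, the path written for the partition and the path the builder can read back use two different encodings of the same value, which is why the round trip fails.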
We've been able to use this
[workaround](https://github.com/apache/beam/blob/18ec3317e500a6fee72fc8c24552c21808437bef/sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/RecordWriterManager.java#L237-L259),
replicated below:
<details>
<summary><b>Workaround</b></summary>
```java
// EPOCH and HOUR_FORMATTER are constants that are not shown in this snippet.
static String getPartitionDataPath(
    String partitionPath, Map<String, PartitionField> partitionFieldMap) {
  if (partitionPath.isEmpty() || partitionFieldMap.isEmpty()) {
    return partitionPath;
  }
  List<String> resolved = new ArrayList<>();
  // Each path segment has the form "name=value".
  for (String partition : Splitter.on('/').splitToList(partitionPath)) {
    List<String> nameAndValue = Splitter.on('=').splitToList(partition);
    String name = nameAndValue.get(0);
    String value = nameAndValue.get(1);
    String transformName =
        Preconditions.checkArgumentNotNull(partitionFieldMap.get(name)).transform().toString();
    // Replace human-readable month/hour values with integers the builder can parse.
    if (Transforms.month().toString().equals(transformName)) {
      int month = YearMonth.parse(value).getMonthValue();
      value = String.valueOf(month);
    } else if (Transforms.hour().toString().equals(transformName)) {
      long hour = ChronoUnit.HOURS.between(EPOCH, LocalDateTime.parse(value, HOUR_FORMATTER));
      value = String.valueOf(hour);
    }
    resolved.add(name + "=" + value);
  }
  return String.join("/", resolved);
}
```
</details>
However, I would expect the Iceberg API to handle this conversion itself.
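
For anyone who wants to experiment with the conversion outside of Beam or Iceberg, the following is a standalone sketch using only `java.time`. It is not the connector's code: the class name, the `toOrdinalPath` method, and the plain `Map<String, String>` from partition field name to transform name are hypothetical stand-ins for `PartitionField`. The partition field names `month_month` and `hour_hour` in the usage note are likewise assumed. One deliberate difference from the workaround above: this sketch converts a month to months since 1970-01 (which is how the Iceberg table spec defines the month transform) rather than to the calendar month number.

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PartitionPathOrdinals {
  static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);
  static final DateTimeFormatter HOUR_FORMATTER = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH");

  // "yyyy-MM" -> months since 1970-01.
  static long monthOrdinal(String value) {
    return ChronoUnit.MONTHS.between(YearMonth.of(1970, 1), YearMonth.parse(value));
  }

  // "yyyy-MM-dd-HH" -> hours since 1970-01-01T00:00.
  static long hourOrdinal(String value) {
    return ChronoUnit.HOURS.between(EPOCH, LocalDateTime.parse(value, HOUR_FORMATTER));
  }

  // Rewrites "name=value" segments whose transform is "month" or "hour";
  // all other segments are passed through unchanged.
  static String toOrdinalPath(String partitionPath, Map<String, String> transformByField) {
    if (partitionPath.isEmpty() || transformByField.isEmpty()) {
      return partitionPath;
    }
    List<String> resolved = new ArrayList<>();
    for (String segment : partitionPath.split("/")) {
      String[] nameAndValue = segment.split("=", 2);
      if (nameAndValue.length != 2) {
        throw new IllegalArgumentException("Expected name=value but got: " + segment);
      }
      String name = nameAndValue[0];
      String value = nameAndValue[1];
      String transform = transformByField.get(name);
      if ("month".equals(transform)) {
        value = String.valueOf(monthOrdinal(value));
      } else if ("hour".equals(transform)) {
        value = String.valueOf(hourOrdinal(value));
      }
      resolved.add(name + "=" + value);
    }
    return String.join("/", resolved);
  }

  public static void main(String[] args) {
    String path = "month_month=2024-10/hour_hour=2024-10-08-13";
    Map<String, String> transforms = Map.of("month_month", "month", "hour_hour", "hour");
    System.out.println(toOrdinalPath(path, transforms));
  }
}
```

For the timestamp in the reproduction (`2024-10-08T13:18:20.053`), the month ordinal is (2024 - 1970) * 12 + 9 = 657, so the sketch maps `month_month=2024-10` to `month_month=657`.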
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [X] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]