a-agmon commented on issue #338:
URL: https://github.com/apache/iceberg-rust/issues/338#issuecomment-2095268461
Thanks, @Fokko and @zeodtr, for the clarifications and explanations!
I think that it's important to fix this in the next release, as the current
situation is that the Rust API currently does not support Iceberg tables
generated using the > 1.5.0 Iceberg Java library, which seems to be quite a
limitation.
We agree that we must use `field-id` for field resolution, but also that
this might require a lot of work because the Avro Rust library at the moment
uses `field-name` for serde.
I have tried to implement a resolution that incorporates both requirements:
it uses `field-id` when reading the Avro manifest files, but also does not make
significant changes in the code flow and the usage of the Avro lib.
It's not the most elegant solution, but seems to resolve the issue along
these lines, at least as I tested.
The solution consists of 3 stages:
1. reading the Manifest schema to a map of field-id -> field-name
2. reading the actual avro file and using its schema, and map it to map of
field-id -> field-value
3. iterate the schema, and compose the avro records by using field-id to get
the value read from the file.
Here is just the main flow in the `parse_with_version()` function
```rust
FormatVersion::V2 => {
// 1. get a hashmap that maps the field_id to the field_name
in the Manifest's schema
let manifest_file_schema =
Self::get_record_schema(MANIFEST_LIST_AVRO_SCHEMA_V2.clone())?;
let manifest_file_schema_fields =
Self::get_manifest_schema_fields_map(manifest_file_schema, true)?;
// 2. get a hashmap that maps field_name to field_id in the
schema of the read avro file
let reader = Reader::new(bs)?;
let file_schema =
Self::get_record_schema(reader.writer_schema().clone())?;
let file_schema_fields: HashMap<String, String> =
Self::get_manifest_schema_fields_map(file_schema,
false)?;
// 3. get a vec of records from the read avro file .
// each record is a hashmap of field_id and field_value
let file_records =
reader.collect::<std::result::Result<Vec<Value>, _>>()?;
let file_records_values_map =
Self::get_avro_records_as_map(file_records,
file_schema_fields)?;
// 4. for each record (manifest file) in the Avro file
records maps,
// traverse the schema of the manifest file: for each field
id in the schema, get the field value from the record
let manifest_records: Vec<Value> = file_records_values_map
.into_iter()
.map(|file_record_fields| {
let fields_values: Vec<_> =
manifest_file_schema_fields
.iter()
.filter_map(|(schem_field_id, schem_field_name)|
{
file_record_fields
.get(schem_field_id)
.map(|value| (schem_field_name.clone(),
value.clone()))
})
.collect();
Value::Record(fields_values)
})
.collect();
let values = Value::Array(manifest_records);
let manifest =
from_value::<_serde::ManifestListV2>(&values)?;
manifest.try_into(partition_type_provider)
}
```
Please let me know what you think,
(Its just a suggestion, ofc, its also possible to re-write this without the
Avro lib)
I'm also posting this in Slack for visibility as I think its sufficiently
important.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]