Re: [I] Compatibility issues with `org.apache.iceberg:iceberg-spark-runtime-3.5_2.13:1.5.0` [iceberg-rust]

via GitHub Sun, 05 May 2024 23:20:06 -0700


a-agmon commented on issue #338:
URL: https://github.com/apache/iceberg-rust/issues/338#issuecomment-2095268461


   Thanks, @Fokko and @zeodtr, for the clarifications and explanations! 
   I think that it's important to fix this in the next release, as the current 
situation is that the Rust API currently does not support Iceberg tables 
generated using the > 1.5.0 Iceberg Java library, which seems to be quite a 
limitation. 
   We agree that we must use `field-id` for field resolution, but also that 
this might require a lot of work because the Avro Rust library at the moment 
uses `field-name` for serde. 
   I have tried to implement a resolution that incorporates both requirements: 
it uses `field-id` when reading the Avro manifest files, but also does not make 
significant changes in the code flow and the usage of the Avro lib. 
   
   It's not the most elegant solution, but seems to resolve the issue along 
these lines, at least as I tested. 
   The solution consists of 3 stages: 
   1. reading the Manifest schema to a map of field-id -> field-name 
   2. reading the actual avro file and using its schema, and map it to map of 
field-id -> field-value
   3. iterate the schema, and compose the avro records by using field-id to get 
the value read from the file. 
   Here is just the main flow in the `parse_with_version()` function
   
   ```rust
               FormatVersion::V2 => {
                   // 1. get a hashmap that maps the field_id to the field_name 
in the Manifest's schema
                   let manifest_file_schema =
                       
Self::get_record_schema(MANIFEST_LIST_AVRO_SCHEMA_V2.clone())?;
                   let manifest_file_schema_fields =
                       
Self::get_manifest_schema_fields_map(manifest_file_schema, true)?;
   
                   // 2. get a hashmap that maps field_name to field_id in the 
schema of the read avro file
                   let reader = Reader::new(bs)?;
                   let file_schema = 
Self::get_record_schema(reader.writer_schema().clone())?;
                   let file_schema_fields: HashMap<String, String> =
                       Self::get_manifest_schema_fields_map(file_schema, 
false)?;
   
                   // 3. get a vec of records from the read avro file .
                   // each record is a hashmap of field_id and field_value
                   let file_records = 
reader.collect::<std::result::Result<Vec<Value>, _>>()?;
                   let file_records_values_map =
                       Self::get_avro_records_as_map(file_records, 
file_schema_fields)?;
   
                   // 4. for each record (manifest file) in the Avro file 
records maps,
                   // traverse the schema of the manifest file: for each field 
id in the schema, get the field value from the record
                   let manifest_records: Vec<Value> = file_records_values_map
                       .into_iter()
                       .map(|file_record_fields| {
                           let fields_values: Vec<_> = 
manifest_file_schema_fields
                               .iter()
                               .filter_map(|(schem_field_id, schem_field_name)| 
{
                                   file_record_fields
                                       .get(schem_field_id)
                                       .map(|value| (schem_field_name.clone(), 
value.clone()))
                               })
                               .collect();
                           Value::Record(fields_values)
                       })
                       .collect();
   
                   let values = Value::Array(manifest_records);
                   let manifest = 
from_value::<_serde::ManifestListV2>(&values)?;
                   manifest.try_into(partition_type_provider)
               }
   ```
   
   Please let me know what you think,
   (Its just a suggestion, ofc, its also possible to re-write this without the 
Avro lib)
   
   I'm also posting this in Slack for visibility as I think its sufficiently 
important.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Compatibility issues with `org.apache.iceberg:iceberg-spark-runtime-3.5_2.13:1.5.0` [iceberg-rust]

Reply via email to