mightyshazam commented on PR #4967:
URL: https://github.com/apache/arrow-rs/pull/4967#issuecomment-1773463012

   The example is a log file for a Delta table. The log entries are JSON, and 
each contains a stats field that lists the min/max values for each column in 
the schema. delta-rs writes the contents of binary columns as an escaped 
string using the method below:
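   ```
   // `v` yields the raw bytes (`u8`) of the binary value.
   let escaped_bytes = v
       .into_iter()
       .flat_map(std::ascii::escape_default)
       .collect::<Vec<u8>>();
   // The escaped output is pure ASCII, so it is also valid UTF-8.
   let escaped_string = String::from_utf8(escaped_bytes).unwrap();
   ```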
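   For reference, here is a self-contained sketch of the same escaping, wrapped 
in a helper so it can be run in isolation. The name `escape_bytes` is only 
illustrative and is not taken from delta-rs.

```rust
/// Escape arbitrary bytes into a printable ASCII string, using the same
/// `std::ascii::escape_default` approach as the snippet above.
/// (`escape_bytes` is a hypothetical helper name, not a delta-rs function.)
fn escape_bytes(v: &[u8]) -> String {
    let escaped: Vec<u8> = v
        .iter()
        .copied()
        // Each input byte expands to one or more ASCII bytes.
        .flat_map(std::ascii::escape_default)
        .collect();
    // `escape_default` only yields ASCII, so this conversion cannot fail.
    String::from_utf8(escaped).expect("escape_default output is always ASCII")
}
```

   Printable ASCII passes through unchanged, so `escape_bytes(b"00000004")` 
gives `00000004`, while non-printable bytes such as `[0x00, 0xff]` become the 
text `\x00\xff`.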
   When creating a checkpoint, it takes the accumulated JSON from the various 
log files, which looks something like this:
`{"add":{"path":"part-00001-8a7f1b6b-5869-4c37-8760-462ee9c97d49-c000.parquet","size":1522,"partitionValues":{},"modificationTime":1697658217285,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"tenant_model_id\":\"00000004\",\"content\":\"00000004\",\"job_id\":\"00000004\",\"platform_model_id\":\"00000004\"},\"maxValues\":{\"tenant_model_id\":\"00000004\",\"content\":\"00000004\",\"job_id\":\"00000004\",\"platform_model_id\":\"00000004\"},\"nullCount\":{\"job_id\":0,\"content\":0,\"tenant_model_id\":0,\"platform_model_id\":0}}"}}`
 where the `content` field is the binary field in this case.
   The checkpoint code takes a series of JSON entries like the one above and 
uses the `serialize` method of the decoder built by `ReaderBuilder` to turn 
them into a RecordBatch, since the checkpoints are in parquet format.
   The issue isn't that the schema is incorrect, but the usage may be. It 
might make sense to convert the delta schema to an arrow schema with utf-8 
columns for the binary fields, because that is what is ultimately written. 
However, the code as is just converts directly from a delta schema to an arrow 
schema and builds the decoder. Due to the lack of binary column support, the 
method returns an error instead of a decoder.
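   To make the alternative concrete, here is a rough sketch of mapping binary 
columns to utf-8 before building the decoder. It uses a simplified, 
hypothetical `ColumnType` enum and `stats_schema` function as stand-ins, not 
the real delta-rs or arrow types.

```rust
/// Hypothetical, simplified column types standing in for the real
/// Delta/Arrow data types.
#[derive(Debug, Clone, PartialEq)]
enum ColumnType {
    Binary,
    Utf8,
    Int64,
}

/// Build the schema used for decoding stats: binary columns become utf-8,
/// because their min/max values are stored as escaped strings.
/// (`stats_schema` is an illustrative name, not an existing function.)
fn stats_schema(schema: &[(String, ColumnType)]) -> Vec<(String, ColumnType)> {
    schema
        .iter()
        .map(|(name, ty)| {
            let ty = match ty {
                ColumnType::Binary => ColumnType::Utf8,
                other => other.clone(),
            };
            (name.clone(), ty)
        })
        .collect()
}
```

   With the real types, the same idea would amount to swapping the binary data 
type for the utf-8 one on each affected field before handing the schema to the 
reader builder.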
   

