mightyshazam commented on PR #4967:
URL: https://github.com/apache/arrow-rs/pull/4967#issuecomment-1773463012
The example is a log file for a Delta table. The log entries are JSON and contain a stats field that lists the min/max for each column in the schema. delta-rs writes the contents of binary columns as a serialized string using the method below:
```rust
// `v` holds the raw bytes of the binary column value.
let escaped_bytes = v
    .into_iter()
    .flat_map(std::ascii::escape_default)
    .collect::<Vec<u8>>();
let escaped_string = String::from_utf8(escaped_bytes).unwrap();
```
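For what it's worth, `std::ascii::escape_default` passes printable ASCII through unchanged and turns every other byte into a backslash escape, so the result is always valid UTF-8. A tiny illustration (the input bytes here are made up):
```rust
fn main() {
    // Made-up binary value: a printable byte, a non-ASCII byte, and a quote.
    let v: Vec<u8> = vec![b'a', 0xFF, b'"'];

    let escaped_bytes = v
        .into_iter()
        .flat_map(std::ascii::escape_default)
        .collect::<Vec<u8>>();

    // Printable ASCII passes through; everything else becomes an escape.
    assert_eq!(String::from_utf8(escaped_bytes).unwrap(), r#"a\xff\""#);
}
```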
When creating a checkpoint, delta-rs takes the accumulated JSON from the various log files, which looks something like this:
`{"add":{"path":"part-00001-8a7f1b6b-5869-4c37-8760-462ee9c97d49-c000.parquet","size":1522,"partitionValues":{},"modificationTime":1697658217285,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"tenant_model_id\":\"00000004\",\"content\":\"00000004\",\"job_id\":\"00000004\",\"platform_model_id\":\"00000004\"},\"maxValues\":{\"tenant_model_id\":\"00000004\",\"content\":\"00000004\",\"job_id\":\"00000004\",\"platform_model_id\":\"00000004\"},\"nullCount\":{\"job_id\":0,\"content\":0,\"tenant_model_id\":0,\"platform_model_id\":0}}"}}`
where the `content` field is the binary field in this case.
The checkpoint code takes a series of JSON entries like the one above and uses the `serialize` method on the decoder built from `ReaderBuilder` to turn them into a `RecordBatch`, since the checkpoints are written as Parquet (sketched below).
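As a minimal sketch of that step (the field names and the Utf8-only schema are my assumptions for illustration, not delta-rs code), the arrow-json path looks roughly like:
```rust
use std::sync::Arc;

use arrow_array::RecordBatch;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};
use serde_json::json;

fn stats_to_batch() -> Result<RecordBatch, Box<dyn std::error::Error>> {
    // Hypothetical subset of the checkpoint schema; the real code derives it
    // from the Delta schema.
    let schema = Arc::new(Schema::new(vec![
        Field::new("job_id", DataType::Utf8, true),
        Field::new("content", DataType::Utf8, true),
    ]));

    // One row per log entry, already parsed as serde_json::Value.
    let rows = vec![json!({"job_id": "00000004", "content": "00000004"})];

    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
    decoder.serialize(&rows)?;
    let batch = decoder.flush()?.expect("at least one row was buffered");
    Ok(batch)
}
```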
The issue isn't that the schema is incorrect, but the usage may be. It might make sense to convert the Delta schema to an Arrow schema with Utf8 columns, because that is what is ultimately written. However, the code as it stands converts directly from the Delta schema to an Arrow schema and builds the decoder from that. Due to the lack of binary column support, the method returns an error instead of a decoder.
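A minimal repro of that error path as I understand it, assuming the binary column comes through the schema conversion as `DataType::Binary`:
```rust
use std::sync::Arc;

use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Schema converted straight from the Delta schema, so the binary column
    // keeps a Binary type instead of being mapped to Utf8.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "content",
        DataType::Binary,
        true,
    )]));

    // Without binary support in the JSON decoder, this is where the error
    // shows up: we get an Err instead of a decoder.
    match ReaderBuilder::new(schema).build_decoder() {
        Ok(_) => println!("decoder built"),
        Err(e) => println!("failed to build decoder: {e}"),
    }
}
```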