cshuo commented on PR #18958:
URL: https://github.com/apache/hudi/pull/18958#issuecomment-4725262176
Another possible direction is to expose BLOB fields as `BYTES` / `BINARY` in
Flink DDL, and use table options to tell the connector which columns are BLOB
fields and whether they should be materialized during reads. For example:
```sql
CREATE TABLE media_assets (
  asset_id STRING,
  blob_content BYTES,
  thumbnail BYTES,
  ts BIGINT,
  PRIMARY KEY (asset_id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 's3://bucket/media_assets',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.blob.fields' = 'blob_content,thumbnail',
  'hoodie.blob.read.materialize' = 'true'
);
```
When `hoodie.blob.read.materialize` is `false`, the `BYTES` value would hold
the serialized BLOB descriptor. When it is `true`, the connector would
materialize the BLOB and return the actual data bytes.
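As a rough sketch of how the two modes might be used side by side, the
table-level setting could be overridden per query with Flink's dynamic table
options hint. This assumes the proposed `hoodie.blob.read.materialize` option
exists and is honored when passed as a hint; both are part of this proposal
rather than confirmed behavior. Depending on the Flink version, hints may also
need `table.dynamic-table-options.enabled` set to `true`.

```sql
-- Descriptor mode (assumed behavior): blob_content carries only the
-- descriptor bytes, so no BLOB data is fetched.
SELECT asset_id, blob_content
FROM media_assets /*+ OPTIONS('hoodie.blob.read.materialize' = 'false') */;

-- Materialized mode (assumed behavior): the connector resolves the BLOB
-- and returns the actual data bytes in the same BYTES column.
SELECT asset_id, thumbnail
FROM media_assets /*+ OPTIONS('hoodie.blob.read.materialize' = 'true') */;
```

In both queries the column type stays `BYTES`; only the meaning of the bytes
changes, so downstream consumers would need to know which mode produced them.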
This would let the connector own the read path instead of relying on a
scalar UDF. The source/operator could then identify the configured BLOB fields
and batch BLOB reads for optimization, rather than being constrained by per-row
scalar UDF evaluation.