Re: [PR] [DOCS] Restructure 1.2.0 unstructured-data pages [hudi]

via GitHub Thu, 28 May 2026 13:25:07 -0700


hudi-agent commented on code in PR #18876:
URL: https://github.com/apache/hudi/pull/18876#discussion_r3320420911



##########
website/docs/sql_dml.md:
##########
@@ -48,6 +48,29 @@ INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', 
hh='11') SELECT 2, 'a2'
 INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name, 
1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
 ```
 
+#### Inserting VECTOR, BLOB, and VARIANT columns
+
+```sql
+-- VECTOR: pass an ARRAY of floats with the declared dimension
+INSERT INTO products SELECT 'prod_001', 'Shoes', ARRAY(0.12, -0.03, /* ... 768 
floats ... */);
+
+-- BLOB (INLINE): construct the internal struct with named_struct
+INSERT INTO media_assets VALUES (
+  'asset_001', 'logo.png', 'image/png', 45230,
+  named_struct(
+    'type',      'INLINE',
+    'data',      <binary>,
+    'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: BIGINT, 
length: BIGINT, managed: BOOLEAN>)
+  )

Review Comment:
   🤖 Consider also showing an `OUT_OF_LINE` BLOB INSERT here. The rest of the 
docs (and writing_data.md) call out the OOL mode as the recommended path for 
large objects, so omitting it from the only INSERT example may push users 
toward inline by default. Even a one-liner showing the `reference` struct 
populated would round this section out.
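   
   As a starting point, here is a rough sketch of such an insert. It reuses the 
   `type`/`data`/`reference` field layout from the inline example in this hunk. 
   The `'OUT_OF_LINE'` type literal, the row values, the storage path, and the 
   offset/length numbers are illustrative assumptions and have not been checked 
   against the shipped writer.
   
   ```sql
   -- BLOB (OUT_OF_LINE): leave `data` NULL and populate `reference` instead.
   -- Assumption: 'OUT_OF_LINE' is the literal the `type` field expects.
   -- Assumption: the path, offset, and length below are made-up sample values.
   INSERT INTO media_assets VALUES (
     'asset_002', 'intro.mp4', 'video/mp4', 73400320,
     named_struct(
       'type',      'OUT_OF_LINE',
       'data',      CAST(NULL AS BINARY),
       'reference', named_struct(
         'external_path', 's3://example-bucket/media/container_0001.bin',
         'offset',        CAST(0 AS BIGINT),
         'length',        CAST(73400320 AS BIGINT),
         'managed',       false
       )
     )
   );
   ```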
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
website/docs/sql_dml.md:
##########
@@ -48,6 +48,29 @@ INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', 
hh='11') SELECT 2, 'a2'
 INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name, 
1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
 ```
 
+#### Inserting VECTOR, BLOB, and VARIANT columns
+
+```sql
+-- VECTOR: pass an ARRAY of floats with the declared dimension
+INSERT INTO products SELECT 'prod_001', 'Shoes', ARRAY(0.12, -0.03, /* ... 768 
floats ... */);
+
+-- BLOB (INLINE): construct the internal struct with named_struct
+INSERT INTO media_assets VALUES (
+  'asset_001', 'logo.png', 'image/png', 45230,
+  named_struct(

Review Comment:
   🤖 The `<binary>` token here isn't valid SQL, so a user copy-pasting this 
snippet will hit a parse error before they ever see the named_struct shape. It 
might help to substitute a runnable placeholder such as `CAST('hello world' AS 
BINARY)` (and add a comment noting that in practice you'd pass real bytes from 
the DataFrame API), so the example doubles as a smoke test.
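   
   Concretely, the inline example could read roughly as follows. This is only 
   the snippet from the hunk with `<binary>` swapped for the suggested cast; the 
   other row values are kept as they appear in the PR, and the closing `);` is 
   assumed.
   
   ```sql
   -- BLOB (INLINE): construct the internal struct with named_struct.
   -- CAST('hello world' AS BINARY) is a runnable stand-in; in practice the
   -- bytes would typically come from the DataFrame API.
   INSERT INTO media_assets VALUES (
     'asset_001', 'logo.png', 'image/png', 45230,
     named_struct(
       'type',      'INLINE',
       'data',      CAST('hello world' AS BINARY),
       'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: BIGINT, length: BIGINT, managed: BOOLEAN>)
     )
   );
   ```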
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
website/docs/reading_tables_batch_reads.md:
##########
@@ -45,6 +45,19 @@ SELECT * FROM hudi_table WHERE age > 25;
 
 For more Flink read options, see [Using Flink](ingestion_flink.md).
 
+## VECTOR and BLOB Columns
+
+Hudi exposes two Spark SQL extensions for reading the 1.2.0 types:
+
+- `hudi_vector_search(table, vector_col, query_vector, top_k[, metric])` — 
top-K similarity search
+  over a `VECTOR` column. See [Vector Search](vector_search.md).
+- `read_blob(blob_col)` — materializes raw bytes from a `BLOB` column. Under 
the default
+  `hoodie.read.blob.inline.mode=DESCRIPTOR`, calling `read_blob()` on an 
`INLINE` column throws —
+  set the mode to `CONTENT` to read inline bytes. See [Unstructured 
Data](blob_unstructured_data.md).

Review Comment:
   🤖 Worth confirming the default and the error semantics here. The note says 
that under the default `hoodie.read.blob.inline.mode=DESCRIPTOR`, calling 
`read_blob()` on an `INLINE` column throws. That is a surprising default for a 
function whose stated job is to materialize bytes, and users who hit it from 
the SQL pages may not understand why. A one-line rationale (e.g., "to avoid 
accidentally pulling megabytes per row") would help. @yihua, could you 
double-check that the config name and default value match what shipped?
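   
   If the behavior is confirmed, a short example next to the bullet might make 
   the mode switch obvious. The sketch below assumes the config name and values 
   quoted in the hunk are correct and that the option can be set per session. 
   The `asset_id` and `content` column names are hypothetical.
   
   ```sql
   -- Assumption: this option is settable with SET in a Spark SQL session.
   SET hoodie.read.blob.inline.mode=CONTENT;

   -- Filter first so bytes are materialized only for the rows that survive.
   -- `asset_id` and `content` are illustrative column names.
   SELECT asset_id, read_blob(content) AS bytes
   FROM media_assets
   WHERE asset_id = 'asset_001';
   ```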
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
website/docs/lance_file_format.md:
##########
@@ -91,35 +90,18 @@ export 
LANCE_BUNDLE_JAR=/path/to/lance-spark-bundle-3.5_2.12-0.4.0.jar
 spark-shell --jars $HUDI_BUNDLE_JAR,$LANCE_BUNDLE_JAR
 ```
 
-## How Hudi + Lance Work Together
+## Layering
 
-Hudi manages the table layer — transactions, schema, timeline, table services 
— while Lance handles the
-file-level storage:
+Hudi manages the table layer (timeline, metadata, schema, file groups, table 
services). Lance is the
+on-disk file format for base files. Log files for MOR tables remain Avro.
 
-```
-┌───────────────────────────────────┐
-│         Hudi Table Layer          │
-│  Timeline, Metadata, Indexing     │
-│  Transactions, Schema Evolution   │
-├───────────────────────────────────┤
-│     File Group / File Slice       │
-│  (same Hudi concepts as Parquet)  │
-├───────────────────────────────────┤
-│     Lance Data Files (.lance)     │
-│  Columnar storage                 │
-│  Fragment-based layout            │
-├───────────────────────────────────┤
-│   Storage (S3, GCS, HDFS, FS)    │
-└───────────────────────────────────┘
-```
-
-All Hudi table services work with Lance-backed tables:
+Table-service behavior on Lance-backed tables:
 
-- **Compaction** — merges log files into Lance base files
-- **Clustering** — reorganizes Lance files for better data locality
-- **Cleaning** — removes old Lance file versions
-- **Metadata indexing** — bloom filters work across Lance files; column stats 
and partition stats are
-  **automatically disabled** for Lance tables
+- **Compaction** — merges Avro log files into Lance base files.
+- **Clustering** — reorganizes records into new Lance files.
+- **Cleaning** — removes obsolete Lance file slices.
+- **Metadata indexing** — bloom filter indexing is supported. Column-stats and 
partition-stats
+  indices are automatically disabled for Lance base files.

Review Comment:
   🤖 The new Layering section is clearer than the ASCII diagram it replaces, 
but the claim "Log files for MOR tables remain Avro" is worth a quick sanity 
check against the actual Lance writer — if Lance-format log files are on the 
roadmap or already partially supported, you may want to soften this to 
"currently remain Avro" so the doc doesn't go stale on the next release.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
website/docs/writing_data.md:
##########
@@ -467,6 +467,22 @@ The following advanced storage configuration options were 
added in Hudi 1.2.0:
 | Config | Default | Description |
 |---|---|---|
 | `hoodie.parquet.write.config.injector.class` | (none) | Fully-qualified 
class name of a custom `HoodieParquetConfigInjector` implementation. Use this 
to inject custom Parquet writer properties (e.g., disable dictionary encoding, 
set bloom filter sizes) without modifying the Hudi source. The implementing 
class must implement `org.apache.hudi.io.HoodieParquetConfigInjector`. |
+| `hoodie.table.base.file.format` | `parquet` | Base file format for the 
table. Accepts `parquet`, `orc`, `hfile`, or `lance`. See [Lance File 
Format](lance_file_format.md) for the Lance-specific options. |
+
+### Writing VECTOR, BLOB, and VARIANT Columns
+
+Hudi 1.2.0 introduces three new column types that participate in writes the 
same way as standard
+SQL types:
+
+- **`VECTOR(dim[, FLOAT|DOUBLE|INT8])`** — written as an array of floats 
matching the declared
+  dimension. With the DataFrame API, stamp `hudi_type=VECTOR(dim)` metadata on 
the column.

Review Comment:
   🤖 Small clarification: the VECTOR bullet mentions stamping 
`hudi_type=VECTOR(dim)` metadata on the column for the DataFrame API. Is the 
key literally `hudi_type` and the value literally `VECTOR(<dim>)` as a string? 
A one-line code snippet (or link to the exact DataFrame helper in 
vector_search.md) would remove ambiguity for users writing Scala/PySpark code 
without round-tripping through SQL.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
website/docs/blob_unstructured_data.md:
##########
@@ -330,21 +269,12 @@ catalog's native struct type:
 The raw binary payload is preserved in the struct representation, but 
`read_blob()` is a Spark SQL
 function and is not available in Hive or BigQuery directly.
 
-## Best Practices
-
-1. **Choose the right mode** — Use inline for small, frequently-accessed 
objects. Use out-of-line for
-   anything over 1 MB.
-
-2. **Filter before resolving** — Always apply WHERE predicates before calling 
`read_blob()` to avoid
-   unnecessary data transfer.
-
-3. **Batch container files** — When using out-of-line mode, pack multiple 
objects into container files
-   rather than storing one file per object.
-
-4. **Combine with VECTOR** — Pair BLOB columns with VECTOR columns for 
powerful "search then retrieve"
-   workflows: vector search narrows candidates, then `read_blob()` fetches 
just the winners.
+## Notes
 
-5. **Use incremental queries** — Process only new BLOBs by leveraging Hudi's 
incremental query support:
-   ```sql
-   SELECT * FROM hudi_table_changes('media_assets', 'latest_state', 
'20260401000000');
-   ```
+- `read_blob()` is a Spark SQL function. It is not available from Hive, 
BigQuery, or other engines
+  reading the underlying struct directly.
+- For OUT_OF_LINE blobs, multiple rows can reference different `(offset, 
length)` ranges within the
+  same `external_path`. Hudi reads the configured byte ranges; it does not own 
the lifecycle of the

Review Comment:
   🤖 Minor: "see the struct definition above" is a soft pointer — readers 
landing here from a cross-reference (e.g., the new sql_dml/writing_data 
sections) may not have the earlier struct definition in mind. An anchor link to 
that section would make this self-locating.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
