hudi-agent commented on code in PR #18876:
URL: https://github.com/apache/hudi/pull/18876#discussion_r3320420911
##########
website/docs/sql_dml.md:
##########
@@ -48,6 +48,29 @@ INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09',
hh='11') SELECT 2, 'a2'
INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name,
1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
```
+#### Inserting VECTOR, BLOB, and VARIANT columns
+
+```sql
+-- VECTOR: pass an ARRAY of floats with the declared dimension
+INSERT INTO products SELECT 'prod_001', 'Shoes', ARRAY(0.12, -0.03, /* ... 768
floats ... */);
+
+-- BLOB (INLINE): construct the internal struct with named_struct
+INSERT INTO media_assets VALUES (
+ 'asset_001', 'logo.png', 'image/png', 45230,
+ named_struct(
+ 'type', 'INLINE',
+ 'data', <binary>,
+ 'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: BIGINT,
length: BIGINT, managed: BOOLEAN>)
+ )
Review Comment:
🤖 Consider also showing an `OUT_OF_LINE` BLOB INSERT here. The rest of the
docs (and writing_data.md) call out `OUT_OF_LINE` as the recommended mode for
large objects, so omitting it from the only INSERT example may push users
toward inline by default. Even a one-liner showing the `reference` struct
populated would round this section out.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
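A rough sketch of what such an example might look like, reusing the struct
shape from the hunk above. The row values, the S3 path, the byte range, and
the choice of `managed = false` are illustrative assumptions, not taken from
the PR; the `'OUT_OF_LINE'` type string is inferred from the mode name.

```sql
-- BLOB (OUT_OF_LINE): leave 'data' NULL and populate the 'reference' struct.
-- Path, offset, and length are hypothetical placeholders.
INSERT INTO media_assets VALUES (
  'asset_002', 'intro.mp4', 'video/mp4', 10485760,
  named_struct(
    'type', 'OUT_OF_LINE',
    'data', CAST(NULL AS BINARY),
    'reference', named_struct(
      'external_path', 's3://example-bucket/media/container_0001.bin',
      'offset', CAST(0 AS BIGINT),
      'length', CAST(10485760 AS BIGINT),
      'managed', false  -- assumption: the file is owned outside Hudi
    )
  )
);
```

The explicit `CAST(... AS BIGINT)` calls keep the field types aligned with the
`STRUCT<external_path: STRING, offset: BIGINT, length: BIGINT, managed:
BOOLEAN>` declaration shown in the INLINE example.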
##########
website/docs/sql_dml.md:
##########
@@ -48,6 +48,29 @@ INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09',
hh='11') SELECT 2, 'a2'
INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name,
1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
```
+#### Inserting VECTOR, BLOB, and VARIANT columns
+
+```sql
+-- VECTOR: pass an ARRAY of floats with the declared dimension
+INSERT INTO products SELECT 'prod_001', 'Shoes', ARRAY(0.12, -0.03, /* ... 768
floats ... */);
+
+-- BLOB (INLINE): construct the internal struct with named_struct
+INSERT INTO media_assets VALUES (
+ 'asset_001', 'logo.png', 'image/png', 45230,
+ named_struct(
Review Comment:
🤖 The `<binary>` token here isn't valid SQL, so a user copy-pasting this
snippet will hit a parse error before they ever see the named_struct shape. It
might help to substitute a runnable placeholder such as `CAST('hello world' AS
BINARY)` (and add a comment noting that in practice you'd pass real bytes from
the DataFrame API), so the example doubles as a smoke test.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
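For reference, the INLINE statement with that substitution applied might read
as follows. This is a sketch that assumes the column order shown in the hunk;
the payload is a toy string rather than real image bytes, so the size column
is set to its byte length.

```sql
-- BLOB (INLINE): toy payload so the statement parses as written.
-- In practice the bytes would come from the DataFrame API.
INSERT INTO media_assets VALUES (
  'asset_001', 'logo.png', 'image/png', 11,
  named_struct(
    'type', 'INLINE',
    'data', CAST('hello world' AS BINARY),
    'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: BIGINT,
                                     length: BIGINT, managed: BOOLEAN>)
  )
);
```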
##########
website/docs/reading_tables_batch_reads.md:
##########
@@ -45,6 +45,19 @@ SELECT * FROM hudi_table WHERE age > 25;
For more Flink read options, see [Using Flink](ingestion_flink.md).
+## VECTOR and BLOB Columns
+
+Hudi exposes two Spark SQL extensions for reading the 1.2.0 types:
+
+- `hudi_vector_search(table, vector_col, query_vector, top_k[, metric])` —
top-K similarity search
+ over a `VECTOR` column. See [Vector Search](vector_search.md).
+- `read_blob(blob_col)` — materializes raw bytes from a `BLOB` column. Under
the default
+ `hoodie.read.blob.inline.mode=DESCRIPTOR`, calling `read_blob()` on an
`INLINE` column throws —
+ set the mode to `CONTENT` to read inline bytes. See [Unstructured
Data](blob_unstructured_data.md).
Review Comment:
🤖 Worth confirming the default and the error semantics here. The note says
that under the default `hoodie.read.blob.inline.mode=DESCRIPTOR`, calling
`read_blob()` on an `INLINE` column throws. That is a surprising default for a
function whose stated job is to materialize bytes, and users hitting it from
the SQL pages may not understand why. A one-line rationale (e.g., "to avoid
accidentally pulling MBs per row") would help. @yihua, could you double-check
that the config name and default value match what shipped?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
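If the behavior is as described, a short snippet next to the note could show
the switch explicitly. The sketch below assumes the config key and the
`CONTENT` value quoted in the hunk are what shipped; `id` and `content` are
hypothetical column names for the `media_assets` table.

```sql
-- Assumes the config key and values quoted in the diff match the release.
SET hoodie.read.blob.inline.mode=CONTENT;

-- Filter first so bytes are materialized only for the rows that are needed.
-- `id` and `content` are hypothetical column names.
SELECT id, read_blob(content) AS bytes
FROM media_assets
WHERE id = 'asset_001';
```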
##########
website/docs/lance_file_format.md:
##########
@@ -91,35 +90,18 @@ export
LANCE_BUNDLE_JAR=/path/to/lance-spark-bundle-3.5_2.12-0.4.0.jar
spark-shell --jars $HUDI_BUNDLE_JAR,$LANCE_BUNDLE_JAR
```
-## How Hudi + Lance Work Together
+## Layering
-Hudi manages the table layer — transactions, schema, timeline, table services
— while Lance handles the
-file-level storage:
+Hudi manages the table layer (timeline, metadata, schema, file groups, table
services). Lance is the
+on-disk file format for base files. Log files for MOR tables remain Avro.
-```
-┌───────────────────────────────────┐
-│ Hudi Table Layer │
-│ Timeline, Metadata, Indexing │
-│ Transactions, Schema Evolution │
-├───────────────────────────────────┤
-│ File Group / File Slice │
-│ (same Hudi concepts as Parquet) │
-├───────────────────────────────────┤
-│ Lance Data Files (.lance) │
-│ Columnar storage │
-│ Fragment-based layout │
-├───────────────────────────────────┤
-│ Storage (S3, GCS, HDFS, FS) │
-└───────────────────────────────────┘
-```
-
-All Hudi table services work with Lance-backed tables:
+Table-service behavior on Lance-backed tables:
-- **Compaction** — merges log files into Lance base files
-- **Clustering** — reorganizes Lance files for better data locality
-- **Cleaning** — removes old Lance file versions
-- **Metadata indexing** — bloom filters work across Lance files; column stats
and partition stats are
- **automatically disabled** for Lance tables
+- **Compaction** — merges Avro log files into Lance base files.
+- **Clustering** — reorganizes records into new Lance files.
+- **Cleaning** — removes obsolete Lance file slices.
+- **Metadata indexing** — bloom filter indexing is supported. Column-stats and
partition-stats
+ indices are automatically disabled for Lance base files.
Review Comment:
🤖 The new Layering section is clearer than the ASCII diagram it replaces,
but the claim "Log files for MOR tables remain Avro" is worth a quick sanity
check against the actual Lance writer — if Lance-format log files are on the
roadmap or already partially supported, you may want to soften this to
"currently remain Avro" so the doc doesn't go stale on the next release.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
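A small DDL example might also make the layering concrete. The sketch below
assumes the lowercase `lance` value listed for `hoodie.table.base.file.format`
in the writing_data.md hunk is accepted as written and that the Lance bundle
jar is on the classpath; the table and column names are hypothetical.

```sql
-- Sketch only: table and column names are hypothetical.
CREATE TABLE lance_events (
  id STRING,
  payload STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'mor',                              -- per the doc text, log files stay Avro
  primaryKey = 'id',
  'hoodie.table.base.file.format' = 'lance'  -- base files written as Lance
);
```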
##########
website/docs/writing_data.md:
##########
@@ -467,6 +467,22 @@ The following advanced storage configuration options were
added in Hudi 1.2.0:
| Config | Default | Description |
|---|---|---|
| `hoodie.parquet.write.config.injector.class` | (none) | Fully-qualified
class name of a custom `HoodieParquetConfigInjector` implementation. Use this
to inject custom Parquet writer properties (e.g., disable dictionary encoding,
set bloom filter sizes) without modifying the Hudi source. The implementing
class must implement `org.apache.hudi.io.HoodieParquetConfigInjector`. |
+| `hoodie.table.base.file.format` | `parquet` | Base file format for the
table. Accepts `parquet`, `orc`, `hfile`, or `lance`. See [Lance File
Format](lance_file_format.md) for the Lance-specific options. |
+
+### Writing VECTOR, BLOB, and VARIANT Columns
+
+Hudi 1.2.0 introduces three new column types that participate in writes the
same way as standard
+SQL types:
+
+- **`VECTOR(dim[, FLOAT|DOUBLE|INT8])`** — written as an array of floats
matching the declared
+ dimension. With the DataFrame API, stamp `hudi_type=VECTOR(dim)` metadata on
the column.
Review Comment:
🤖 Small clarification: the VECTOR bullet mentions stamping
`hudi_type=VECTOR(dim)` metadata on the column for the DataFrame API. Is the
key literally `hudi_type` and the value literally `VECTOR(<dim>)` as a string?
A one-line code snippet (or link to the exact DataFrame helper in
vector_search.md) would remove ambiguity for users writing Scala/PySpark code
without round-tripping through SQL.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
website/docs/blob_unstructured_data.md:
##########
@@ -330,21 +269,12 @@ catalog's native struct type:
The raw binary payload is preserved in the struct representation, but
`read_blob()` is a Spark SQL
function and is not available in Hive or BigQuery directly.
-## Best Practices
-
-1. **Choose the right mode** — Use inline for small, frequently-accessed
objects. Use out-of-line for
- anything over 1 MB.
-
-2. **Filter before resolving** — Always apply WHERE predicates before calling
`read_blob()` to avoid
- unnecessary data transfer.
-
-3. **Batch container files** — When using out-of-line mode, pack multiple
objects into container files
- rather than storing one file per object.
-
-4. **Combine with VECTOR** — Pair BLOB columns with VECTOR columns for
powerful "search then retrieve"
- workflows: vector search narrows candidates, then `read_blob()` fetches
just the winners.
+## Notes
-5. **Use incremental queries** — Process only new BLOBs by leveraging Hudi's
incremental query support:
- ```sql
- SELECT * FROM hudi_table_changes('media_assets', 'latest_state',
'20260401000000');
- ```
+- `read_blob()` is a Spark SQL function. It is not available from Hive,
BigQuery, or other engines
+ reading the underlying struct directly.
+- For OUT_OF_LINE blobs, multiple rows can reference different `(offset,
length)` ranges within the
+ same `external_path`. Hudi reads the configured byte ranges; it does not own
the lifecycle of the
Review Comment:
🤖 Minor: "see the struct definition above" is a soft pointer — readers
landing here from a cross-reference (e.g., the new sql_dml/writing_data
sections) may not have the earlier struct definition in mind. An anchor link to
that section would make this self-locating.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>