yihua commented on code in PR #18599:
URL: https://github.com/apache/hudi/pull/18599#discussion_r3251099684


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceWriter.java:
##########
@@ -399,13 +460,59 @@ protected void updateRecordMetadata(InternalRow row,
     row.update(FILENAME_METADATA_FIELD.ordinal(), fileName);
   }
 
-  @AllArgsConstructor(staticName = "of")
-  private static class SparkArrowWriter implements ArrowWriter<InternalRow> {
+  /**
+   * Forwards rows to the lance-spark {@link LanceArrowWriter}. When the schema
+   * has no {@code VariantType} columns, rows are passed through directly. When it
+   * does, a single {@link VariantProjectedRow} instance is reused per row to
+   * delegate every accessor to the underlying input row except at variant
+   * ordinals, where it returns a pre-allocated {@code (metadata, value)} struct
+   * populated by {@link org.apache.spark.sql.hudi.SparkAdapter#createVariantValueWriter}.

Review Comment:
   Does this projection introduce overhead? Does the Lance library or writer provide its own projection or adaptation for the Variant type?
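
   For reference while weighing the overhead, here is a minimal sketch of the reuse-one-wrapper pattern the Javadoc describes. It is not the PR's code: `Row`, `ArrayRow`, `ProjectedRow`, and the `"metadata|value"` string encoding are hypothetical stand-ins for `InternalRow`, `VariantProjectedRow`, and the real variant writer. Under those assumptions, the per-row cost is one array lookup per accessor call plus refilling a pre-allocated struct at variant ordinals, with no per-row allocation of the wrapper.

```java
public class VariantProjectionSketch {

  /** Hypothetical stand-in for Spark's InternalRow; only one accessor is modeled. */
  public interface Row {
    Object get(int ordinal);
  }

  /** Array-backed input row. */
  public static final class ArrayRow implements Row {
    private final Object[] values;

    public ArrayRow(Object... values) {
      this.values = values;
    }

    @Override
    public Object get(int ordinal) {
      return values[ordinal];
    }
  }

  /** Reusable wrapper: delegates every ordinal except the variant ones. */
  public static final class ProjectedRow implements Row {
    // Non-null only at variant ordinals; each entry is a pre-allocated
    // (metadata, value) pair that is refilled on access.
    private final String[][] structs;
    private Row delegate;

    public ProjectedRow(int numFields, int[] variantOrdinals) {
      structs = new String[numFields][];
      for (int ordinal : variantOrdinals) {
        structs[ordinal] = new String[2];
      }
    }

    /** Points the single wrapper instance at the next input row. */
    public ProjectedRow wrap(Row input) {
      this.delegate = input;
      return this;
    }

    @Override
    public Object get(int ordinal) {
      String[] struct = structs[ordinal];
      if (struct == null) {
        return delegate.get(ordinal); // non-variant column: pass through
      }
      Object raw = delegate.get(ordinal);
      if (raw == null) {
        return null;
      }
      // Assumed encoding for this sketch only: "metadata|value".
      String text = (String) raw;
      int sep = text.indexOf('|');
      struct[0] = text.substring(0, sep);
      struct[1] = text.substring(sep + 1);
      return struct;
    }
  }

  public static void main(String[] args) {
    ProjectedRow projected = new ProjectedRow(2, new int[] {1});
    Row[] input = {new ArrayRow(1, "m1|v1"), new ArrayRow(2, "m2|v2")};
    for (Row row : input) {
      projected.wrap(row);
      String[] variant = (String[]) projected.get(1);
      System.out.println(projected.get(0) + " -> (" + variant[0] + ", " + variant[1] + ")");
    }
  }
}
```

   The sketch only shows the shape of the delegation. It says nothing about the real cost in `HoodieSparkLanceWriter`, which would need a benchmark against the pass-through path.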



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceWriter.java:
##########
@@ -293,6 +305,54 @@ private static Field rewriteBlobDataChild(Field blobStructField) {
     return new Field(blobStructField.getName(), blobStructField.getFieldType(), rebuilt);
   }
 
+  /**
+   * Single-pass walk that returns (a) the enriched schema with top-level
+   * {@code VariantType} fields replaced by Hudi's canonical
+   * {@code Struct[metadata: binary, value: binary]} (tagged {@code hudi_type=VARIANT}
+   * so {@code HoodieSparkSchemaConverters} promotes it back on read), and (b) the
+   * variant ordinals in ascending order. {@code LanceArrowUtils} has no VariantType
+   * case, so we hand it a plain struct. Top-level only - nested variants are not
+   * yet supported.
+   */
+  private static Pair<StructType, int[]> enrichForLanceVariant(StructType sparkSchema) {

Review Comment:
   Could `enrichForLanceVariant` be incorporated into `enrichSparkSchemaForLance`?
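
   To make the suggestion concrete, here is a rough sketch of the single-pass walk the Javadoc describes, written so that other per-field rewrites could live in the same loop. Everything in it is hypothetical: `Field`, `Enriched`, the string type names, and `enrich` stand in for `StructType`, `Pair<StructType, int[]>`, and the real enrichment methods, whose actual bodies are not shown in this diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemaEnrichSketch {

  /** Assumed textual form of the canonical variant struct, for this sketch only. */
  public static final String VARIANT_STRUCT = "struct<metadata:binary,value:binary>";

  /** Hypothetical schema field: a name, a type string, and an optional metadata tag. */
  public static final class Field {
    public final String name;
    public final String type;
    public final String tag;

    public Field(String name, String type, String tag) {
      this.name = name;
      this.type = type;
      this.tag = tag;
    }
  }

  /** Result of the walk: the rewritten fields plus the variant ordinals. */
  public static final class Enriched {
    public final List<Field> fields;
    public final int[] variantOrdinals;

    public Enriched(List<Field> fields, int[] variantOrdinals) {
      this.fields = fields;
      this.variantOrdinals = variantOrdinals;
    }
  }

  /** One pass over top-level fields; nested variants are not handled. */
  public static Enriched enrich(List<Field> schema) {
    List<Field> out = new ArrayList<>(schema.size());
    List<Integer> ordinals = new ArrayList<>();
    for (int i = 0; i < schema.size(); i++) {
      Field field = schema.get(i);
      if ("variant".equals(field.type)) {
        // Replace the variant with a tagged plain struct and record its ordinal.
        out.add(new Field(field.name, VARIANT_STRUCT, "hudi_type=VARIANT"));
        ordinals.add(i);
      } else {
        // Other per-field rewrites could be applied here in the same pass.
        out.add(field);
      }
    }
    int[] variantOrdinals = ordinals.stream().mapToInt(Integer::intValue).toArray();
    return new Enriched(out, variantOrdinals);
  }

  public static void main(String[] args) {
    Enriched enriched = enrich(Arrays.asList(
        new Field("id", "int", null),
        new Field("payload", "variant", null),
        new Field("extra", "variant", null)));
    System.out.println(Arrays.toString(enriched.variantOrdinals));
  }
}
```

   Because the loop visits ordinals in increasing order, the recorded variant ordinals come out ascending without a separate sort.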



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
