hudi-agent commented on code in PR #18726:
URL: https://github.com/apache/hudi/pull/18726#discussion_r3248083753
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##########
@@ -620,3 +624,54 @@ case class HoodiePostAnalysisRule(sparkSession: SparkSession) extends Rule[Logic
}
}
}
+
+/**
+ * Stamps a synthesized [[CatalogTable]] (table name, base path, schema) onto path-based
+ * Hudi reads whose underlying file index is incremental or CDC. Without it, lineage and
+ * governance tooling sees `LogicalRelation.catalogTable = None` and falls back to the
+ * relation's class name as the dataset identifier -- useless for tracking which table
+ * an incremental query came from.
+ *
+ * Scope is intentionally limited to incremental and CDC reads:
+ * - Catalog-registered reads already have `catalogTable` populated.
+ * - Path-based snapshot reads have a working file-path-based fallback in existing
+ * lineage tooling; changing their behavior is a separate decision.
+ */
+object HoodieIncrementalRelationIdentifier extends Rule[LogicalPlan] {
Review Comment:
🤖 nit: the object handles both incremental and CDC reads (per
`isIncrementalOrCDC`), but its name only mentions `Incremental`. Could you
rename it to something like `HoodieIncrementalAndCDCRelationIdentifier` so that
future readers searching for CDC behavior find it here?
`HoodiePathBasedRelationIdentifier` would be shorter, but it also suggests
path-based snapshot reads, which the Scaladoc explicitly leaves out of scope.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##########
@@ -620,3 +624,54 @@ case class HoodiePostAnalysisRule(sparkSession: SparkSession) extends Rule[Logic
}
}
}
+
+/**
+ * Stamps a synthesized [[CatalogTable]] (table name, base path, schema) onto path-based
+ * Hudi reads whose underlying file index is incremental or CDC. Without it, lineage and
+ * governance tooling sees `LogicalRelation.catalogTable = None` and falls back to the
+ * relation's class name as the dataset identifier -- useless for tracking which table
+ * an incremental query came from.
+ *
+ * Scope is intentionally limited to incremental and CDC reads:
+ * - Catalog-registered reads already have `catalogTable` populated.
+ * - Path-based snapshot reads have a working file-path-based fallback in existing
+ * lineage tooling; changing their behavior is a separate decision.
+ */
+object HoodieIncrementalRelationIdentifier extends Rule[LogicalPlan] {
+ override def apply(plan: LogicalPlan): LogicalPlan =
+ AnalysisHelper.allowInvokingTransformsInAnalyzer {
+ plan transform {
+ // Type pattern + guard avoids destructuring `LogicalRelation`, whose case-class
+ // arity differs between Spark 3.x (4 args) and Spark 4.x (5 args). This rule
+ // lives in `hudi-spark`, which is compiled against every supported profile.
+ case lr: LogicalRelation
+ if lr.catalogTable.isEmpty
+ && lr.relation.isInstanceOf[HadoopFsRelation]
+ && isIncrementalOrCDC(lr.relation.asInstanceOf[HadoopFsRelation].location) =>
+ val fsRelation = lr.relation.asInstanceOf[HadoopFsRelation]
Review Comment:
🤖 nit: `lr.relation.asInstanceOf[HadoopFsRelation]` is repeated three times
across the guard, the rebinding, and the `metaClient` extraction. Could you
bind it once (e.g. an `@` pattern or a `val` after the match) so the cast
appears in only one place?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
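One possible way to bind the relation once is a small custom extractor, sketched below. This is only an illustration under stated assumptions: every type here (`LogicalPlan`, `LogicalRelation`, `HadoopFsRelation`, `FileIndex` and its two subclasses) is a simplified stand-in rather than the real Spark or Hudi class, and `IncrementalOrCdcFsRelation`, `Demo`, and the body of `isIncrementalOrCDC` are hypothetical names and logic invented for the sketch. It uses a type pattern instead of destructuring `LogicalRelation`, so it should stay independent of the case-class arity difference mentioned in the code comment.

```scala
// Stand-in types (NOT the Spark/Hudi classes): just enough shape to show the pattern.
trait FileIndex
final class IncrementalFileIndex extends FileIndex // hypothetical incremental/CDC index
final class SnapshotFileIndex extends FileIndex    // hypothetical snapshot index

trait BaseRelation
final case class HadoopFsRelation(location: FileIndex) extends BaseRelation
final case class OtherRelation() extends BaseRelation

trait LogicalPlan
// Deliberately a plain class: the sketch never destructures it, only type-tests it.
final class LogicalRelation(val relation: BaseRelation, val catalogTable: Option[String])
  extends LogicalPlan

// Hypothetical extractor: the type test on `relation` lives in exactly one place.
object IncrementalOrCdcFsRelation {
  // Placeholder predicate; the real check in the PR is not shown in the quoted diff.
  def isIncrementalOrCDC(location: FileIndex): Boolean =
    location.isInstanceOf[IncrementalFileIndex]

  def unapply(plan: LogicalPlan): Option[(LogicalRelation, HadoopFsRelation)] = plan match {
    case lr: LogicalRelation if lr.catalogTable.isEmpty =>
      lr.relation match {
        case fs: HadoopFsRelation if isIncrementalOrCDC(fs.location) => Some((lr, fs))
        case _ => None
      }
    case _ => None
  }
}

object Demo {
  // Call site: both the node and the already-typed relation are bound, with no casts.
  def action(plan: LogicalPlan): String = plan match {
    case IncrementalOrCdcFsRelation(_, _) => "stamp"
    case _ => "skip"
  }
}
```

In the rule itself, the `plan transform { ... }` case could then presumably read `case IncrementalOrCdcFsRelation(lr, fsRelation) => ...`, leaving no `asInstanceOf` in the guard or the body. Whether an extractor is worth the extra object compared with a nested `match` is a judgment call for the author.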