hudi-agent commented on code in PR #18726:
URL: https://github.com/apache/hudi/pull/18726#discussion_r3235391470
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##########
@@ -620,3 +624,48 @@ case class HoodiePostAnalysisRule(sparkSession: SparkSession) extends Rule[Logic
}
}
}
+
+/**
+ * Stamps a synthesized [[CatalogTable]] (table name, base path, schema) onto path-based
+ * Hudi reads whose underlying file index is incremental or CDC. Without it, lineage and
+ * governance tooling sees `LogicalRelation.catalogTable = None` and falls back to the
+ * relation's class name as the dataset identifier -- useless for tracking which table
+ * an incremental query came from.
+ *
+ * Scope is intentionally limited to incremental and CDC reads:
+ * - Catalog-registered reads already have `catalogTable` populated.
+ * - Path-based snapshot reads have a working file-path-based fallback in existing
+ * lineage tooling; changing their behavior is a separate decision.
+ */
+object HoodieIncrementalRelationIdentifier extends Rule[LogicalPlan] {
+ override def apply(plan: LogicalPlan): LogicalPlan =
+ AnalysisHelper.allowInvokingTransformsInAnalyzer {
Review Comment:
🤖 The 4-arg pattern `LogicalRelation(hfsr, _, None, _)` will fail to compile
against Spark 4.0/4.1, where `LogicalRelation` is a 5-arg case class (see e.g.
`BaseSpark4Adapter`, `Spark4HoodiePruneFileSourcePartitions`,
`HoodieSpark40/41Analysis`, all of which use the 5-arg form). Since `hudi-spark`
is built per Spark profile, this would break the Spark 4 builds. Could you
switch to a Spark-version-agnostic type pattern (e.g. `case lr: LogicalRelation if
lr.catalogTable.isEmpty && lr.relation.isInstanceOf[HadoopFsRelation] &&
isIncrementalOrCDC(lr.relation.asInstanceOf[HadoopFsRelation].location) =>
...`) to match the convention used by every other `LogicalRelation` reference
in `hudi-spark`? @yihua
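
A minimal, dependency-free sketch of why the type pattern is arity-independent. Everything here is a hypothetical stand-in, not the real Spark or Hudi classes: `FileIndex`, `IncrementalIndex`, `SnapshotIndex`, and the body of `isIncrementalOrCDC` are placeholders, and `fifthField` is a made-up name for whatever extra constructor argument Spark 4 adds. The point is only that the match reads fields by name, so it should compile the same way whether the case class has four or five parameters.

```scala
object TypePatternSketch {
  // Hypothetical stand-ins for Spark's file index hierarchy.
  trait FileIndex
  class IncrementalIndex extends FileIndex
  class SnapshotIndex extends FileIndex

  // Hypothetical stand-ins for Spark's relation types.
  trait BaseRelation
  final case class HadoopFsRelation(location: FileIndex) extends BaseRelation

  // Five-parameter stand-in; `fifthField` is a placeholder name, not Spark's.
  final case class LogicalRelation(
      relation: BaseRelation,
      output: Seq[String],
      catalogTable: Option[String],
      isStreaming: Boolean,
      fifthField: Option[String])

  // Placeholder check; the PR's real helper is not shown in this thread.
  def isIncrementalOrCDC(index: FileIndex): Boolean =
    index.isInstanceOf[IncrementalIndex]

  // Type pattern plus guard: no constructor arity appears in the match.
  def needsIdentifier(plan: Any): Boolean = plan match {
    case lr: LogicalRelation if lr.catalogTable.isEmpty =>
      lr.relation match {
        case fsRelation: HadoopFsRelation => isIncrementalOrCDC(fsRelation.location)
        case _ => false
      }
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    val incremental = LogicalRelation(
      HadoopFsRelation(new IncrementalIndex), Nil, None,
      isStreaming = false, fifthField = None)
    println(needsIdentifier(incremental))
  }
}
```

Dropping `fifthField` from the stand-in case class leaves `needsIdentifier` untouched, whereas a constructor pattern such as `LogicalRelation(_, _, None, _)` would need a different number of wildcards per arity.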
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##########
@@ -620,3 +624,48 @@ case class HoodiePostAnalysisRule(sparkSession: SparkSession) extends Rule[Logic
}
}
}
+
+/**
+ * Stamps a synthesized [[CatalogTable]] (table name, base path, schema) onto path-based
+ * Hudi reads whose underlying file index is incremental or CDC. Without it, lineage and
+ * governance tooling sees `LogicalRelation.catalogTable = None` and falls back to the
+ * relation's class name as the dataset identifier -- useless for tracking which table
+ * an incremental query came from.
+ *
+ * Scope is intentionally limited to incremental and CDC reads:
+ * - Catalog-registered reads already have `catalogTable` populated.
+ * - Path-based snapshot reads have a working file-path-based fallback in existing
+ * lineage tooling; changing their behavior is a separate decision.
+ */
+object HoodieIncrementalRelationIdentifier extends Rule[LogicalPlan] {
+ override def apply(plan: LogicalPlan): LogicalPlan =
+ AnalysisHelper.allowInvokingTransformsInAnalyzer {
+ plan transform {
+ case lr @ LogicalRelation(hfsr: HadoopFsRelation, _, None, _)
Review Comment:
🤖 nit: `hfsr` takes a moment of mental decoding; could you use
`fsRelation` or just `rel` so the reader doesn't have to expand the
abbreviation?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>