parisni commented on PR #8683:
URL: https://github.com/apache/hudi/pull/8683#issuecomment-1613914438

   I have investigated a bit, and here is my current understanding:
   
   Reading a hudi table with spark has two paths:
   1. if `spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog` (which is what hudi recommends in the documentation), then hudi [will rely on the `HiveSessionCatalog` to get the schema](https://github.com/apache/hudi/blob/dc3aa399ffc4875abba7be5833ebabca222eb6ff/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala#L101-L109). If the metastore is hive, spark tries to resolve the schema case sensitively, so it does not take it from the hive schema (which is case insensitive) and instead falls back to the `spark.sql.sources.schema` table property. If it's a Glue metastore, the same likely happens. BTW, [our hive_sync service currently doesn't propagate the comments](https://github.com/apache/hudi/blob/dc3aa399ffc4875abba7be5833ebabca222eb6ff/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SparkDataSourceTableUtils.java#L44-L97) into `spark.sql.sources.schema`, which is why in this case `spark.sql("desc table")` or `spark.table("table").schema` won't return the comments. This behavior can currently be avoided by setting `hoodie.datasource.hive_sync.sync_as_datasource=false`, which forces spark to grab the information from hive (by leaving the spark properties empty in the HMS), but in a case-insensitive way. I'm not sure what the consequences of relying on hive only are.
   2. if `spark.sql.catalog.spark_catalog` is not set, or when reading the hudi table by path with `spark.read.format("hudi").load("path")`, then spark uses the code path updated in this PR, meaning it gets the schema information from the hudi avro schema. The exception is `spark.sql("desc table")`, because spark falls back to the `HiveSessionCatalog` in that case. (A small sketch of both paths follows this list.)
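
   To make the two paths concrete, here is a minimal sketch of how I check the comments on the read side; the session config, table name `hudi_tbl`, and path `/tmp/hudi_tbl` are placeholders, not taken from this PR:

```scala
import org.apache.spark.sql.SparkSession

// Path 1: catalog-based resolution through HoodieCatalog -> HiveSessionCatalog
val spark = SparkSession.builder()
  .appName("hudi-comment-check")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .enableHiveSupport()
  .getOrCreate()

// schema (and comments, if they made it into the HMS / spark.sql.sources.schema)
spark.sql("desc table hudi_tbl").show(false)
spark.table("hudi_tbl").schema.fields
  .foreach(f => println(s"${f.name} -> ${f.getComment()}"))

// Path 2: path-based read, schema comes from the hudi avro schema
// (the code path touched by this PR)
val byPath = spark.read.format("hudi").load("/tmp/hudi_tbl")
byPath.schema.fields
  .foreach(f => println(s"${f.name} -> ${f.getComment()}"))
```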
   
   So right now, with this PR and with both `hoodie.datasource.hive_sync.sync_comment=true` and `hoodie.datasource.hive_sync.sync_as_datasource=false` set, one gets the comments in every case (by identifier or by path). However, not writing the spark datasource information into the HMS might have some bad side effects (if not, why make so much effort to maintain two schemas within the HMS?). 
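
   For reference, the write-side workaround looks roughly like this (a sketch only, assuming an existing dataframe `df`; the record key / precombine fields `id` and `ts` and the paths are placeholders):

```scala
import org.apache.spark.sql.SaveMode

// the two hive_sync options below are the ones discussed above,
// the rest is a generic hudi write
df.write.format("hudi")
  .option("hoodie.table.name", "hudi_tbl")
  .option("hoodie.datasource.write.recordkey.field", "id")           // placeholder key field
  .option("hoodie.datasource.write.precombine.field", "ts")          // placeholder precombine field
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "hms")
  .option("hoodie.datasource.hive_sync.sync_comment", "true")        // push column comments to hive
  .option("hoodie.datasource.hive_sync.sync_as_datasource", "false") // skip the spark.sql.sources.* table props
  .mode(SaveMode.Append)
  .save("/tmp/hudi_tbl")
```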
   
   To fix this we could:
   1. make hive_sync [populate the comments in the properties](https://github.com/apache/hudi/blob/dc3aa399ffc4875abba7be5833ebabca222eb6ff/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SparkDataSourceTableUtils.java#L44-L97) (see the sketch after this list)
   2. make `HoodieCatalog` stop using the `HiveSessionCatalog` to get the schema, and use the hudi avro schema instead, skipping the HMS for this.
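
   For option 1, the point is only that the column comments have to end up in the spark schema JSON that hive_sync stores as `spark.sql.sources.schema`. A minimal illustration of what that means on the spark side (not the actual `SparkDataSourceTableUtils` code; the field names are made up):

```scala
import org.apache.spark.sql.types._

// Comments travel inside the StructField metadata, so they have to be attached
// before the schema is serialized to JSON and written as a table property.
val priced = StructField("price", DoubleType).withComment("unit price in EUR")
val schema = StructType(Seq(StructField("id", LongType), priced))

// This JSON is what spark reads back when the table is resolved as a datasource
// table; if the comment is not in the metadata here, `desc table` cannot show it.
println(schema.json)
// ... "metadata":{"comment":"unit price in EUR"} ...
```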
   
   I would go for `1` because it keeps the current logic intact, and it also covers the `spark.sql("desc tablename")` case when `spark.sql.catalog.spark_catalog` is not set.
   
   Thoughts @danny0405 @bhasudha @yihua?

