[GitHub] [hudi] parisni commented on pull request #8683: [HUDI-5533] Support spark columns comments
parisni commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1656905343

@danny0405 implemented the hive sync datasource, so to clarify what this PR brings:
- spark reads hudi from a path (recursive support for comments)
- spark reads hudi from a hive metastore with `hoodie.datasource.hive_sync.sync_as_datasource=true` (the default): support for the first level of comments

#4960 provided:
- spark reads hudi from a hive metastore with `hoodie.datasource.hive_sync.sync_as_datasource=false`: support for the first level of comments
- query engines (athena, trino, ...) access the comments from the hive metastore

#8740 provided:
- spark reads hudi from a glue metastore with `hoodie.datasource.hive_sync.sync_as_datasource=false`: support for the first level of comments
- query engines (athena, trino, ...) access the comments from the glue metastore

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
parisni commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1646829280

@danny0405 I just learned [here why the spark data source stuff shall be kept within the hms](https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=147427331#content/view/147427331):

> With the above, all of Spark SQL queries will use Hudi DataSource and hence end up using the custom FileIndex for doing the listing.

So there is a need to update the datasource code to be aware of comments within hive sync. @prashantwason can you confirm this part is still applicable and that, as a result, `hoodie.datasource.hive_sync.sync_as_datasource` is highly encouraged?
parisni commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1629076136

@danny0405 The PR is ready. The way to activate comments for all engines is as below:
```
hoodie.datasource.hive_sync.sync_comment=true
hoodie.datasource.hive_sync.sync_as_datasource=false
```
To recap:
- `hoodie.datasource.hive_sync.sync_comment`: adds the comments in the metastore (hive, glue, ...)
- the current PR fixes the missing comments when reading the hudi table from a path (not from the metastore) with spark
- `hoodie.datasource.hive_sync.sync_as_datasource`: allows spark to get the comments from the metastore

The reason the spark datasource properties should be disabled is that our current table property builder relies on parquet types, which do [not know about comments](https://github.com/apache/hudi/blob/a9225bd9093e8be8f9ccdd07b0bc542994ffc4c6/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/Parquet2SparkSchemaUtils.java#L36). It would be painful to modify, and in that case the spark table properties are useless and lead to errors.
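As a sketch of what the by-path fix amounts to conceptually (illustrative only, not the PR's actual code): the avro field `doc` entries have to end up as `comment` metadata on the corresponding spark fields. The helper name and the flat/primitive-only handling below are assumptions for the example:

```python
import json

def avro_doc_to_spark_comments(avro_schema: dict) -> dict:
    """Sketch: convert a flat avro record schema into Spark's schema JSON,
    carrying each field's "doc" over as "comment" metadata.
    Real code must also map avro types to spark types and recurse into
    nested records/unions."""
    fields = []
    for f in avro_schema["fields"]:
        field = {
            "name": f["name"],
            "type": f["type"],  # naive: assumes primitive type names coincide
            "nullable": True,
            "metadata": {},
        }
        if "doc" in f:
            field["metadata"]["comment"] = f["doc"]
        fields.append(field)
    return {"type": "struct", "fields": fields}

avro = {
    "type": "record",
    "name": "example",
    "fields": [
        {"name": "id", "type": "long", "doc": "primary key"},
        {"name": "payload", "type": "string"},
    ],
}

spark_schema = avro_doc_to_spark_comments(avro)
print(json.dumps(spark_schema["fields"][0]["metadata"]))  # → {"comment": "primary key"}
```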
parisni commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1613914438

I have investigated a bit, and here is my current understanding. Reading a hudi table with spark has two paths:

1. If `spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog` (which is what hudi recommends in the documentation), then hudi [will rely on the `HiveSessionCatalog` to get the schema](https://github.com/apache/hudi/blob/dc3aa399ffc4875abba7be5833ebabca222eb6ff/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala#L101-L109). If it is a hive metastore implementation, spark will try to get the schema case-sensitively, thus not take it from the hive schema (which is case insensitive), and fall back to fetching the table property `spark.sql.sources.schema` instead. If it is a glue metastore, likely the same happens. BTW, [our hive_sync service currently does not propagate the comments](https://github.com/apache/hudi/blob/dc3aa399ffc4875abba7be5833ebabca222eb6ff/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SparkDataSourceTableUtils.java#L44-L97) into `spark.sql.sources.schema`, which is why in this case `spark.sql("desc table")` or `spark.table("table").schema` won't return the comments. This behavior can currently be avoided by setting `hoodie.datasource.hive_sync.sync_as_datasource=false`, which forces spark to grab the information from hive (by leaving the spark properties empty in the hms), but in a case-insensitive way. I'm not sure what the consequences of relying on hive only are.
2. If `spark.sql.catalog.spark_catalog` is not set, or when reading the hudi table by path with `spark.read.format("hudi").load("path")`, then spark uses the code path updated in this PR, i.e. it gets the schema information from the hudi avro file. Except when using `spark.sql("desc table")`, because spark falls back to the `HiveSessionCatalog` in that case.

So right now, using this PR and setting both `hoodie.datasource.hive_sync.sync_comment=true` and `hoodie.datasource.hive_sync.sync_as_datasource=false`, one will get the comments in any case (by identifier or by path). However, not setting the spark datasource information within the HMS might have some bad effects (if not, why make so much effort to maintain two schemas within the hms?). To fix this we could:

1. make hive_sync [populate the comments in the properties](https://github.com/apache/hudi/blob/dc3aa399ffc4875abba7be5833ebabca222eb6ff/hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SparkDataSourceTableUtils.java#L44-L97)
2. make `HoodieCatalog` no longer use the `HiveSessionCatalog` to get the schema, but use the hudi avro file instead and skip the HMS for this.

I would go for `1` because it keeps the current logic intact, and it also covers the `spark.sql("desc tablename")` case when `spark.sql.catalog.spark_catalog` is not set. Thoughts @danny0405 @bhasudha @yihua?
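To make option `1` more concrete: spark reads its case-preserving schema back from the `spark.sql.sources.schema` table properties as a JSON-serialized `StructType`, and a field comment lives in that JSON under `metadata.comment`. Below is a minimal sketch of the property map hive_sync would need to produce; the column list is made up, and the 4000-character chunk size is an assumption (spark's split threshold is configurable):

```python
import json

# Hypothetical columns: (name, spark type, comment-or-None)
columns = [
    ("id", "long", "primary key"),
    ("ts", "timestamp", "event time"),
    ("payload", "string", None),
]

schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {
            "name": name,
            "type": typ,
            "nullable": True,
            # the comment must land here for spark.table(...) / DESC to show it
            "metadata": {"comment": comment} if comment else {},
        }
        for name, typ, comment in columns
    ],
})

# Spark stores the JSON split across part.N properties when it is long
CHUNK = 4000  # assumed; see spark.sql.sources.schemaStringLengthThreshold
parts = [schema_json[i:i + CHUNK] for i in range(0, len(schema_json), CHUNK)]
table_properties = {"spark.sql.sources.schema.numParts": str(len(parts))}
for i, part in enumerate(parts):
    table_properties[f"spark.sql.sources.schema.part.{i}"] = part

print(sorted(table_properties))
```

Reassembling the `part.N` values and parsing the JSON gives back a struct whose first field carries `{"comment": "primary key"}` in its metadata, which is exactly the piece the current `SparkDataSourceTableUtils` code drops.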
parisni commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1587806101

While `spark.read.format("hudi").load("path")` and `spark.table("database.table")` return the comments within the schema, `spark.sql("desc database.table")` does not. So I have to investigate and add a fix.
parisni commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1546895581

> Does Option.ofNullable work correctly here?

Scala `Option` does not have `ofNullable` (java `Optional` does). BTW `Option(value)` is equivalent to what you suggest, and I have corrected it.