Re: [I] [SUPPORT] - Issues after upgrading EMR & Hudi [hudi]
MikeMccree closed issue #10273: [SUPPORT] - Issues after upgrading EMR & Hudi
URL: https://github.com/apache/hudi/issues/10273

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
MikeMccree commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1855943806

Hi @ad1happy2go, yes, confirmed it is syncing. I see the DB, tables, and data.
ad1happy2go commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1855513194

@MikeMccree Are you sure that, after removing this, it is syncing to the Glue Catalog? Did you confirm the tables?
MikeMccree commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1853396088

@ad1happy2go Which logs would you like to see?

Also, after more playing around with the configs, I discovered the below:

```
# 'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool',
#   --> this, combined with the line below, gives SCHEMA_NOT_FOUND

# 'hoodie.datasource.hive_sync.create_managed_table': 'true',
#   --> this on its own, without AwsGlueCatalogSyncTool above, gives
#       "Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool"
```

I removed the above configs from my Hudi configuration and everything is working now. Maybe I don't fully understand the configurations and perhaps never needed those anyway.
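For reference, a minimal sketch of the trimmed-down option set described above — the full config from earlier in the thread, minus the two problematic keys. The `database_name` value comes from the thread; `loggingtablename`'s value here is a hypothetical placeholder:

```python
# Sketch: the thread's Hudi write/sync options with the two keys that caused
# errors removed ('hoodie.meta.sync.client.tool.class' and
# 'hoodie.datasource.hive_sync.create_managed_table').
database_name = "michael_test"       # from the thread
loggingtablename = "logging_counts"  # hypothetical placeholder

hudi_streaming_count_options = {
    'hoodie.database.name': database_name,
    'hoodie.table.name': loggingtablename,
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': loggingtablename,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.precombine.field': 'start_time',
    'hoodie.datasource.write.partitionpath.field': "table_name, batch_id",
    'hoodie.datasource.hive_sync.partition_fields': "table_name, batch_id",
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': "batch_id",
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.write.num.retries.on.conflict.failures': '15',
}
```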
ad1happy2go commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1852540846

Can you provide us the logs so we can look into it further?
MikeMccree commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1852522867

@ad1happy2go Something else interesting to note. If I manually create the DB:

```
database_name = "michael_test"

# Create the database
spark.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")
```

the error disappears, but I am not seeing the tables and data being added to the DB.
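One way to check whether the meta sync actually registered anything is to list the tables Spark can see in that database. A sketch, assuming a running SparkSession named `spark` against the Glue-backed catalog (the helper name is ours, not from the thread):

```python
def list_synced_tables(spark, database_name):
    # Sketch: list the tables visible in the catalog database, e.g. to
    # verify that Hudi's hive/glue sync actually created the table entries.
    rows = spark.sql(f"SHOW TABLES IN {database_name}").collect()
    return [row.tableName for row in rows]
```

An empty result after a successful write would point at the sync step rather than the write itself.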
MikeMccree commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1852492519

Hi @ad1happy2go

> Also, did you try explicitly defining the Glue Sync Tool?

Yes, I had it in while running all my tests. I have added both of these configurations to my script and still get the same error:

```
'hoodie.database.name': database_name,
'hoodie.table.name': loggingtablename,
'hoodie.datasource.hive_sync.database': database_name,
'hoodie.datasource.hive_sync.table': loggingtablename,
'hoodie.datasource.hive_sync.auto_create_database': 'true',
'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool',
```

By default `hoodie.datasource.hive_sync.auto_create_database` is true in any case, and I did not have to specify it in previous versions; it would auto-create my database. This issue has me really stumped at the moment.
ad1happy2go commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1852436464

@MikeMccree Do you have this database in Glue? If yes, then your setup might not be accessing Glue at all. You can use [`hoodie.datasource.hive_sync.auto_create_database`](https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncauto_create_database) to automatically create the database if it does not exist.

Also, did you try explicitly defining the Glue Sync Tool? https://github.com/apache/hudi/issues/10273#issuecomment-1849968200
MikeMccree commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1852429920

@ad1happy2go After more toying around, I managed to get rid of the above exceptions by being specific about the JARs I am submitting along with my spark-submit. The problem now is that I am running into the following issue:

```
[SCHEMA_NOT_FOUND] The schema `michael_test` cannot be found. Verify the spelling and correctness of the schema and catalog. If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.
```

Why would I receive the above error when I specify the below in my script:

```
# Specify the database name
database_name = "michael_test"

'hoodie.database.name': database_name,
'hoodie.table.name': loggingtablename,
'hoodie.datasource.hive_sync.database': database_name,
'hoodie.datasource.hive_sync.table': loggingtablename,
```

**Again, the above config / script worked perfectly fine on EMR 6.10.0 > Spark 3.3.1 > Hudi 0.12.2.**

**Is there possibly something buggy with EMR 6.15.0 > Spark 3.4.1 > Hudi 0.14.0?**
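For reference, "being specific about the JARs" on EMR might look like the sketch below. The exact jar names and the script name are assumptions, not confirmed in the thread; check what actually ships under usr/lib/hudi on the cluster:

```shell
# Sketch: submit with the Hudi bundles that ship on the EMR node.
# Jar paths and my_hudi_job.py are placeholders; verify them on the cluster.
spark-submit \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  my_hudi_job.py
```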
MikeMccree commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1850461094

Hi @ad1happy2go

Thanks for the above. I think my real question is the following: "Are there any additional config changes I need to make to my script to upgrade from Hudi 0.12.2 to 0.14.0?" I have read the release notes, and it doesn't seem to be the case. So I am just curious as to why my current 0.12.2 script does not work when upgrading to 0.14.0. I never had this config option (`hoodie.meta.sync.client.tool.class`) originally.
ad1happy2go commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1849968200

@MikeMccree Do you want to sync your table with the Glue catalog? If yes, can you set `hoodie.meta.sync.client.tool.class` to `org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool`?
MikeMccree commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1849816260

@ad1happy2go The only config I am using is mentioned above, but here it is again for you:

```
hudi_streaming_count_options = {
    'hoodie.database.name': database_name,
    'hoodie.table.name': loggingtablename,
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': loggingtablename,
    'hoodie.datasource.hive_sync.create_managed_table': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.precombine.field': 'start_time',
    'hoodie.datasource.write.partitionpath.field': "table_name, batch_id",
    'hoodie.datasource.hive_sync.partition_fields': "table_name, batch_id",
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': "batch_id",
    'hoodie.datasource.write.operation': 'upsert',
    "hoodie.write.num.retries.on.conflict.failures": "15",
}
```
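For context, options like these are typically passed to a Spark DataFrame writer, with the meta sync firing after the commit. A sketch under that assumption (the helper name, `df`, and the S3 path are placeholders, not from the thread):

```python
def write_with_hudi(df, options, s3_path):
    # Sketch: upsert a batch into a Hudi table at s3_path; if
    # hoodie.datasource.hive_sync.enable is 'true' in options, Hudi runs
    # meta sync (Hive/Glue) after the commit.
    (df.write.format("hudi")
        .options(**options)
        .mode("append")
        .save(s3_path))
```

Usage would look like `write_with_hudi(df, hudi_streaming_count_options, "s3://my-bucket/logging_counts/")`, with the bucket path being a hypothetical example.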
ad1happy2go commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1849361768

@MikeMccree Can you let us know the hive sync configurations you are using?
MikeMccree commented on issue #10273:
URL: https://github.com/apache/hudi/issues/10273#issuecomment-1847137840

Do you need any more info to assist?
[I] [SUPPORT] - Issues after upgrading EMR & Hudi [hudi]
MikeMccree opened a new issue, #10273:
URL: https://github.com/apache/hudi/issues/10273

**Describe the problem you faced**

Attempting to upgrade my Hudi version and EMR cluster, but running into sync issues, table-type issues, and partition sync issues. It seems the DB gets created and the tables are created, but there is no data inserted into these tables. None of the above was experienced on my previous EMR cluster.

**To Reproduce**

Steps to reproduce the behavior (for me):

1. Spin up a new EMR cluster (emr-6.15.0 or 6.14.0)
2. Along with this comes Hudi 0.14, as seen in usr/lib/hudi
3. Try to run my exact script, which was working on emr-6.10 with Hudi 0.12.2
4. I deleted the destination S3 objects, Glue DB, and Glue tables

**Expected behavior**

I would expect the same script to run successfully and insert data.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version : 3.1.1
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Config I am using:

```
hudi_streaming_count_options = {
    'hoodie.database.name': database_name,
    'hoodie.table.name': loggingtablename,
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': loggingtablename,
    'hoodie.datasource.hive_sync.create_managed_table': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.precombine.field': 'start_time',
    'hoodie.datasource.write.partitionpath.field': "table_name, batch_id",
    'hoodie.datasource.hive_sync.partition_fields': "table_name, batch_id",
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': "batch_id",
    'hoodie.datasource.write.operation': 'upsert',
    "hoodie.write.num.retries.on.conflict.failures": "15",
}
```

**Stacktrace**

```
An error occurred while calling o344.save.
: org.apache.hudi.exception.HoodieMetaSyncException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
	at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:81)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:993)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:991)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1089)
	at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:441)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(Query
```