Hi Igor,

In the current implementation, Hudi submits queries such as table creation and partition syncing to the Hive server rather than communicating with the metastore directly. So when launching the EMR cluster, you should install Hive on the cluster as well. Also enable the Glue catalog for both Spark and Hive, and you should be fine.
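For reference, a minimal sketch of an EMR configuration that enables the Glue catalog for both Hive and Spark. It assumes the standard EMR configuration classifications hive-site and spark-hive-site and reuses the factory class quoted in the original message below; Hive itself still has to be selected as an application when the cluster is created.

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```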
Thanks,
Udit Mehrotra
AWS | EMR

On 2/18/20, 2:29 AM, "Igor Basko" <igorba...@gmail.com> wrote:

Hi Dear List,

I'm trying to catalog Hudi files in the Glue catalog using the Hive sync tool, through the Spark save function (not the standalone version).

I've created an EMR cluster with the Spark application only (without Hive), and added the following Hive metastore client factory class configuration:

"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"

I started the spark-shell using the Hudi jars provided by EMR, and also using version 0.5.1; both gave me the "Cannot create hive connection ..." error when running the following code:
https://gist.github.com/igorbasko01/05d81fef8f39e305527fd24b946fdb9a

After looking at buildSyncConfig in HoodieSparkSqlWriter.scala, it seems there is no way to override the HiveSyncConfig.useJdbc variable to be false:
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L232
This means that the HoodieHiveClient constructor will always try to createHiveConnection():
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L111
instead of creating a Hive client from the configuration.

Next, I added a parameter that enables overriding the useJdbc variable, used the custom Hudi jar on the EMR cluster, and was able to progress further, but got a different error down the line.

I was happy to see that it was apparently using the AWSGlueClientFactory:
20/02/17 13:55:17 INFO AWSGlueClientFactory: Using region from ec2 metadata : eu-west-1
and that it was able to detect that the table doesn't exist in Glue:
20/02/17 13:55:18 INFO HiveSyncTool: Hive table drivers is not found. Creating it

But I got the following exception:
java.lang.NoClassDefFoundError: org/json/JSONException
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)

A partial log can be found here:
https://gist.github.com/igorbasko01/612f773632cb8382014166e0ed2a06d3

As it seems to me, when checking whether a table exists, HoodieHiveClient uses the client variable, which is of the interface type IMetaStoreClient that AWSCatalogMetastoreClient implements, and this works fine.
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L469
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-spark-client/src/main/java/com/amazonaws/glue/catalog/metastore/AWSCatalogMetastoreClient.java

But createTable in HoodieHiveClient eventually creates a hive.ql.Driver and does not use the AWS client, which eventually leads to the exception.

So what I would like to know is: am I doing something wrong when trying to sync to Glue? Or does Hudi currently not support updating the Glue catalog without code changes?

Best Regards,
Igor