Hi Igor,

In the current implementation, Hudi submits queries such as creating tables and
syncing partitions to the Hive server over JDBC instead of communicating
directly with the metastore. So when launching the EMR cluster, you should
install Hive on the cluster as well. Also enable the Glue catalog for both
Spark and Hive and you should be fine.
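Concretely, that means something along these lines in the cluster
configuration, which sets the Glue client factory for both Hive and Spark's
Hive integration (per the EMR docs; double-check the classifications against
the docs for your EMR release):

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```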

Thanks,
Udit Mehrotra
AWS | EMR

On 2/18/20, 2:29 AM, "Igor Basko" <igorba...@gmail.com> wrote:

    Hi Dear List,
    I'm trying to catalog Hudi files in the Glue catalog using the Hive sync
    tool, invoked through the Spark save function (not the standalone sync tool).
    
    I've created an EMR cluster with only the Spark application (without Hive),
    and added the following Hive metastore client factory class configuration:
    "hive.metastore.client.factory.class":
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    
    I've started spark-shell with the Hudi jars provided by EMR, and also with
    the 0.5.1 release jars; both gave me a "Cannot create hive
    connection ..." error when running the following code:
    https://gist.github.com/igorbasko01/05d81fef8f39e305527fd24b946fdb9a
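    For reference, the write in the gist is roughly of this shape (a sketch
    for spark-shell, not the exact gist code; option keys are from the Hudi
    0.5.1 docs, and the table/path names are made up for illustration):

```scala
// Sketch of a Hudi write with Hive sync enabled (Hudi 0.5.1 option keys);
// assumes this runs inside spark-shell, where `spark` and implicits exist
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq((1, "driver-a", 1581940000L)).toDF("id", "driver", "ts")

df.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "drivers")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.database", "default")
  .option("hoodie.datasource.hive_sync.table", "drivers")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/drivers")  // bucket name is an example
```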
    
    Looking at buildSyncConfig in HoodieSparkSqlWriter.scala, it seems there is
    no way to override the HiveSyncConfig.useJdbc variable to false:
    
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L232
    
    which means the HoodieHiveClient constructor will always try to
    createHiveConnection():
    
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L111
    
    instead of creating a Hive client from the configuration.
    
    The next thing I did was add a parameter that allows overriding the
    useJdbc variable.
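    The change was roughly along these lines (a sketch of the idea only; the
    option key name here is my own invention, not an existing Hudi option):

```scala
// Hypothetical addition to buildSyncConfig in HoodieSparkSqlWriter
// (in 0.5.1 useJdbc is effectively always true; key name is made up)
hiveSyncConfig.useJdbc =
  parameters.getOrElse("hoodie.datasource.hive_sync.use_jdbc", "true").toBoolean
```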
    Using that custom Hudi jar on EMR, I was able to progress further, but hit
    a different error down the line. I was happy to see that it was apparently
    using the AWSGlueClientFactory:
    20/02/17 13:55:17 INFO AWSGlueClientFactory: Using region from ec2 metadata
    : eu-west-1
    
    It was also able to detect that the table doesn't exist in Glue:
    20/02/17 13:55:18 INFO HiveSyncTool: Hive table drivers is not found.
    Creating it
    
    But I got the following exception:
    java.lang.NoClassDefFoundError: org/json/JSONException
      at
    
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
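    (For what it's worth, the NoClassDefFoundError suggests the org.json
    classes are simply missing from the spark-shell classpath; something like
    the following should pull them in. The jar path and the choice of
    org.json artifact version are assumptions on my part, not verified:)

```shell
# Sketch: add an org.json jar alongside the EMR-provided Hudi bundle
# (bundle path and json artifact version are assumptions, not verified)
spark-shell \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar \
  --packages org.json:json:20180813
```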
    
    A partial log can be found here:
    https://gist.github.com/igorbasko01/612f773632cb8382014166e0ed2a06d3
    
    As far as I can tell, when checking whether a table exists,
    HoodieHiveClient uses its client field, typed as the IMetaStoreClient
    interface, which AWSCatalogMetastoreClient implements, so that part
    works fine:
    
https://github.com/apache/incubator-hudi/blob/release-0.5.1/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L469
    
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-spark-client/src/main/java/com/amazonaws/glue/catalog/metastore/AWSCatalogMetastoreClient.java
    
    But createTable in HoodieHiveClient eventually creates a hive.ql.Driver
    instead of using the AWS client, which is where the exception is thrown.
    
    So what I would like to know is: am I doing something wrong when trying
    to sync to Glue? Or does Hudi currently not support updating the Glue
    catalog without some code changes?
    
    Best Regards,
    Igor
    
