[ 
https://issues.apache.org/jira/browse/HUDI-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6868.
---------------------------
    Resolution: Fixed

> Hudi HiveSync doesn't support extracting passwords from credential store
> ------------------------------------------------------------------------
>
>                 Key: HUDI-6868
>                 URL: https://issues.apache.org/jira/browse/HUDI-6868
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: hive, hudi-utilities, spark
>            Reporter: Kuldeep Kulkarni
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0, 1.0.0
>
>         Attachments: pyspark_hudi_test.py
>
>
> We have a customer use case of running PySpark on [Dataproc 
> Serverless|https://cloud.google.com/dataproc-serverless/docs/overview] with the 
> [hudi-spark3-bundle|https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3-bundle].
>  The PySpark job fails to sync the Hudi table with the HMS DB (a remote 
> Cloud SQL instance) because it cannot extract the password from the 
> credential store. The same job works fine if we pass the Hive Metastore DB 
> user password directly instead of using the credential store. 
> Checking the 
> [code|https://github.com/apache/hudi/blob/release-0.12.3/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java]
>  for HiveSync configs and 
> [HiveSyncConfigHolder|https://github.com/apache/hudi/blob/73c2167566730a76a0650d488511253ebc66156f/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java#L44],
>  I don't see any option that consults a credential store to extract 
> passwords, along the lines of [this 
> code|https://github.com/apache/hive/blob/rel/release-2.3.9/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L482]
>  in the HMS ObjectStore.
> The [Hive Sync config 
> documentation|https://hudi.apache.org/docs/syncing_metastore/] also has no 
> reference to using a credential store. 
> To look up the password through the Hadoop Credential Provider API, Hudi 
> would need to call 
> [`Configuration#getPassword(String)`|https://hadoop.apache.org/docs/r3.3.6/api/org/apache/hadoop/conf/Configuration.html#getPassword-java.lang.String-].
>  We don't see "getPassword" called anywhere in the Hudi codebase.
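> As a minimal sketch of the behavior being requested (this is NOT Hudi or Hadoop code; the function name and the dict-based "provider" are invented for illustration), `Configuration#getPassword` resolves a property by consulting any configured credential providers first and only falls back to the clear-text config value:

```python
# Hypothetical sketch of Configuration#getPassword's resolution order:
# credential providers win, the clear-text property is only a fallback.

def resolve_password(key, config, credential_providers):
    """Return the secret for `key`, preferring credential providers."""
    for provider in credential_providers:
        # e.g. a JCEKS store such as
        # jceks://gs@<gcs-bucket>/metastore-pass-v2.jceks, keyed by alias
        if key in provider:
            return provider[key]
    # Fall back to the clear-text configuration value (may be absent).
    return config.get(key)

# Clear-text config carries the user name but not the password...
config = {"javax.jdo.option.ConnectionUserName": "hive"}
# ...while the credential store holds the password under the same alias.
jceks_store = {"javax.jdo.option.ConnectionPassword": "s3cret"}

password = resolve_password(
    "javax.jdo.option.ConnectionPassword", config, [jceks_store])
# password == "s3cret"
```

> If HiveSync instead reads the password only as a plain property, the JCEKS-backed alias is never consulted, which would match the "Access denied ... (using password: YES)" failure below.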
>  
> *Repro steps:*
>  
> Sample PySpark script - Attached. 
>  
> Command that succeeds when the Metastore DB password is passed directly:
> {code:java}
> gcloud dataproc batches submit --version 1.1 --container-image 
> gcr.io/<container-repo>/new-custom-debian:v4 --region <region> pyspark 
> gs://<gcs-bucket>/pyspark_hudi_test.py 
> --jars="gs://<gcs-bucket>/hudi-spark3-bundle_2.12-0.12.3.jar" --properties 
> "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<cloud-sql-HMS-DB-IP>:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.javax.jdo.option.ConnectionPassword=<hive-db-user-password>"
>  --deps-bucket gs://<gcs-bucket> -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/* 
> {code}
>  
> Failing command (with the credential store):
> {code:java}
> gcloud dataproc batches submit --version 1.1 --container-image 
> gcr.io/<container-repo>/new-custom-debian:v4 --region <region> pyspark 
> gs://<gcs-bucket>/pyspark_hudi_test.py 
> --jars="gs://<gcs-bucket>/hudi-spark3-bundle_2.12-0.12.3.jar" --properties 
> "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<cloud-sql-HMS-DB-IP>:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.hadoop.security.credential.provider.path=jceks://gs@<gcs-bucket>/metastore-pass-v2.jceks"
>  --deps-bucket gs://<gcs-bucket> -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/*  
> {code}
>  
> Error:
> {code:java}
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Commit 20230911042953444 
> successful!
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.inlineCompactionEnabled 
> ? false
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Compaction Scheduled is 
> Optional.empty
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.asyncClusteringEnabled ? 
> false
> 23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Clustering Scheduled is 
> Optional.empty
> 23/09/11 04:30:42 INFO HiveConf: Found configuration file null
> [..]
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
> from gs://<gcs-bucket>/
> 23/09/11 04:30:42 INFO HoodieTableConfig: Loading table properties from 
> gs://<gcs-bucket>/.hoodie/hoodie.properties
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Finished Loading Table of type 
> COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from gs://<gcs-bucket>/
> 23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading Active commit timeline 
> for gs://<gcs-bucket>/
> 23/09/11 04:30:42 INFO HoodieActiveTimeline: Loaded instants upto : 
> Option{val=[20230911042953444__commit__COMPLETED]}
> 23/09/11 04:30:43 INFO HiveMetaStore: 0: Opening raw store with 
> implementation class:org.apache.hadoop.hive.metastore.ObjectStore
> 23/09/11 04:30:43 INFO ObjectStore: ObjectStore, initialize called
> 23/09/11 04:30:44 INFO Persistence: Property datanucleus.cache.level2 unknown 
> - will be ignored
> Mon Sep 11 04:30:44 UTC 2023 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> [..]
> Unable to open a test connection to the given database. JDBC url = 
> jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: ------
> java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' 
> (using password: YES)
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3933)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3869)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:864)
> at 
> com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1707)
> at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1217)
> [..]
> ------
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: ------
> java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' 
> (using password: YES)
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
>  
> [..]
>  
> Caused by: java.sql.SQLException: Unable to open a test connection to the 
> given database. JDBC url = 
> jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: ------
> java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' 
> (using password: YES)
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
>  
> {code}
>  
> *Note* - metastore-pass-v2.jceks in the above example contains the value of 
> "javax.jdo.option.ConnectionPassword", and there is no issue with the store 
> itself: other PySpark jobs (without Hudi, of course) work fine with this 
> credential store.
>  
> We tried "hudi-spark3-bundle_2.12-0.13.1.jar" as well; it did not help.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
