[ https://issues.apache.org/jira/browse/HUDI-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo closed HUDI-6868.
---------------------------
Resolution: Fixed

Hudi HiveSync doesn't support extracting passwords from credential store
------------------------------------------------------------------------

Key: HUDI-6868
URL: https://issues.apache.org/jira/browse/HUDI-6868
Project: Apache Hudi
Issue Type: Bug
Components: hive, hudi-utilities, spark
Reporter: Kuldeep Kulkarni
Priority: Major
Labels: pull-request-available
Fix For: 0.15.0, 1.0.0
Attachments: pyspark_hudi_test.py

We have a customer use case running PySpark on [Dataproc Serverless|https://cloud.google.com/dataproc-serverless/docs/overview] with the [hudi-spark3-bundle|https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3-bundle]. The PySpark job fails to sync the Hudi table with the HMS database (a remote CloudSQL DB instance) because it cannot extract the password from the credential store. The same job works fine if we specify the Hive Metastore DB user password directly instead of the credential store.

Checking the [code|https://github.com/apache/hudi/blob/release-0.12.3/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java] for the HiveSync configs, or [HiveSyncConfigHolder|https://github.com/apache/hudi/blob/73c2167566730a76a0650d488511253ebc66156f/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java#L44], I don't see any option that reads passwords from a credential store, along the lines of [this code|https://github.com/apache/hive/blob/rel/release-2.3.9/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L482] in the HMS ObjectStore. The [Hive Sync Config documentation|https://hudi.apache.org/docs/syncing_metastore/] also has no reference to using a credential store.
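For reference, the credential-store lookup that HMS's ObjectStore performs boils down to a single Hadoop `Configuration#getPassword` call, which consults any configured providers and then falls back to the plain config value. Below is a minimal sketch of what an equivalent lookup could look like on the HiveSync side; the `resolveMetastorePassword` helper name is hypothetical and not part of Hudi, and the snippet assumes the Hadoop client libraries are on the classpath:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

public class CredentialResolver {

    // Hypothetical helper (not existing Hudi code): resolve the metastore
    // password via the Hadoop Credential Provider API.
    static String resolveMetastorePassword(Configuration conf) throws IOException {
        // getPassword() first consults any providers configured under
        // hadoop.security.credential.provider.path (e.g. a jceks:// store)
        // and only then falls back to the plain configuration entry, so
        // both the direct-password and credential-store cases are covered.
        char[] pw = conf.getPassword("javax.jdo.option.ConnectionPassword");
        return pw == null ? null : new String(pw);
    }
}
```

With a lookup like this in the sync path, a job that sets only `spark.hadoop.hadoop.security.credential.provider.path` to a jceks store would have the password resolved transparently, the same way HMS itself resolves it.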
To find the password through the Hadoop Credential Provider API, HiveSync would need to call [`Configuration#getPassword(String)`|https://hadoop.apache.org/docs/r3.3.6/api/org/apache/hadoop/conf/Configuration.html#getPassword-java.lang.String-]. We don't see "getPassword" called anywhere in the Hudi codebase.

*Repro steps:*

Sample PySpark script: attached.

Command with successful job execution, passing the Metastore DB password directly:
{code:java}
gcloud dataproc batches submit --version 1.1 \
  --container-image gcr.io/<container-repo>/new-custom-debian:v4 \
  --region <region> pyspark gs://<gcs-bucket>/pyspark_hudi_test.py \
  --jars="gs://<gcs-bucket>/hudi-spark3-bundle_2.12-0.12.3.jar" \
  --properties "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<cloud-sql-HMS-DB-IP>:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.javax.jdo.option.ConnectionPassword=<hive-db-user-password>" \
  --deps-bucket gs://<gcs-bucket> -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/*
{code}

Failing command, using the credential store:
{code:java}
gcloud dataproc batches submit --version 1.1 \
  --container-image gcr.io/<container-repo>/new-custom-debian:v4 \
  --region <region> pyspark gs://<gcs-bucket>/pyspark_hudi_test.py \
  --jars="gs://<gcs-bucket>/hudi-spark3-bundle_2.12-0.12.3.jar" \
  --properties "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<cloud-sql-HMS-DB-IP>:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.hadoop.security.credential.provider.path=jceks://gs@<gcs-bucket>/metastore-pass-v2.jceks" \
  --deps-bucket gs://<gcs-bucket> -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/*
{code}

Error:
{code:java}
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Commit 20230911042953444 successful!
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.inlineCompactionEnabled ? false
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Compaction Scheduled is Optional.empty
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.asyncClusteringEnabled ? false
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Clustering Scheduled is Optional.empty
23/09/11 04:30:42 INFO HiveConf: Found configuration file null
[..]
23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from gs://<gcs-bucket>/
23/09/11 04:30:42 INFO HoodieTableConfig: Loading table properties from gs://<gcs-bucket>/.hoodie/hoodie.properties
23/09/11 04:30:42 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from gs://<gcs-bucket>/
23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading Active commit timeline for gs://<gcs-bucket>/
23/09/11 04:30:42 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20230911042953444__commit__COMPLETED]}
23/09/11 04:30:43 INFO HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
23/09/11 04:30:43 INFO ObjectStore: ObjectStore, initialize called
23/09/11 04:30:44 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
Mon Sep 11 04:30:44 UTC 2023 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
[..]
Unable to open a test connection to the given database.
JDBC url = jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive.
Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' (using password: YES)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3933)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3869)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:864)
    at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1707)
    at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1217)
[..]
------
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' (using password: YES)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
[..]
Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>' (using password: YES)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
{code}

*Note*: metastore-pass-v2.jceks in the above example contains the value of "javax.jdo.option.ConnectionPassword", and there is no issue with it.
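For completeness, the store in question can be created and inspected with the standard Hadoop credential CLI. The commands below are a sketch assuming the `hadoop` CLI is on the path and the GCS connector is configured; bucket and file names match the repro above:

```shell
# Create the credential store holding the metastore password
# (prompts for the secret; the alias must match the config key
# that Configuration#getPassword would look up).
hadoop credential create javax.jdo.option.ConnectionPassword \
  -provider jceks://gs@<gcs-bucket>/metastore-pass-v2.jceks

# Verify the alias is present in the store.
hadoop credential list \
  -provider jceks://gs@<gcs-bucket>/metastore-pass-v2.jceks
```

A `hadoop credential list` that shows the `javax.jdo.option.ConnectionPassword` alias confirms the store itself is valid, which matches the observation that other (non-Hudi) jobs read it fine.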
It works fine with this credential store for other PySpark jobs (without Hudi, of course).

We tried with "hudi-spark3-bundle_2.12-0.13.1.jar" as well; it did not help.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)