[jira] [Resolved] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell

2024-02-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE resolved SPARK-47197.
--
Resolution: Not A Problem

https://github.com/apache/spark/pull/45309#issuecomment-1969269354

> Failed to connect HiveMetastore when using iceberg with HiveCatalog on 
> spark-sql or spark-shell
> ---
>
> Key: SPARK-47197
> URL: https://issues.apache.org/jira/browse/SPARK-47197
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.3, 3.5.1
>Reporter: YUBI LEE
>Priority: Major
>  Labels: pull-request-available
>
> I can't connect to a Kerberized HiveMetastore when using Iceberg with 
> HiveCatalog on spark-sql or spark-shell.
> I believe this is because there is no way to obtain a HIVE_DELEGATION_TOKEN 
> when using spark-sql or spark-shell.
> ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83])
>  
> {code:java}
>     val currentToken = UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
>     currentToken == null && UserGroupInformation.isSecurityEnabled &&
>       hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
>       (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) ||
>         (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB)))
> {code}
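The guard quoted above can be modeled as a plain predicate to see why client-mode shells never qualify. A minimal sketch (names are illustrative, not Spark's actual API), assuming security is enabled, a metastore URI is configured, and no token is already cached:

```python
def delegation_token_required(current_token, security_enabled, metastore_uris,
                              is_proxy_user, client_mode, has_keytab):
    """Illustrative model of the delegationTokensRequired check above."""
    return (current_token is None and security_enabled and
            bool(metastore_uris.strip()) and
            (is_proxy_user or (not client_mode and not has_keytab)))

# spark-sql / spark-shell run in client mode without a keytab and are not
# proxy users, so the provider concludes no Hive token is needed:
print(delegation_token_required(None, True, "thrift://metastore:9083",
                                is_proxy_user=False, client_mode=True,
                                has_keytab=False))  # False

# The same user in cluster mode without a keytab would get a token:
print(delegation_token_required(None, True, "thrift://metastore:9083",
                                is_proxy_user=False, client_mode=False,
                                has_keytab=False))  # True
```

This makes the reported gap visible: with `client_mode=True` the final disjunct is false, so nothing short of being a proxy user makes the provider fetch a HIVE_DELEGATION_TOKEN.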
> There should be a way to force obtaining a HIVE_DELEGATION_TOKEN even when 
> using spark-sql or spark-shell.
> One possible approach is to obtain the HIVE_DELEGATION_TOKEN whenever the 
> configuration below is explicitly set:
> {code:java}
> spark.security.credentials.hive.enabled   true
> {code}
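> For context, the flag can already be supplied today via spark-defaults.conf or 
> --conf; under the current check it only enables or disables the Hive credential 
> provider, it does not yet force token acquisition in client mode. Hypothetical 
> usage if the proposal were adopted:
> {code}
> # spark-defaults.conf
> spark.security.credentials.hive.enabled   true
>
> # or per invocation on the command line
> spark-sql --conf spark.security.credentials.hive.enabled=true \
>   -e "select * from temp.test_hive_catalog"
> {code}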
>  
> {code:java}
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore
> ...
> Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: GSS initiate failed
> {code}
>  
>  
> {code:java}
> spark-sql> select * from temp.test_hive_catalog;
> ...
> ...
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore
>         at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
>         at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
>         at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
>         at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
>         at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
>         at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
>         at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
>         at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
>         at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
>         at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
>         at org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
>         at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
>         at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
>         at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
>         at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
>         at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
>         at scala.Option.foreach(Option.scala:407)
>         at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:247)
>         at ...
> {code}

[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)



YUBI LEE updated SPARK-47197:
-
Description: 

[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)



YUBI LEE updated SPARK-47197:
-
Summary: Failed to connect HiveMetastore when using iceberg with 
HiveCatalog on spark-sql or spark-shell  (was: Failed to connect HiveMetastore 
when using iceberg with HiveCatalog by spark-sql or spark-shell)


[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)



YUBI LEE updated SPARK-47197:
-
Component/s: SQL


[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)



YUBI LEE updated SPARK-47197:
-
Description: 

[jira] [Updated] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)



YUBI LEE updated SPARK-47197:
-
Description: 

[jira] [Created] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog by spark-sql or spark-shell

2024-02-27 Thread YUBI LEE (Jira)
YUBI LEE created SPARK-47197:


 Summary: Failed to connect HiveMetastore when using iceberg with 
HiveCatalog by spark-sql or spark-shell
 Key: SPARK-47197
 URL: https://issues.apache.org/jira/browse/SPARK-47197
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 3.5.1, 3.2.3
Reporter: YUBI LEE


I can't connect to kerberized HiveMetastore when using iceberg with HiveCatalog 
by spark-sql or spark-shell.

I think this issue is caused by the fact that there is no way to get 
HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.

([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)]

 
{code:java}
    val currentToken = 
UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
    currentToken == null && UserGroupInformation.isSecurityEnabled &&
      hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
      (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) ||
        (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) {code}
There should be a way to force to get HIVE_DELEGATION_TOKEN even when using 
spark-sql or spark-shell.

Possible way is to get HIVE_DELEGATION_TOKEN if the configuration below is set?
{code:java}
spark.security.credentials.hive.enabled   true {code}
 

 

 
{code:java}
spark-sql> select * from temp.test_hive_catalog;
...
...
24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (machine1.example.com executor 2): org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore
        at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
        at org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
        at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
        at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
        at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
        at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
        at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
        at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
        at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
        at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
        at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
        at org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
        at org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
        at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
        at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
        at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
        at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
        at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:247)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)

[jira] [Comment Edited] (SPARK-44976) Preserve full principal user name on executor side

2023-12-07 Thread YUBI LEE (Jira)


[ https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759201#comment-17759201 ]

YUBI LEE edited comment on SPARK-44976 at 12/8/23 12:34 AM:


[https://github.com/apache/spark/pull/44244]


was (Author: eub):
 

[https://github.com/apache/spark/pull/44244]

> Preserve full principal user name on executor side
> --
>
> Key: SPARK-44976
> URL: https://issues.apache.org/jira/browse/SPARK-44976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3, 3.3.3, 3.4.1
>Reporter: YUBI LEE
>Priority: Major
>  Labels: pull-request-available
>
> SPARK-6558 changed the behavior of {{Utils.getCurrentUserName()}} to use the
> short name instead of the full principal name.
> Because of this, it doesn't respect the {{hadoop.security.auth_to_local}} rule
> on the side of a non-kerberized HDFS namenode.
> For example, I use two HDFS clusters: one is kerberized, the other is not.
> I made a rule that adds a prefix to the username on the non-kerberized cluster
> when someone accesses it from the kerberized cluster.
> {code}
>   <property>
>     <name>hadoop.security.auth_to_local</name>
>     <value>
>       RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
>       RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
>       DEFAULT
>     </value>
>   </property>
> {code}
> However, if I submit a Spark job with the keytab & principal options, the
> ownership of the HDFS directories and files is not consistent.
> (I changed some words for privacy.)
> {code}
> $ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23
> Found 52 items
> -rw-rw-rw-   3 _ex_eub hdfs          0 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/_SUCCESS
> -rw-r--r--   3 eub  hdfs  134418857 2023-05-11 00:15 hdfs:///user/eub/some/path/20230510/23/part-0-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub  hdfs  153410049 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-1-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub  hdfs  157260989 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-2-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> -rw-r--r--   3 eub  hdfs  156222760 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-3-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
> {code}
> Another interesting point is that if I submit a Spark job without the keytab
> and principal options, but with Kerberos authentication via {{kinit}}, it does
> not follow the {{hadoop.security.auth_to_local}} rule at all.
> {code}
> $ hdfs dfs -ls hdfs:///user/eub/output/
> Found 3 items
> -rw-rw-r--+  3 eub hdfs    0 2023-08-25 12:31 hdfs:///user/eub/output/_SUCCESS
> -rw-rw-r--+  3 eub hdfs  512 2023-08-25 12:31 hdfs:///user/eub/output/part-0.gz
> -rw-rw-r--+  3 eub hdfs  574 2023-08-25 12:31 hdfs:///user/eub/output/part-1.gz
> {code}
> I finally found that if I submit a Spark job with the {{--principal}} and
> {{--keytab}} options, the UGI will be different (refer to
> https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905).
> Only the file ({{_SUCCESS}}) and the output directory created by the driver
> (application master side) will respect {{hadoop.security.auth_to_local}} on
> the non-kerberized namenode, and only if the {{--principal}} and {{--keytab}}
> options are provided.
> No matter whether HDFS files or directories are created by the executor or
> the driver, they should respect the {{hadoop.security.auth_to_local}} rule
> and be consistent.
> A workaround is to pass an additional argument to change {{SPARK_USER}} on
> the executor side, e.g. {{--conf spark.executorEnv.SPARK_USER=_ex_eub}}.
> Note that {{--conf spark.yarn.appMasterEnv.SPARK_USER=_ex_eub}} causes an
> error: there is logic that appends environment values with {{:}} (colon) as
> a separator.
> - https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L893
> - https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L52



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44976) Preserve full principal user name on executor side

2023-12-07 Thread YUBI LEE (Jira)


[ https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759201#comment-17759201 ]

YUBI LEE edited comment on SPARK-44976 at 12/8/23 12:33 AM:


-https://github.com/apache/spark/pull/42690-

https://github.com/apache/spark/pull/44244


was (Author: eub):
https://github.com/apache/spark/pull/42690

> Preserve full principal user name on executor side
> --
>
> Key: SPARK-44976
> URL: https://issues.apache.org/jira/browse/SPARK-44976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3, 3.3.3, 3.4.1
>Reporter: YUBI LEE
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Comment Edited] (SPARK-44976) Preserve full principal user name on executor side

2023-12-07 Thread YUBI LEE (Jira)


[ https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759201#comment-17759201 ]

YUBI LEE edited comment on SPARK-44976 at 12/8/23 12:33 AM:


 

[https://github.com/apache/spark/pull/44244]


was (Author: eub):
-https://github.com/apache/spark/pull/42690-

https://github.com/apache/spark/pull/44244

> Preserve full principal user name on executor side
> --
>
> Key: SPARK-44976
> URL: https://issues.apache.org/jira/browse/SPARK-44976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3, 3.3.3, 3.4.1
>Reporter: YUBI LEE
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (SPARK-44976) Preserve full principal user name on executor side

2023-08-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-44976:
-
Summary: Preserve full principal user name on executor side  (was: 
Utils.getCurrentUserName should return the full principal name)

> Preserve full principal user name on executor side
> --
>
> Key: SPARK-44976
> URL: https://issues.apache.org/jira/browse/SPARK-44976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3, 3.3.3, 3.4.1
>Reporter: YUBI LEE
>Priority: Major
>






[jira] [Commented] (SPARK-44976) Utils.getCurrentUserName should return the full principal name

2023-08-26 Thread YUBI LEE (Jira)


[ https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759255#comment-17759255 ]

YUBI LEE commented on SPARK-44976:
--

I think it is also related to https://issues.apache.org/jira/browse/SPARK-31551.

> Utils.getCurrentUserName should return the full principal name
> --
>
> Key: SPARK-44976
> URL: https://issues.apache.org/jira/browse/SPARK-44976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3, 3.3.3, 3.4.1
>Reporter: YUBI LEE
>Priority: Major
>






[jira] [Commented] (SPARK-44976) Utils.getCurrentUserName should return the full principal name

2023-08-25 Thread YUBI LEE (Jira)


[ https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759201#comment-17759201 ]

YUBI LEE commented on SPARK-44976:
--

https://github.com/apache/spark/pull/42690

> Utils.getCurrentUserName should return the full principal name
> --
>
> Key: SPARK-44976
> URL: https://issues.apache.org/jira/browse/SPARK-44976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3, 3.3.3, 3.4.1
>Reporter: YUBI LEE
>Priority: Major
>






[jira] [Updated] (SPARK-44976) Utils.getCurrentUserName should return the full principal name

2023-08-25 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-44976:
-
Description: 
SPARK-6558 changed the behavior of {{Utils.getCurrentUserName()}} to use the short name instead of the full principal name.
Because of this, it doesn't respect the {{hadoop.security.auth_to_local}} rule on the side of a non-kerberized HDFS namenode.
For example, I use two HDFS clusters: one is kerberized, the other is not.
I made a rule that adds a prefix to the username on the non-kerberized cluster when someone accesses it from the kerberized cluster.


{code}
  <property>
    <name>hadoop.security.auth_to_local</name>
    <value>
      RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
      RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
      DEFAULT
    </value>
  </property>
{code}
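The substitution part of these rules behaves like a sed expression applied to the principal; a rough illustration of the intended effect (this is not Hadoop's actual `KerberosName` implementation, just the mapping the rules describe):

```java
// Illustrates the effect of the auth_to_local rules above: principals in the
// EXAMPLE.COM realm are rewritten to a "_ex_"-prefixed short name.
public class AuthToLocalDemo {
    static String translate(String principal) {
        if (principal.matches(".*@EXAMPLE\\.COM")) {
            // s/(.+)@.*/_ex_$1/ from the RULE definitions
            return principal.replaceAll("(.+)@.*", "_ex_$1");
        }
        // DEFAULT: strip the realm (simplified)
        return principal.split("@")[0];
    }

    public static void main(String[] args) {
        System.out.println(translate("eub@EXAMPLE.COM")); // prints: _ex_eub
    }
}
```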

However, if I submit a Spark job with the keytab & principal options, the ownership of the HDFS directories and files is not consistent.

(I changed some words for privacy.)

{code}
$ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23
Found 52 items
-rw-rw-rw-   3 _ex_eub hdfs          0 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/_SUCCESS
-rw-r--r--   3 eub  hdfs  134418857 2023-05-11 00:15 hdfs:///user/eub/some/path/20230510/23/part-0-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub  hdfs  153410049 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-1-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub  hdfs  157260989 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-2-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub  hdfs  156222760 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-3-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
{code}

Another interesting point is that if I submit a Spark job without the keytab and principal options, but with Kerberos authentication via {{kinit}}, it does not follow the {{hadoop.security.auth_to_local}} rule at all.

{code}
$ hdfs dfs -ls hdfs:///user/eub/output/
Found 3 items
-rw-rw-r--+  3 eub hdfs    0 2023-08-25 12:31 hdfs:///user/eub/output/_SUCCESS
-rw-rw-r--+  3 eub hdfs  512 2023-08-25 12:31 hdfs:///user/eub/output/part-0.gz
-rw-rw-r--+  3 eub hdfs  574 2023-08-25 12:31 hdfs:///user/eub/output/part-1.gz
{code}


I finally found that if I submit a Spark job with the {{--principal}} and {{--keytab}} options, the UGI will be different
(refer to https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905).

Only the file ({{_SUCCESS}}) and the output directory created by the driver (application master side) respect {{hadoop.security.auth_to_local}} on the non-kerberized namenode, and only when the {{--principal}} and {{--keytab}} options are provided.

No matter whether the HDFS files or directories are created by an executor or by the driver, they should respect the {{hadoop.security.auth_to_local}} rules in the same way.


A workaround is to pass an additional argument to change {{SPARK_USER}} on the executor side,
e.g. {{--conf spark.executorEnv.SPARK_USER=_ex_eub}}.

{{--conf spark.yarn.appMasterEnv.SPARK_USER=_ex_eub}} will cause an error: there is logic that appends environment values with {{:}} (a colon) as a separator.

- https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L893
- https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L52
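The failure mode with {{spark.yarn.appMasterEnv.SPARK_USER}} can be sketched as follows (a hypothetical minimal model of the append-with-separator logic in the files linked above, not the actual Spark code):

```python
# Sketch (assumption): when the key already exists in the environment map,
# the new value is joined with ':' instead of replacing the old one -- which
# is sensible for PATH-like variables but produces a bogus user name here.
def add_path_to_environment(env: dict, key: str, value: str) -> None:
    env[key] = env[key] + ":" + value if key in env else value

env = {"SPARK_USER": "eub"}                    # already set by the launcher
add_path_to_environment(env, "SPARK_USER", "_ex_eub")
print(env["SPARK_USER"])                       # prints "eub:_ex_eub"
```

Under this model, {{SPARK_USER}} on the application master ends up as a colon-joined string rather than a single user name, which explains the error; {{spark.executorEnv.SPARK_USER}} does not go through the same path.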



[jira] [Created] (SPARK-44976) Utils.getCurrentUserName should return the full principal name

2023-08-25 Thread YUBI LEE (Jira)
YUBI LEE created SPARK-44976:


 Summary: Utils.getCurrentUserName should return the full principal 
name
 Key: SPARK-44976
 URL: https://issues.apache.org/jira/browse/SPARK-44976
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.1, 3.3.3, 3.2.3
Reporter: YUBI LEE




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
If you try to start the Spark History Server with the shaded client jars and enable security using org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will hit the following exception.

{code}
# spark-env.sh
export SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter -Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
{code}


{code}
# spark history server's out file
22/10/27 15:29:48 INFO AbstractConnector: Started ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a javax.servlet.Filter
        at org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
        at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
        at org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
        at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
        at org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
        at org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
        at org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
        at org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
        at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
        at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
        at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
        at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
        at org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}
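The same check as the grep above can be done programmatically; here is a small sketch (the jar name and the class-name needle come from this report, while the helper function and paths are assumptions for illustration):

```python
import zipfile

def jar_mentions(jar_path: str, needle: bytes) -> bool:
    """Return True if any .class entry inside the jar (a zip archive)
    contains `needle` -- a crude stand-in for grepping an extracted jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return any(needle in jar.read(name)
                   for name in jar.namelist() if name.endswith(".class"))

# Example (assumes the jar is present in the current directory):
# jar_mentions("hadoop-client-runtime-3.3.1.jar",
#              b"org/apache/hadoop/shaded/javax/servlet/Filter")
```

Class references inside .class files use the slash-separated internal form, which is why the needle is written as a path rather than a dotted name.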

This causes the exception I mentioned.

I'm not sure what the best answer is.
A workaround is not to use the Spark distribution pre-built for Apache Hadoop, and instead to specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the Spark History Server.

Maybe the possible options are:
- Not to shade "javax.servlet.Filter" in the Hadoop shaded jar
- Or to shade "javax.servlet.Filter" in Jetty as well.

  was:
Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
If you try to start Spark History Server with shaded client jars and enable 
security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
meet following exception.

{code}
export 
SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
 
-Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
{code}


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: 

[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
If you try to start Spark History Server with shaded client jars and enable 
security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
meet following exception.

{code}
export 
SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
 
-Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
{code}


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
at 
org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}

It causes the exception I mentioned.

I'm not sure what is the best answer.
Workaround is not to use spark with pre-built for Apache Hadoop, specify 
`HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for Spark History 
Server.

May be the possible options are:
- Not to shade "javax.servlet.Filter" at hadoop shaded jar
- Or, shade "javax.servlet.Filter" also at jetty.

  was:
Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
If you try to start Spark History Server with shaded client jars and enable 
security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
meet following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
  

[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
If you try to start Spark History Server with shaded client jars and enable 
security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
meet following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
at 
org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}

It causes the exception I mentioned.

I'm not sure what is the best answer.
Workaround is not to use spark with pre-built for Apache Hadoop, specify 
`HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for Spark History 
Server.

May be the possible options are:
- Not to shade "javax.servlet.Filter" at hadoop shaded jar
- Or, shade "javax.servlet.Filter" also at jetty.

  was:
Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
In this situation, if you try to start Spark History Server with shaded client 
jars and enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.
You will meet following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 

[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses shaded client jars from Hadoop 3.x+.
In this situation, if you try to start Spark History Server with shaded client 
jars and enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.
You will meet following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a javax.servlet.Filter
	at org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
	at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
	at org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
	at org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
	at org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
	at org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
	at org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
	at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
	at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
	at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
	at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
	at org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
	at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
	at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar implements
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}

This causes the exception mentioned above.

I'm not sure what the best answer is.
A workaround is not to use the Spark distribution pre-built for Apache Hadoop;
instead, specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the
Spark History Server.
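The workaround above can be sketched as a spark-env.sh fragment. The paths here
are assumptions for a typical installation, not part of the original report;
adjust them to your own Hadoop layout:

```shell
# spark-env.sh -- illustrative sketch of the workaround, not an official recipe.
# Assumes a "Hadoop-free" Spark build and a local Hadoop installation at
# /opt/hadoop (both assumptions; adjust to your environment).
export HADOOP_HOME=/opt/hadoop
# Put the real (unshaded) Hadoop jars on the History Server's classpath:
export SPARK_DIST_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)
```

With this, the History Server loads AuthenticationFilter from the unshaded jars,
whose Filter interface matches the javax.servlet.Filter that Jetty expects.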

Possible options may be:
- Do not shade "javax.servlet.Filter" in the Hadoop shaded jar.
- Or, also shade "javax.servlet.Filter" in Spark's bundled Jetty.


[jira] [Created] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)
YUBI LEE created SPARK-40964:


 Summary: Cannot run spark history server with shaded hadoop jar
 Key: SPARK-40964
 URL: https://issues.apache.org/jira/browse/SPARK-40964
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.2.2
Reporter: YUBI LEE





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40072) MAVEN_OPTS in make-distributions.sh is different from one specified in pom.xml

2022-08-14 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40072:
-
Description: 
Building Spark with make-distribution.sh fails with the default settings because
the default MAVEN_OPTS differs from the one specified in pom.xml.
It is related to
[SPARK-35825|https://issues.apache.org/jira/browse/SPARK-35825].


PR: https://github.com/apache/spark/pull/37510
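A minimal user-side sketch of working around the mismatch: export MAVEN_OPTS
yourself before running the build. The JVM option values below are assumptions
for illustration, not taken from this report; verify them against the
MAVEN_OPTS your pom.xml actually specifies:

```shell
# Hedged sketch: set MAVEN_OPTS to match pom.xml before invoking
# ./dev/make-distribution.sh, so the build JVM gets the options the build
# expects. The values here are assumptions -- check your pom.xml first.
export MAVEN_OPTS="-Xss128m -Xmx4g -XX:ReservedCodeCacheSize=128m"
```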




> MAVEN_OPTS in make-distributions.sh is different from one specified in pom.xml
> --
>
> Key: SPARK-40072
> URL: https://issues.apache.org/jira/browse/SPARK-40072
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.2
>Reporter: YUBI LEE
>Priority: Minor
>
> Building Spark with make-distribution.sh fails with the default settings
> because the default MAVEN_OPTS differs from the one specified in pom.xml.
>  It is related to 
> [SPARK-35825|https://issues.apache.org/jira/browse/SPARK-35825].
> PR: https://github.com/apache/spark/pull/37510






[jira] [Created] (SPARK-40072) MAVEN_OPTS in make-distributions.sh is different from one specified in pom.xml

2022-08-14 Thread YUBI LEE (Jira)
YUBI LEE created SPARK-40072:


 Summary: MAVEN_OPTS in make-distributions.sh is different from one 
specified in pom.xml
 Key: SPARK-40072
 URL: https://issues.apache.org/jira/browse/SPARK-40072
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.2.2
Reporter: YUBI LEE


Building Spark with make-distribution.sh fails with the default settings because
the default MAVEN_OPTS differs from the one specified in pom.xml.
It is related to
[SPARK-35825|https://issues.apache.org/jira/browse/SPARK-35825].



