[jira] [Created] (SPARK-48241) CSV parsing failure with char/varchar type columns

2024-05-11 Thread Jiayi Liu (Jira)
Jiayi Liu created SPARK-48241:
-

 Summary: CSV parsing failure with char/varchar type columns
 Key: SPARK-48241
 URL: https://issues.apache.org/jira/browse/SPARK-48241
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1
Reporter: Jiayi Liu
 Fix For: 4.0.0


Selecting from a CSV table that contains char or varchar columns fails with the 
following error:
{code:java}
java.lang.IllegalArgumentException: requirement failed: requiredSchema 
(struct) should be the subset of dataSchema 
(struct).
    at scala.Predef$.require(Predef.scala:281)
    at 
org.apache.spark.sql.catalyst.csv.UnivocityParser.(UnivocityParser.scala:56)
    at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
    at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
    at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125){code}
The error occurs because the StringType columns in the dataSchema and 
requiredSchema of UnivocityParser are inconsistent: the StringType StructField 
in the dataSchema carries char/varchar metadata that is missing from the 
requiredSchema. We need to retain this metadata when resolving the schema.
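The failing requirement can be illustrated with a plain-Python sketch of the 
subset check (hypothetical field tuples, not Spark's actual StructField class; 
the metadata key mirrors Spark's internal char/varchar annotation):

```python
# Sketch of the UnivocityParser requirement: every field of the pruned
# requiredSchema must also appear, metadata included, in the full
# dataSchema. A field is modeled here as a hashable tuple; field
# equality covers the metadata, so an annotation present on only one
# side breaks the subset check.
def field(name, dtype, metadata=None):
    return (name, dtype, tuple(sorted((metadata or {}).items())))

data_schema = {field("c", "string", {"__CHAR_VARCHAR_TYPE_STRING": "char(5)"})}

# Pruned schema resolved without carrying the metadata over: the field
# no longer compares equal, so the subset requirement fails.
required_bad = {field("c", "string")}
subset_before_fix = required_bad <= data_schema    # False -> "requirement failed"

# Retaining the metadata while resolving the schema restores the match.
required_good = {field("c", "string", {"__CHAR_VARCHAR_TYPE_STRING": "char(5)"})}
subset_after_fix = required_good <= data_schema    # True
```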
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45834) Make Pearson correlation calculation more stable

2023-11-07 Thread Jiayi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Liu updated SPARK-45834:
--
Description: 
Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson 
Correlation Coefficient. If {{xMk}} and {{yMk}} are very small, the 
double-precision product {{xMk * yMk}} can underflow to 0, making the 
denominator 0 and producing an Infinity result.

For example, when calculating the correlation for the same columns a and b in a 
table, the result will be Infinity, but the correlation for identical columns 
should be 1.0 instead.
||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|

Modifying the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} solves this issue and 
improves the stability of the calculation. Splitting the square root of the 
denominator into two factors, {{sqrt(xMk)}} and {{sqrt(yMk)}}, avoids 
multiplication overflow as well as cases where the product of two extremely 
small values underflows to zero.
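The underflow and the fix can be demonstrated with plain double-precision 
arithmetic (a Python sketch of the co-moment formulas, not Spark's actual 
implementation):

```python
import math

def comoments(xs, ys):
    """Return (ck, xMk, yMk): the co-moment and the sums of squared
    deviations that the Pearson formula divides by."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ck = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    xMk = sum((x - mx) ** 2 for x in xs)
    yMk = sum((y - my) ** 2 for y in ys)
    return ck, xMk, yMk

a = [1e-200, 1e-200, 1e-100]
ck, xMk, yMk = comoments(a, a)

# Original form: xMk * yMk (~1e-401) underflows below the smallest
# subnormal double (~4.9e-324), so the denominator becomes 0.0 and
# IEEE division turns the correlation into Infinity.
bad_denominator = math.sqrt(xMk * yMk)        # 0.0

# Split form: each sqrt (~8e-101) is representable, and the
# correlation of a column with itself stays finite, approximately 1.0.
stable_corr = ck / math.sqrt(xMk) / math.sqrt(yMk)
```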
 
 

  was:
Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson 
Correlation Coefficient. If {{xMk}} and {{yMk}} are very small, it can lead to 
double multiplication overflow, resulting in a denominator of 0. This leads to 
a NaN result in the calculation.

For example, when calculating the correlation for the same columns a and b in a 
table, the result will be Infinity, but the correlation for identical columns 
should be 1.0 instead.
||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|

Modifying the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} can indeed solve this 
issue and improve the stability of the calculation. The benefit of this 
modification is that it splits the square root of the denominator into two 
parts: {{sqrt(xMk)}} and {{{}sqrt(yMk){}}}. This helps avoid multiplication 
overflow or cases where the product of extremely small values becomes zero.
 
 


> Make Pearson correlation calculation more stable
> ---
>
> Key: SPARK-45834
> URL: https://issues.apache.org/jira/browse/SPARK-45834
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jiayi Liu
>Priority: Major
>
> Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson 
> Correlation Coefficient. If {{xMk}} and {{yMk}} are very small, the 
> double-precision product {{xMk * yMk}} can underflow to 0, making the 
> denominator 0 and producing an Infinity result.
> For example, when calculating the correlation for the same columns a and b in 
> a table, the result will be Infinity, but the correlation for identical 
> columns should be 1.0 instead.
> ||a||b||
> |1e-200|1e-200|
> |1e-200|1e-200|
> |1e-100|1e-100|
> Modifying the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} solves this issue and 
> improves the stability of the calculation. Splitting the square root of the 
> denominator into two factors, {{sqrt(xMk)}} and {{sqrt(yMk)}}, avoids 
> multiplication overflow as well as cases where the product of two extremely 
> small values underflows to zero.
>  
>  






[jira] [Commented] (SPARK-42947) Spark Thriftserver LDAP should not use DN pattern if user contains domain

2023-03-28 Thread Jiayi Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705990#comment-17705990
 ] 

Jiayi Liu commented on SPARK-42947:
---

Issue fixed by https://github.com/apache/spark/pull/40577

> Spark Thriftserver LDAP should not use DN pattern if user contains domain
> -
>
> Key: SPARK-42947
> URL: https://issues.apache.org/jira/browse/SPARK-42947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiayi Liu
>Priority: Major
>
> When the LDAP provider has a domain configured, such as Active Directory, 
> the principal should not be constructed from the DN pattern; instead, the 
> username containing the domain should be passed directly to the LDAP 
> provider as the principal. We can refer to the implementation of Hive's 
> LdapUtils.
> When the username contains a domain, or a domain is supplied via the 
> hive.server2.authentication.ldap.Domain configuration, constructing the 
> principal from the DN pattern (for example, uid=user@domain,dc=test,dc=com) 
> yields the following error:
> {code:java}
> 23/03/28 11:01:48 ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: Error validating the login
>   at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:108)
>  ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
>   at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:537)
>  ~[libthrift-0.12.0.jar:0.12.0]
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
> ~[libthrift-0.12.0.jar:0.12.0]
>   at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:43)
>  ~[libthrift-0.12.0.jar:0.12.0]
>   at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:223)
>  ~[libthrift-0.12.0.jar:0.12.0]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:293)
>  ~[libthrift-0.12.0.jar:0.12.0]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_352]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_352]
>   at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352]
> Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
> user
>   at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:76)
>  ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
>   at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
>  ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
>   at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
>  ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
>   ... 8 more
> Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - 
> 80090308: LdapErr: DSID-0C0903D9, comment: AcceptSecurityContext error, data 
> 52e, v2580]
>   at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3261) 
> ~[?:1.8.0_352]
>   at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3207) 
> ~[?:1.8.0_352]
>   at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2993) 
> ~[?:1.8.0_352]
>   at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2907) ~[?:1.8.0_352]
>   at com.sun.jndi.ldap.LdapCtx.(LdapCtx.java:347) ~[?:1.8.0_352]
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxFromUrl(LdapCtxFactory.java:229) 
> ~[?:1.8.0_352]
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:189) 
> ~[?:1.8.0_352]
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:247) 
> ~[?:1.8.0_352]
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:154) 
> ~[?:1.8.0_352]
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:84) 
> ~[?:1.8.0_352]
>   at 
> javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:695) 
> ~[?:1.8.0_352]
>   at 
> javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313) 
> ~[?:1.8.0_352]
>   at javax.naming.InitialContext.init(InitialContext.java:244) 
> ~[?:1.8.0_352]
>   at javax.naming.InitialContext.(InitialContext.java:216) 
> ~[?:1.8.0_352]
>   at 
> javax.naming.directory.InitialDirContext.(InitialDirContext.java:101) 
> ~[?:1.8.0_352]
>   at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:73)
>  ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
>   at 
> 

[jira] [Updated] (SPARK-42947) Spark Thriftserver LDAP should not use DN pattern if user contains domain

2023-03-28 Thread Jiayi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Liu updated SPARK-42947:
--
Description: 
When the LDAP provider has a domain configured, such as Active Directory, the 
principal should not be constructed from the DN pattern; instead, the username 
containing the domain should be passed directly to the LDAP provider as the 
principal. We can refer to the implementation of Hive's LdapUtils.

When the username contains a domain, or a domain is supplied via the 
hive.server2.authentication.ldap.Domain configuration, constructing the 
principal from the DN pattern (for example, uid=user@domain,dc=test,dc=com) 
yields the following error:


{code:java}
23/03/28 11:01:48 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: Error validating the login
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:108)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:537)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:43)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:223)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:293)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_352]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_352]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352]
Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
user
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:76)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
... 8 more
Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - 
80090308: LdapErr: DSID-0C0903D9, comment: AcceptSecurityContext error, data 
52e, v2580]
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3261) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3207) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2993) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2907) ~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.(LdapCtx.java:347) ~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxFromUrl(LdapCtxFactory.java:229) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:189) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:247) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:154) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:84) 
~[?:1.8.0_352]
at 
javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:695) 
~[?:1.8.0_352]
at 
javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313) 
~[?:1.8.0_352]
at javax.naming.InitialContext.init(InitialContext.java:244) 
~[?:1.8.0_352]
at javax.naming.InitialContext.(InitialContext.java:216) 
~[?:1.8.0_352]
at 
javax.naming.directory.InitialDirContext.(InitialDirContext.java:101) 
~[?:1.8.0_352]
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:73)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
... 8 more
{code}


We should pass user@domain directly to the LDAP provider, just as HiveServer2 
does.
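The intended selection logic can be sketched as follows (a hypothetical helper 
modeled on Hive's LdapUtils; the function name and the %s placeholder 
convention are illustrative, not Spark's actual code):

```python
# Hypothetical bind-principal selection: the DN pattern is only applied
# when the user is not already domain-qualified and no domain is
# configured, matching the behavior this issue asks for.
def bind_principal(user, dn_pattern=None, domain=None):
    if "@" in user:
        return user                            # user already carries a domain
    if domain:
        return user + "@" + domain             # hive.server2.authentication.ldap.Domain
    if dn_pattern:
        return dn_pattern.replace("%s", user)  # e.g. "uid=%s,dc=test,dc=com"
    return user

# A domain-qualified user bypasses the DN pattern entirely:
p1 = bind_principal("user@domain", dn_pattern="uid=%s,dc=test,dc=com")
# An unqualified user without a configured domain still uses the pattern:
p2 = bind_principal("user", dn_pattern="uid=%s,dc=test,dc=com")
```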


[jira] [Updated] (SPARK-42947) Spark Thriftserver LDAP should not use DN pattern if user contains domain

2023-03-28 Thread Jiayi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Liu updated SPARK-42947:
--
Description: 
When the LDAP provider has a domain configured, such as Active Directory, the 
principal should not be constructed from the DN pattern; instead, the username 
containing the domain should be passed directly to the LDAP provider as the 
principal. We can refer to the implementation of Hive's LdapUtils.

When the username contains a domain, or a domain is supplied via the 
hive.server2.authentication.ldap.Domain configuration, constructing the 
principal from the DN pattern (for example, uid=user@domain,dc=test,dc=com) 
yields the following error:


{code:java}
23/03/28 11:01:48 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: Error validating the login
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:108)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:537)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:43)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:223)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:293)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_352]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_352]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352]
Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
user
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:76)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
... 8 more
Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - 
80090308: LdapErr: DSID-0C0903D9, comment: AcceptSecurityContext error, data 
52e, v2580]
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3261) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3207) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2993) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2907) ~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.(LdapCtx.java:347) ~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxFromUrl(LdapCtxFactory.java:229) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:189) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:247) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:154) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:84) 
~[?:1.8.0_352]
at 
javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:695) 
~[?:1.8.0_352]
at 
javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313) 
~[?:1.8.0_352]
at javax.naming.InitialContext.init(InitialContext.java:244) 
~[?:1.8.0_352]
at javax.naming.InitialContext.(InitialContext.java:216) 
~[?:1.8.0_352]
at 
javax.naming.directory.InitialDirContext.(InitialDirContext.java:101) 
~[?:1.8.0_352]
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:73)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
... 8 more
{code}


We should pass user@domain directly to the LDAP provider, just as HiveServer2 
does.


[jira] [Updated] (SPARK-42947) Spark Thriftserver LDAP should not use DN pattern if user contains domain

2023-03-28 Thread Jiayi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Liu updated SPARK-42947:
--
Description: 
When the LDAP provider has a domain configured, such as Active Directory, the 
principal should not be constructed from the DN pattern; instead, the username 
containing the domain should be passed directly to the LDAP provider as the 
principal. We can refer to the implementation of Hive's LdapUtils.

When the username contains a domain, or a domain is supplied via the 
hive.server2.authentication.ldap.Domain configuration, constructing the 
principal from the DN pattern (for example, uid=user@domain,dc=test,dc=com) 
yields the following error:


{code:java}
23/03/28 11:01:48 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: Error validating the login
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:108)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:537)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:43)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:223)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:293)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_352]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_352]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352]
Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
user
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:76)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
... 8 more
Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - 
80090308: LdapErr: DSID-0C0903D9, comment: AcceptSecurityContext error, data 
52e, v2580]
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3261) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3207) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2993) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2907) ~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.(LdapCtx.java:347) ~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxFromUrl(LdapCtxFactory.java:229) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:189) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:247) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:154) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:84) 
~[?:1.8.0_352]
at 
javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:695) 
~[?:1.8.0_352]
at 
javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313) 
~[?:1.8.0_352]
at javax.naming.InitialContext.init(InitialContext.java:244) 
~[?:1.8.0_352]
at javax.naming.InitialContext.(InitialContext.java:216) 
~[?:1.8.0_352]
at 
javax.naming.directory.InitialDirContext.(InitialDirContext.java:101) 
~[?:1.8.0_352]
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:73)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
... 8 more
{code}


We should pass user@domain directly to the LDAP provider, just as HiveServer2 
does.


[jira] [Updated] (SPARK-42947) Spark Thriftserver LDAP should not use DN pattern if user contains domain

2023-03-28 Thread Jiayi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Liu updated SPARK-42947:
--
Description: 
When the LDAP provider has a domain configured, such as Active Directory, the 
principal should not be constructed from the DN pattern; instead, the username 
containing the domain should be passed directly to the LDAP provider as the 
principal. We can refer to the implementation of Hive's LdapUtils.

When the username contains a domain, or a domain is supplied via the 
hive.server2.authentication.ldap.Domain configuration, constructing the 
principal from the DN pattern (for example, uid=user@domain,dc=test,dc=com) 
yields the following error:
```
23/03/28 11:01:48 ERROR TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: Error validating the login
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:108)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:537)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:43)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:223)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:293)
 ~[libthrift-0.12.0.jar:0.12.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_352]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_352]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352]
Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
user
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:76)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
... 8 more
Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - 
80090308: LdapErr: DSID-0C0903D9, comment: AcceptSecurityContext error, data 
52e, v2580]
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3261) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3207) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2993) 
~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2907) ~[?:1.8.0_352]
at com.sun.jndi.ldap.LdapCtx.(LdapCtx.java:347) ~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxFromUrl(LdapCtxFactory.java:229) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:189) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:247) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:154) 
~[?:1.8.0_352]
at 
com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:84) 
~[?:1.8.0_352]
at 
javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:695) 
~[?:1.8.0_352]
at 
javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313) 
~[?:1.8.0_352]
at javax.naming.InitialContext.init(InitialContext.java:244) 
~[?:1.8.0_352]
at javax.naming.InitialContext.(InitialContext.java:216) 
~[?:1.8.0_352]
at 
javax.naming.directory.InitialDirContext.(InitialDirContext.java:101) 
~[?:1.8.0_352]
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:73)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:105)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:101)
 ~[spark-hive-thriftserver_2.12-3.3.1.jar:3.3.1]
... 8 more
```

We should pass user@domain directly to the LDAP provider, just as HiveServer2 
does.


[jira] [Commented] (SPARK-42947) Spark Thriftserver LDAP should not use DN pattern if user contains domain

2023-03-28 Thread Jiayi Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705938#comment-17705938
 ] 

Jiayi Liu commented on SPARK-42947:
---

I will try to fix this.

> Spark Thriftserver LDAP should not use DN pattern if user contains domain
> -
>
> Key: SPARK-42947
> URL: https://issues.apache.org/jira/browse/SPARK-42947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiayi Liu
>Priority: Major
>
> When the LDAP provider has a domain configured, such as Active Directory, 
> the principal should not be constructed from the DN pattern; instead, the 
> username containing the domain should be passed directly to the LDAP 
> provider as the principal. We can refer to the implementation of Hive's 
> LdapUtils.






[jira] [Updated] (SPARK-42947) Spark Thriftserver LDAP should not use DN pattern if user contains domain

2023-03-28 Thread Jiayi Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Liu updated SPARK-42947:
--
Summary: Spark Thriftserver LDAP should not use DN pattern if user contains 
domain  (was: Spark Thriftserver should not use dn pattern if user contains 
domain)

> Spark Thriftserver LDAP should not use DN pattern if user contains domain
> -
>
> Key: SPARK-42947
> URL: https://issues.apache.org/jira/browse/SPARK-42947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiayi Liu
>Priority: Major
>
> When the LDAP provider has a domain configured, such as Active Directory, 
> the principal should not be constructed from the DN pattern; instead, the 
> username containing the domain should be passed directly to the LDAP 
> provider as the principal. We can refer to the implementation of Hive's 
> LdapUtils.






[jira] [Created] (SPARK-42947) Spark Thriftserver should not use dn pattern if user contains domain

2023-03-28 Thread Jiayi Liu (Jira)
Jiayi Liu created SPARK-42947:
-

 Summary: Spark Thriftserver should not use dn pattern if user 
contains domain
 Key: SPARK-42947
 URL: https://issues.apache.org/jira/browse/SPARK-42947
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jiayi Liu


When the LDAP provider includes domain configuration, such as Active Directory, 
the principal should not be constructed from the DN pattern; instead, the user 
name containing the domain should be passed directly to the LDAP provider as 
the principal. We can refer to the implementation of Hive's LdapUtils.






[jira] [Commented] (SPARK-38217) insert overwrite failed for external table with dynamic partition table

2022-12-12 Thread Jiayi Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646464#comment-17646464
 ] 

Jiayi Liu commented on SPARK-38217:
---

This is because Spark deletes the partition directories being overwritten, but 
Hive is unaware of this; Hive then throws an exception when it calls listStatus 
on, or tries to delete, a directory that no longer exists, which causes 
loadPartition to terminate.
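The direction of a fix can be sketched with standard `java.nio.file` calls. This is a hypothetical illustration of idempotent cleanup, not Hive's actual replaceFiles/loadPartition code: checking for existence first (and using `deleteIfExists`) means a second INSERT OVERWRITE does not fail when Spark has already removed the directory behind Hive's back.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartitionCleaner {
    /**
     * Hypothetical sketch: remove a partition directory only if it still
     * exists, so cleanup stays idempotent when another process has already
     * deleted it. Returns true if anything was deleted.
     */
    public static boolean deleteIfPresent(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false; // already gone: nothing to clean, no FileNotFoundException
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Sort deepest-first so children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.deleteIfExists(p); // tolerate concurrent deletion as well
        }
        return true;
    }
}
```

Calling `deleteIfPresent` twice on the same partition directory is safe; the second call simply returns false instead of throwing, which is the behavior the stack traces below lack.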

> insert overwrite failed for external table with dynamic partition table
> ---
>
> Key: SPARK-38217
> URL: https://issues.apache.org/jira/browse/SPARK-38217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: YuanGuanhu
>Priority: Major
>
> INSERT OVERWRITE of a dynamic-partition external table fails. Steps to 
> reproduce with Spark 3.2.1 and Hadoop 3.2:
> sql("CREATE EXTERNAL TABLE exttb01(id int) PARTITIONED BY (p1 string, p2 
> string) STORED AS PARQUET LOCATION '/tmp/exttb01'")
> sql("set spark.sql.hive.convertMetastoreParquet=false")
> sql("set hive.exec.dynamic.partition.mode=nonstrict")
> val insertsql = "INSERT OVERWRITE TABLE exttb01 PARTITION(p1='n1', p2) SELECT 
> * FROM VALUES (1, 'n2'), (2, 'n3'), (3, 'n4') AS t(id, p2)"
> sql(insertsql)
> sql(insertsql)
> When the INSERT OVERWRITE is executed a second time, it fails:
>  
> WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n4 cannot be cleaned: 
> java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not 
> exist
> java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not 
> exist
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n3 cannot 
> be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 
> does not exist
> java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 does not 
> exist
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n2 cannot 
> be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 
> does not exist
> java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 does not 
> exist
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at