[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-11 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534889#comment-17534889
 ] 

Gabor Somogyi commented on SPARK-25355:
---

I've had a deeper look, and the HADOOP_TOKEN_FILE_LOCATION problem is a generic 
issue with --proxy-user, not specific to K8s. Fixing only the K8s side is a step 
in the right direction, but from my perspective it is not enough.

> Support --proxy-user for Spark on K8s
> -
>
> Key: SPARK-25355
> URL: https://issues.apache.org/jira/browse/SPARK-25355
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Stavros Kontopoulos
>Assignee: Pedro Rossi
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: client.log, driver.log, screenshot-1.png, 
> with_proxy_extradebugLogs.log
>
>
> SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition 
> needed is the support for proxy user. A proxy user is impersonated by a 
> superuser who executes operations on behalf of the proxy user. More on this: 
> [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html]
> [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md]
> This has been implemented for Yarn upstream and Spark on Mesos here:
> [https://github.com/mesosphere/spark/pull/26]
> [~ifilonenko] creating this issue according to our discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-05 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532191#comment-17532191
 ] 

Gabor Somogyi commented on SPARK-25355:
---

[~unamesk15] I think the only option for now is to obtain the tokens externally, 
outside of Spark, and pass them in via the 
"spark.kubernetes.kerberos.tokenSecret..." configs.
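To illustrate the suggestion, here is a minimal sketch of wiring externally obtained delegation tokens into a K8s spark-submit invocation via the pre-populated-secret configs. The two config keys come from Spark's "Running on Kubernetes" security docs; the master URL, secret name, item key, and jar path are hypothetical examples, and the secret itself would have to be created out of band (e.g. a token file produced with `hdfs fetchdt` and uploaded with `kubectl create secret generic`):

```python
# Sketch (assumptions: secret "hadoop-tokens" with key "hadoop.token"
# already exists in the target namespace; paths/URLs are placeholders).

def submit_command(app_jar: str, secret_name: str, item_key: str) -> list[str]:
    """Build a spark-submit argv that points the driver at a K8s secret
    holding a serialized Hadoop delegation token file."""
    return [
        "spark-submit",
        "--master", "k8s://https://kube-apiserver:6443",
        "--deploy-mode", "cluster",
        # Tell Spark which secret, and which key inside it, hold the tokens:
        "--conf", f"spark.kubernetes.kerberos.tokenSecret.name={secret_name}",
        "--conf", f"spark.kubernetes.kerberos.tokenSecret.itemKey={item_key}",
        app_jar,
    ]

cmd = submit_command("local:///opt/spark/examples.jar", "hadoop-tokens", "hadoop.token")
print(" ".join(cmd))
```

With this approach the tokens are refreshed by whatever external process re-creates the secret, so Spark itself never needs a TGT on the driver pod.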




[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-05 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532135#comment-17532135
 ] 

Gabor Somogyi commented on SPARK-25355:
---

After some playground work, code digging, and your additional log analysis, I see 
what's going on:
 * Spark obtains the 3 tokens mentioned earlier on the submit side
 * It ships them to the driver via HADOOP_TOKEN_FILE_LOCATION
 * The driver starts, and here comes the trick
 * UserGroupInformation [loads 
tokens|https://github.com/apache/hadoop/blob/2b9a8c1d3a2caf1e733d57f346af3ff0d5ba529c/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L740-L766]
 when the loginUser is created (this is in fact UGI initialization), i.e. before 
the proxy user even exists
 * The proxy user is later created with no tokens
 * Authentication finally fails on the driver side because there are no credentials
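The ordering problem above can be sketched as a minimal simulation (this is plain Python standing in for the Hadoop UserGroupInformation behaviour, not real Hadoop code; the function and user names are illustrative):

```python
# Simulation of the described flow: the login user picks up the tokens from
# HADOOP_TOKEN_FILE_LOCATION during UGI initialization, but a proxy UGI
# created afterwards starts with empty credentials of its own.

class UGI:
    def __init__(self, user, tokens=None):
        self.user = user
        self.tokens = list(tokens or [])

def login_user_from_token_file(token_file_contents):
    # Corresponds to UserGroupInformation loading HADOOP_TOKEN_FILE_LOCATION
    # while creating loginUser (UGI initialization).
    return UGI("realUser", tokens=token_file_contents)

def create_proxy_user(proxy_name, real_ugi):
    # Corresponds to creating the proxy user afterwards: in the flow described
    # above, it does not end up with the real user's delegation tokens.
    return UGI(proxy_name)

login = login_user_from_token_file(["HDFS_DELEGATION_TOKEN", "KMS_DT", "HIVE_DT"])
proxy = create_proxy_user("proxyUser", login)
print(len(login.tokens), len(proxy.tokens))  # -> 3 0: the proxy user has no tokens
```

Any RPC the driver then issues as the proxy user finds no credentials to authenticate with, matching the failure seen in the logs.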

I've taken a look at the design doc referenced from 
[https://github.com/mesosphere/spark/pull/26] and it states the following:
!screenshot-1.png! 
Bullet point 7 may have been true for Mesos in 2018, but it is definitely not 
working with K8s now.
In the current Spark codebase only the executors use runAsSparkUser; the driver 
does not (so it runs as the proxy user without tokens).

So here is my opinion, based on the facts we have so far (which may change).
Adding the --proxy-user param for K8s was a good idea, but:
 * either it was never tested on a cluster at all, or it was tested on a 
different execution path
 * or it was tested and worked on a cluster, but something significant changed 
in other parts of the code after the merge (Mar 17, 2020)
 * all in all, what I see is that the feature is now completely broken

[~pedro.rossi] any comments? According to the latest findings this is a feature 
blocker.





[jira] [Updated] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-05 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-25355:
--
Attachment: screenshot-1.png




[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-04 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531777#comment-17531777
 ] 

Gabor Somogyi commented on SPARK-25355:
---

Hmmm, it seems I've already debugged something like this before and created a 
project to print out UGI credentials: 
https://github.com/gaborgsomogyi/hadoop-token-trace




[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-04 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531743#comment-17531743
 ] 

Gabor Somogyi commented on SPARK-25355:
---

Now I see where the driver blows up and why no Spark logs are available.

The submit side obtains a token for each of the HA namenodes, one by one:
{code:java}
22/05/04 04:13:07 DEBUG KMSClientProvider: Getting new token from 
http://nn.com:9292/kms/v1/, renewer:proxyUser
...
22/05/04 04:13:07 DEBUG KMSClientProvider: Getting new token from 
http://:9292/kms/v1/, renewer:proxyUser
{code}
The driver, however, tries to reach the following HDFS file during jar globbing:
{code:java}
22/04/26 08:54:39 DEBUG HAUtilClient: No HA service delegation token found for 
logical URI 
hdfs://dpinonprod:8020/tmp/spark-upload-bf713a0c-166b-43fc-a5e6-24957e75b224/spark-examples_2.12-3.0.1.jar
{code}
I'm not sure where this "dpinonprod" node comes from, but Spark has not obtained 
a token for that host.
I've previously seen this kind of AccessControlException when HA and 
single-namenode addresses are mixed up in the configs.
I would go through the configs and double-check them (default FS, additional FS, 
etc.).
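Why a config mismatch surfaces as a missing token can be sketched as follows (a plain-Python stand-in for the Hadoop token-selector behaviour, not real Hadoop code; the service names are taken from the logs above and otherwise illustrative):

```python
# Hadoop selects a delegation token by the filesystem's service name, so
# tokens obtained for the physical namenode address are invisible when the
# client resolves a different (here: logical HA) URI for the same cluster.

def select_token(token_map, service):
    """Mimics a token selector: exact match on the service name, or nothing."""
    return token_map.get(service)

# Tokens obtained on the submit side, keyed by the service they were issued for:
tokens = {
    "nn.com:8020": "HDFS_DELEGATION_TOKEN",
}

# Driver-side lookup during jar globbing against the logical HA URI:
print(select_token(tokens, "ha-hdfs:dpinonprod"))  # -> None: no token for that service
```

So if `fs.defaultFS` and the HA nameservice config disagree between the submit side and the driver, the driver looks up a service name no token was issued for, and authentication falls through to Kerberos (which the driver pod does not have).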

If this doesn't help, then maybe trace-agent can be used to dump the UGI 
credentials: https://github.com/gaborgsomogyi/trace-agent
Please note that dumping the whole UGI is not yet supported, only security 
credentials, so it requires some real effort...
 




[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-04 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531646#comment-17531646
 ] 

Gabor Somogyi commented on SPARK-25355:
---

> (IllegalArgumentException: Empty cookie header string) -> It's not supposed 
> to have any impact: https://issues.apache.org/jira/browse/HDFS-15136

OK, this is fixed in the Hadoop 3.1.1 version in use: 
https://github.com/apache/hadoop/blob/7caf768a8c9a639b6139b2cae8656c89e3d8c58d/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/client/AuthenticatedURL.java#L101
Let's assume things look good on the submit side.

The driver needs to re-obtain tokens because, in a failure scenario, the tokens 
provided by the submit client may already be outdated; streaming workloads have 
failed in exactly such scenarios. When I look at the driver logs I have no idea 
what's going on there, because no Spark-related log entries are available. I 
presume the log4j.properties stripped out all the useful info. I would expect to 
see at least the Spark version logged when the SparkContext is created: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L195

My ask is to enable Spark-related log entries in the driver log so we can see 
what's going on. The most important is the "org.apache.spark.deploy.security" 
package at DEBUG level, which is where the token handling sits.
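For reference, a minimal sketch of enabling that logger with the log4j 1.x properties format that Spark 3.1 ships by default (assuming the stock conf/log4j.properties is otherwise unchanged):

```
# conf/log4j.properties (excerpt): keep the existing rootCategory setting,
# and raise only the token-handling package to DEBUG
log4j.logger.org.apache.spark.deploy.security=DEBUG
```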





[jira] [Comment Edited] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-04 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531523#comment-17531523
 ] 

Gabor Somogyi edited comment on SPARK-25355 at 5/4/22 8:00 AM:
---

After the attached logs, now I see more.

HADOOP_TOKEN_FILE_LOCATION has never worked with a proxy user, and that is not 
going to change.
You guys have 2 options:
 * You provide tokens in HADOOP_TOKEN_FILE_LOCATION: in this case UGI picks up 
the tokens for the current user and authenticates with them. Nothing stops you 
from generating these tokens for the proxy user manually from your custom code. 
In this case the --proxy-user config is not needed and it will work like a charm.
 * You set the --proxy-user config: in this case Spark obtains tokens for the 
proxy user, authenticating with the real user's Kerberos credentials. Looking at 
the logs, Spark tries to obtain tokens for the following external service types:
{code:java}
22/05/04 04:13:07 DEBUG HadoopDelegationTokenManager: Using the following 
builtin delegation token providers: hadoopfs, hbase, hive.
22/05/04 04:13:07 DEBUG UserGroupInformation: PrivilegedAction as:proxyUser 
(auth:PROXY) via / (auth:KERBEROS) 
from:org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:146)
{code}
After a while Spark's built-in Hadoop FS delegation token provider kicks in and 
tries to obtain a token, as expected:
{code:java}
22/05/04 04:13:07 DEBUG HadoopFSDelegationTokenProvider: Delegation token 
renewer is: proxyUser
22/05/04 04:13:07 INFO HadoopFSDelegationTokenProvider: getting token for: 
DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1812449855_1, ugi=proxyUser 
(auth:PROXY) via / (auth:KERBEROS)]] with renewer 
proxyUser
22/05/04 04:13:07 DEBUG Client: IPC Client (1939869193) connection to 
nn.com/:8020 from proxyUser sending #6 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getDelegationToken
22/05/04 04:13:07 DEBUG Client: IPC Client (1939869193) connection to 
nn.com/:8020 from proxyUser got value #6
22/05/04 04:13:07 DEBUG ProtobufRpcEngine: Call: getDelegationToken took 2ms
22/05/04 04:13:07 INFO DFSClient: Created token for proxyUser: 
HDFS_DELEGATION_TOKEN owner=proxyUser, renewer=proxyUser, 
realUser=/, issueDate=1651637587347, 
maxDate=1652242387347, sequenceNumber=183545, masterKeyId=606 on ha-hdfs:
22/05/04 04:13:07 DEBUG Client: IPC Client (1939869193) connection to 
nn.com/:8020 from proxyUser sending #7 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getServerDefaults
22/05/04 04:13:07 DEBUG Client: IPC Client (1939869193) connection to 
nn.com/:8020 from proxyUser got value #7
22/05/04 04:13:07 DEBUG ProtobufRpcEngine: Call: getServerDefaults took 0ms
22/05/04 04:13:07 DEBUG KMSClientProvider: KMSClientProvider for KMS url: 
http://nn.com:9292/kms/v1/ delegation token service: :9292 created.
22/05/04 04:13:07 DEBUG KMSClientProvider: KMSClientProvider for KMS url: 
http://:9292/kms/v1/ delegation token service: 10.207.184.25:9292 
created.
22/05/04 04:13:07 DEBUG KMSClientProvider: Current UGI: proxyUser (auth:PROXY) 
via / (auth:KERBEROS)
22/05/04 04:13:07 DEBUG KMSClientProvider: Real UGI: / 
(auth:KERBEROS)
22/05/04 04:13:07 DEBUG KMSClientProvider: Login UGI: / 
(auth:KERBEROS)
22/05/04 04:13:07 DEBUG UserGroupInformation: PrivilegedAction 
as:/ (auth:KERBEROS) 
from:org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:1037)
22/05/04 04:13:07 DEBUG KMSClientProvider: Getting new token from 
http://nn.com:9292/kms/v1/, renewer:proxyUser
22/05/04 04:13:07 DEBUG DelegationTokenAuthenticator: No delegation token found 
for 
url=http://nn.com:9292/kms/v1/?op=GETDELEGATIONTOKEN=proxyUser=proxyUser,
 token=, authenticating with class 
org.apache.hadoop.security.token.delegation.web.KerberosDelegationTokenAuthenticator$1
22/05/04 04:13:07 DEBUG KerberosAuthenticator: JDK performed authentication on 
our behalf.
22/05/04 04:13:07 DEBUG AuthenticatedURL: Cannot parse cookie header: 
java.lang.IllegalArgumentException: Empty cookie header string
at java.net.HttpCookie.parseInternal(HttpCookie.java:826)
at java.net.HttpCookie.parse(HttpCookie.java:202)
at java.net.HttpCookie.parse(HttpCookie.java:178)
at 
org.apache.hadoop.security.authentication.client.AuthenticatedURL$AuthCookieHandler.put(AuthenticatedURL.java:99)
at 
org.apache.hadoop.security.authentication.client.AuthenticatedURL.extractToken(AuthenticatedURL.java:390)
at 
org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:196)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:147)
at 
org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:348)
at 


[jira] [Commented] (SPARK-39033) Support --proxy-user for Spark on K8s not working

2022-05-03 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531430#comment-17531430
 ] 

Gabor Somogyi commented on SPARK-39033:
---

Simply put, the logs are unusable, just as I've mentioned in SPARK-25355 and on 
the dev list.
Here a ConnectException remains, which is even worse than in SPARK-25355. Please 
improve the logs, or this issue is going to be left as-is... or maybe somebody 
will be so kind as to do the reproduction work for you, which is less probable...

> Support --proxy-user for Spark on K8s not working
> -
>
> Key: SPARK-39033
> URL: https://issues.apache.org/jira/browse/SPARK-39033
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: jagadeesh
>Priority: Major
>
> we are running into problem when we submit spark job with --proxy-user on 
> K8s. 
> here are the setups follows, 
>  * Service id is configured properly in HDFS side .
>  
> {code:java}
> <property>
>   <name>hadoop.proxyuser.serviceid.groups</name>
>   <value>*</value>
> </property>
> <property>
>   <name>hadoop.proxyuser.serviceid.hosts</name>
>   <value>*</value>
> </property>
> <property>
>   <name>hadoop.proxyuser.serviceid.users</name>
>   <value>*</value>
> </property>
> {code}
>  
>  * Getting service id Kerberos ticket in spark client.
>  * Running spark job without --proxy-user connecting to Kerberized HDFS 
> cluster  - {color:#00875a}WORKS AS EXPECTED .{color}
>  * Running spark job with --proxy-user= connecting to Kerberized 
> HDFS cluster - {color:#de350b}FAILS{color}
> {code:java}
> $SPARK_HOME/bin/spark-submit \
>     --master  \
>     --deploy-mode cluster \
>     --proxy-user  \
>     --name spark-javawordcount \
>     --class org.apache.spark.examples.JavaWordCount \
>     --conf spark.kubernetes.container.image=\
>     --conf spark.kubernetes.driver.podTemplateFile=driver.yaml \
>     --conf spark.kubernetes.executor.podTemplateFile=executor.yaml \
>     --conf spark.kubernetes.container.image.pullPolicy=Always \
>     --conf spark.kubernetes.driver.limit.cores=1 \
>     --conf spark.executor.instances=2 \
>     --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>     --conf spark.kubernetes.namespace= \
>     --conf spark.eventLog.enabled=true \
>     --conf spark.eventLog.dir=hdfs://:8020/scaas/shs_logs \
>     --conf spark.kubernetes.file.upload.path=hdfs://:8020/tmp \
>     $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar 
> /user//input{code}
>  
>  * ERROR logs from Driver pod
>  
> {code:java}
> ++ id -u
> + myuid=185
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 185
> + uidentry=
> + set -e
> + '[' -z '' ']'
> + '[' -w /etc/passwd ']'
> + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
> + SPARK_CLASSPATH=':/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sort -t_ -k4 -n
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> + '[' -n '' ']'
> + '[' -z ']'
> + '[' -z ']'
> + '[' -n '' ']'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*'
> + case "$1" in
> + shift 1
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress= --deploy-mode client --proxy-user 
>  --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.JavaWordCount spark-internal /user//input
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor 
> java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 22/04/21 17:50:30 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 22/04/21 17:50:30 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to the 
> server : org.apache.hadoop.security.AccessControlException: Client cannot 
> authenticate via:[TOKEN, KERBEROS]
> 22/04/21 17:50:31 WARN Client: Exception encountered while connecting to the 
> server : org.apache.hadoop.security.AccessControlException: Client cannot 
> authenticate via:[TOKEN, KERBEROS]
> 22/04/21 17:50:37 WARN Client: Exception 

[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-02 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530812#comment-17530812
 ] 

Gabor Somogyi commented on SPARK-25355:
---

> Thanks for looking further. Your assumption that 3 tokens loaded from 
> HADOOP_TOKEN_FILE_LOCATION are not compatible to do the authentication is 
> wrong.

Please be aware that I'm one of the authors of this delegation token framework, 
so I'm not guessing; I know exactly what's going on. The only question is what 
you guys are planning and doing :)

Since you've not yet provided the full logs, the master plan, or how the 
authentication is supposed to work, I'm asking some simple questions. If they 
are not answered, I'm not able to help you move forward.
 * I've asked for full driver and executor logs, but we've received only a 
Hadoop-specific snippet. Can we get the full logs as asked? If they are too 
large, store them externally or something.
 * In spark-submit you specify cluster-mode deployment:
{code:java}
...
--deploy-mode cluster \
...
{code}
but in the log I see client mode:
{code:java}
...
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=10.4.201.155 --deploy-mode client --proxy-user 
shrprasa --properties-file /opt/spark/conf/spark.properties --class 
org.apache.spark.examples.SparkPi spark-internal
...
{code}
So which one is the source of truth? It has a major influence on how security 
works. Hunting multiple issues at once is not fun (same issue as the 
ConnectionRefused thread on the dev mailing list). So the ask here is to provide 
full logs and a submit command that belong together.

 * What is the master plan to provide a TGT for the current user on the driver 
POD? I'm asking it because this is the only way to ask Spark to obtain a 
delegation token for the proxy user. But since the logs are partial I'm also 
not able to tell what happened there.
 * What is the main intention behind using HADOOP_TOKEN_FILE_LOCATION? That is 
mainly used to load tokens for the current user, not for the proxy user. Handing 
any of those tokens over to the proxy user is never going to happen, because that 
would be a security breach.
 * And finally, which token do you expect to use for authentication against HDFS: 
one Spark obtained, or one loaded from HADOOP_TOKEN_FILE_LOCATION?
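For reference, a minimal sketch of what "providing a TGT for the current user on the driver pod" could look like, assuming a superuser keytab has been mounted into the pod (the pod name, keytab path, principal and realm below are placeholders, not taken from the logs above):

```shell
# Log in as the superuser inside the driver pod so a TGT lands in the
# credential cache; only with that TGT can Spark request a delegation
# token for the proxy user on top of it.
kubectl exec <driver-pod> -- kinit -kt /etc/security/keytabs/superuser.keytab superuser@EXAMPLE.COM

# Verify the TGT is actually present in the credential cache.
kubectl exec <driver-pod> -- klist
```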

[~pedro.rossi] how was this tested on a cluster? The description of the PR 
doesn't say anything about that.

> Support --proxy-user for Spark on K8s
> -
>
> Key: SPARK-25355
> URL: https://issues.apache.org/jira/browse/SPARK-25355
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Stavros Kontopoulos
>Assignee: Pedro Rossi
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition 
> needed is the support for proxy user. A proxy user is impersonated by a 
> superuser who executes operations on behalf of the proxy user. More on this: 
> [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html]
> [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md]
> This has been implemented for Yarn upstream and Spark on Mesos here:
> [https://github.com/mesosphere/spark/pull/26]
> [~ifilonenko] creating this issue according to our discussion.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-02 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530692#comment-17530692
 ] 

Gabor Somogyi commented on SPARK-25355:
---

I've had a further look and found the following:
{code:java}
...
22/04/26 08:54:40 DEBUG SaslRpcClient: Sending sasl message state: 
NEGOTIATE22/04/26 08:54:40 DEBUG SaslRpcClient: Get token info proto:interface 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolPB 
info:@org.apache.hadoop.security.token.TokenInfo(value=org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSelector.class)
22/04/26 08:54:40 DEBUG SaslRpcClient: tokens aren't supported for this 
protocol or user doesn't have one
22/04/26 08:54:40 DEBUG SaslRpcClient: client isn't using kerberos
22/04/26 08:54:40 DEBUG UserGroupInformation: PrivilegedActionException as:185 
(auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client 
cannot authenticate via:[TOKEN, KERBEROS]
...
{code}

This means you told Spark to load 3 tokens from HADOOP_TOKEN_FILE_LOCATION, and 
those tokens cannot be used for this authentication.

Since I have no idea what the original intention was, all I can suggest is to 
provide the proper tokens in the HADOOP_TOKEN_FILE_LOCATION file.
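As an illustration only (the renewer principal, keytab and paths are placeholders), one way to produce an HDFS delegation token file that HADOOP_TOKEN_FILE_LOCATION can point at is the `hdfs fetchdt` utility:

```shell
# Obtain a TGT first, then fetch an HDFS delegation token into a file.
kinit -kt superuser.keytab superuser@EXAMPLE.COM
hdfs fetchdt --renewer superuser /mnt/secrets/hadoop-credentials/hadoop-tokens

# Print the tokens in the file to verify what was actually written.
hdfs fetchdt --print /mnt/secrets/hadoop-credentials/hadoop-tokens
```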





[jira] [Comment Edited] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-02 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530678#comment-17530678
 ] 

Gabor Somogyi edited comment on SPARK-25355 at 5/2/22 11:32 AM:


Guys, when I look at the logs and hear what you say, I honestly don't fully 
understand what you're doing :)

You say that you run kinit, which creates a TGT in the user's credential cache 
on the local machine. Please be aware that this TGT is NOT transferred to the 
cluster by default.
On the other hand, the driver is reading credentials from a file:
{code:java}
...
22/04/26 08:54:39 DEBUG UserGroupInformation: Loaded 3 tokens
22/04/26 08:54:39 DEBUG UserGroupInformation: UGI loginUser:185 (auth:SIMPLE)
22/04/26 08:54:39 DEBUG UserGroupInformation: PrivilegedAction as:shrprasa 
(auth:PROXY) via 185 (auth:SIMPLE) 
...
22/04/26 08:54:38 DEBUG UserGroupInformation: Reading credentials from location 
set in HADOOP_TOKEN_FILE_LOCATION: 
/mnt/secrets/hadoop-credentials/..2022_04_26_08_54_34.1262645511/hadoop-tokens
...
{code}

One can authenticate with either credential (TGT or HADOOP_TOKEN_FILE_LOCATION), 
so which one is the plan and which one is a side effect?

As a general suggestion, client-mode Kerberos authentication suffers from many 
issues, especially around the TGT, so it is not advised. If you want a peaceful 
life, I warmly suggest using a keytab :)
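For illustration, a hedged sketch of a keytab-based submission (the master URL, principal, realm and keytab path are placeholders); with --principal and --keytab Spark logs itself in and can renew delegation tokens, so no TGT has to be shipped to the cluster:

```shell
# Keytab-based login handled by Spark itself; the proxy user is
# impersonated by the superuser principal given below.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --proxy-user shrprasa \
  --principal superuser@EXAMPLE.COM \
  --keytab /path/to/superuser.keytab \
  --class org.apache.spark.examples.SparkPi \
  spark-internal
```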










[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-02 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530650#comment-17530650
 ] 

Gabor Somogyi commented on SPARK-25355:
---

If the issue still persists, please provide the community with the submit 
command together with the driver and executor logs...




[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-05-02 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530649#comment-17530649
 ] 

Gabor Somogyi commented on SPARK-25355:
---

Then there are 2 issues on your side, guys. In the following thread a 
ConnectionRefused exception is mentioned: 
https://lists.apache.org/thread/lcn90cs9b0m848yfd5g4ksxsqwkmqbts

So which one is the issue, or is it both?




[jira] [Comment Edited] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-04-30 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530370#comment-17530370
 ] 

Gabor Somogyi edited comment on SPARK-25355 at 4/30/22 9:50 AM:


Please make sure that the namenode host:port is reachable from the driver pod. 
If that works, the ConnectionRefused must go away and only an 
AccessControlException remains (if authentication is still an issue).








[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s

2022-04-30 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530369#comment-17530369
 ] 

Gabor Somogyi commented on SPARK-25355:
---

I've answered this question in the community, but I'll copy it here to track it 
in the jira:

Please be aware that a ConnectionRefused exception has nothing to do with 
authentication. See the description from the Hadoop wiki:
"You get a ConnectionRefused Exception when there is a machine at the address 
specified, but there is no program listening on the specific TCP port the 
client is using -and there is no firewall in the way silently dropping TCP 
connection requests. If you do not know what a TCP connection request is, 
please consult the specification."

This means the namenode host:port is not reachable at the TCP layer. There may 
be multiple issues, but I'm pretty sure something is wrong in the K8s network 
config.
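A quick TCP-level check (a sketch; the pod name, namenode host and port are placeholders, and `nc` may need to be present in the image) can separate a network problem from an authentication problem:

```shell
# Succeeds only if something is listening on the namenode port and the
# K8s network/DNS setup lets the driver pod reach it; a "Connection
# refused" here points at networking, not Kerberos.
kubectl exec <driver-pod> -- nc -zv <namenode-host> 8020
```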





[jira] [Commented] (SPARK-28173) Add Kafka delegation token proxy user support

2022-03-07 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502162#comment-17502162
 ] 

Gabor Somogyi commented on SPARK-28173:
---

It looks like there is willingness to merge the Kafka part. I hope we can push 
this forward relatively soon and finish this feature.

> Add Kafka delegation token proxy user support
> -
>
> Key: SPARK-28173
> URL: https://issues.apache.org/jira/browse/SPARK-28173
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In SPARK-26592 I've turned off proxy user usage because 
> https://issues.apache.org/jira/browse/KAFKA-6945 is not yet implemented. 
> Since the KIP will be under discussion and hopefully implemented, here is this 
> jira to track the Spark-side effort.






[jira] [Commented] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-23 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447829#comment-17447829
 ] 

Gabor Somogyi commented on SPARK-37391:
---

[~hyukjin.kwon] thanks for pinging me. I've added my comment here: 
https://github.com/apache/spark/pull/29024/files#r754476290
The problem and the surrounding constraints are described there. If somebody has 
a meaningful solution, please share it.

To sum it up here: a single JVM has only one security context, and JDBC clients 
can read authentication credentials only from there, which is the bottleneck.


> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
> ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!






[jira] [Commented] (SPARK-36765) Spark Support for MS Sql JDBC connector with Kerberos/Keytab

2021-09-17 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416519#comment-17416519
 ] 

Gabor Somogyi commented on SPARK-36765:
---

It was a long time ago that I did this, and AFAIR it took me almost a month to 
make it work, so it's definitely a horror task!
My knowledge is cloudy since it was not yesterday, but I remember something 
like this:

The exception generally indicates that the driver cannot find the appropriate 
sqljdbc_auth library in the JVM library path. To correct the problem, one can 
use the java -D option to specify the "java.library.path" system property 
value. Worth mentioning that the full path must be set, otherwise it does not 
work.
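For example, a sketch of passing the native library location to both the driver and the executors (the /opt/mssql/native path is a placeholder for wherever the sqljdbc_auth library actually lives):

```shell
# Point java.library.path at the directory containing sqljdbc_auth;
# the full directory path is required on both driver and executors.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Djava.library.path=/opt/mssql/native" \
  --conf "spark.executor.extraJavaOptions=-Djava.library.path=/opt/mssql/native" \
  ...
```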

All in all, I've faced at least 5-6 different issues which were extremely hard 
to address. I hope others need less time to solve them.


> Spark Support for MS Sql JDBC connector with Kerberos/Keytab
> 
>
> Key: SPARK-36765
> URL: https://issues.apache.org/jira/browse/SPARK-36765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Unix Redhat Environment
>Reporter: Dilip Thallam Sridhar
>Priority: Major
> Fix For: 3.1.2
>
>
> Hi Team,
>  
> We are using the Spark-3.0.2 to connect to MS SqlServer with the following 
> instruction  
> Also tried with the Spark-3.1.2 Version,
>  
>  1) download mssql-jdbc-9.4.0.jre8.jar
>  2) Generated Keytab using kinit
>  3) Validate Keytab using klist
>  4) Run the spark job with jdbc_library, principal and keytabs passed
> .config("spark.driver.extraClassPath", spark_jar_lib) \
> .config("spark.executor.extraClassPath", spark_jar_lib) \
>  5) connection_url = 
> "jdbc:sqlserver://{}:{};databaseName={};integratedSecurity=true;authenticationSchema=JavaKerberos"\
>  .format(jdbc_host_name, jdbc_port, jdbc_database_name)
> Note: without integratedSecurity=true;authenticationSchema=JavaKerberos it 
> looks for the usual username/password option to connect
> 6) passing the following options during spark read.
>  .option("principal", database_principal) \
>  .option("files", database_keytab) \
>  .option("keytab", database_keytab) \
>   
>  tried with files and keytab, just files, and with all above 3 parameters
>   
>  We are unable to connect to SqlServer from Spark and getting the following 
> error shown below. 
>   
>  A) Wanted to know if anybody was successful Spark to SqlServer? (as I see 
> the previous Jira has been closed)
>  https://issues.apache.org/jira/browse/SPARK-12312
>  https://issues.apache.org/jira/browse/SPARK-31337
>   
>  B) If yes, could you let us know if there are any additional configs needed 
> for Spark to connect to SqlServer please?
>  Appreciate if we can get inputs to resolve this error.
>   
>   
>  Full Stack Trace
> {code}
> Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is 
> not configured for integrated authentication. at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1352)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:2329)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:1905)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:41)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:1893)
>  at 
> com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4575) 
> at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1400)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1045)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:817)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:700)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:842)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.SecureConnectionProvider.getConnection(SecureConnectionProvider.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider.org$apache$spark$sql$execution$datasources$jdbc$connection$MSSQLConnectionProvider$$super$getConnection(MSSQLConnectionProvider.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider$$anon$1.run(MSSQLConnectionProvider.scala:69)
>  at 
> 

[jira] [Commented] (SPARK-31460) spark-sql-kafka source in spark 2.4.4 causes reading stream failure frequently

2021-08-16 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17399594#comment-17399594
 ] 

Gabor Somogyi commented on SPARK-31460:
---

When such a thing happens, the Kafka connector is super slow and/or stuck in an 
infinite loop, so the Kafka logs need to be checked to see why it's happening...
Apart from that, we've made quite some improvements on the 3.x line, so please 
double-check the behavior with the latest Spark version.

> spark-sql-kafka source in spark 2.4.4 causes reading stream failure frequently
> --
>
> Key: SPARK-31460
> URL: https://issues.apache.org/jira/browse/SPARK-31460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: vinay
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In spark 2.4.4 , it provides a source "spark-sql-kafka-0-10_2.11".
>  
> When I wanted to read from my kafka-0.10.2.11 cluster, it throws out an error 
> "*java.util.concurrent.TimeoutException: Cannot fetch record for offset x 
> in 1000 milliseconds*"  frequently, and the job thus failed.
>  
> I see this issue was seen before in 2.3 according to ticket 23829 and an 
> upgrade to spark 2.4 was supposed to solve this.
>  
> {code:java}
> compile group: 'org.apache.spark', name: 'spark-sql-kafka-0-10_2.11', 
> version: '2.4.4'{code}
> Here is the error stack.
> {code:java}
> org.apache.spark.SparkException: Writing job aborted.
>  
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:92)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
>  org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296)
>  
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
>  org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788)
>  org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788)
>  org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
>  
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
>  org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:540)
>  
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
>  
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>  
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>  
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> 

[jira] [Created] (SPARK-35993) Flaky test: org.apache.spark.sql.execution.streaming.state.RocksDBSuite.ensure that concurrent update and cleanup consistent versions

2021-07-02 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-35993:
-

 Summary: Flaky test: 
org.apache.spark.sql.execution.streaming.state.RocksDBSuite.ensure that 
concurrent update and cleanup consistent versions
 Key: SPARK-35993
 URL: https://issues.apache.org/jira/browse/SPARK-35993
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 3.1.2
Reporter: Gabor Somogyi


Appeared in Jenkins: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/140575/testReport/org.apache.spark.sql.execution.streaming.state/RocksDBSuite/ensure_that_concurrent_update_and_cleanup_consistent_versions/

{code:java}
Error Message
java.io.FileNotFoundException: File 
/home/jenkins/workspace/SparkPullRequestBuilder@2/target/tmp/spark-21674620-ac83-4ad3-a153-5a7adf909244/20.zip
 does not exist
Stacktrace
sbt.ForkMain$ForkError: java.io.FileNotFoundException: File 
/home/jenkins/workspace/SparkPullRequestBuilder@2/target/tmp/spark-21674620-ac83-4ad3-a153-5a7adf909244/20.zip
 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372)
at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:74)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.spark.util.Utils$.unzipFilesFromFile(Utils.scala:3132)
at 
org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.loadCheckpointFromDfs(RocksDBFileManager.scala:174)
at 
org.apache.spark.sql.execution.streaming.state.RocksDB.load(RocksDB.scala:103)
at 
org.apache.spark.sql.execution.streaming.state.RocksDBSuite.withDB(RocksDBSuite.scala:443)
at 
org.apache.spark.sql.execution.streaming.state.RocksDBSuite.$anonfun$new$57(RocksDBSuite.scala:397)
at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
at 
org.apache.spark.sql.execution.streaming.state.RocksDBSuite.$anonfun$new$56(RocksDBSuite.scala:341)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
at scala.collection.immutable.List.foreach(List.scala:431)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
at org.scalatest.Suite.run(Suite.scala:1112)
at org.scalatest.Suite.run$(Suite.scala:1094)
at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
at 

[jira] [Commented] (SPARK-33223) Expose state information on SS UI

2021-05-04 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338901#comment-17338901
 ] 

Gabor Somogyi commented on SPARK-33223:
---

[~smilegator] sure, filed SPARK-35311 and preparing a PR.

> Expose state information on SS UI
> -
>
> Key: SPARK-33223
> URL: https://issues.apache.org/jira/browse/SPARK-33223
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Created] (SPARK-35311) Add exposed SS UI state information metrics to the documentation

2021-05-04 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-35311:
-

 Summary: Add exposed SS UI state information metrics to the 
documentation
 Key: SPARK-35311
 URL: https://issues.apache.org/jira/browse/SPARK-35311
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: Gabor Somogyi









[jira] [Closed] (SPARK-34383) Optimize WAL commit phase on SS

2021-03-22 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-34383.
-

> Optimize WAL commit phase on SS
> ---
>
> Key: SPARK-34383
> URL: https://issues.apache.org/jira/browse/SPARK-34383
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.2.0
>
>
> I found there're unnecessary access / expensive operation of file system in 
> WAL commit phase of SS.
> They can be optimized via caching (using driver memory a bit) & replacing FS 
> operation. This brings reduced latency per batch, especially checkpoint 
> against object store.
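The caching idea described in the quoted ticket can be illustrated with a toy sketch. This is Python for brevity (Spark's actual implementation is Scala) and every name below (`CountingFS`, `CommitLog`, `latest_batch_id`) is a hypothetical stand-in, not Spark's API: the point is simply that remembering what the driver just wrote avoids per-batch `list`/`exists` round trips against a slow object store.

```python
class CountingFS:
    """Toy stand-in for a slow object-store filesystem; counts expensive calls."""
    def __init__(self):
        self.files = {}
        self.list_calls = 0

    def write(self, path, data):
        self.files[path] = data

    def list(self, prefix):
        self.list_calls += 1  # each list() is a slow round trip on e.g. S3
        return [p for p in self.files if p.startswith(prefix)]


class CommitLog:
    """Caches the latest committed batch id in driver memory, so the per-batch
    commit path does one write instead of write + list/exists round trips."""
    def __init__(self, fs):
        self.fs = fs
        self._latest = None  # cached; None means "not loaded yet"

    def add(self, batch_id, metadata):
        self.fs.write(f"commits/{batch_id}", metadata)
        self._latest = batch_id  # remember instead of re-listing next time

    def latest_batch_id(self):
        if self._latest is None:  # only touch the FS on cold start/restart
            ids = [int(p.rsplit("/", 1)[1]) for p in self.fs.list("commits/")]
            self._latest = max(ids) if ids else None
        return self._latest


fs = CountingFS()
log = CommitLog(fs)
for batch in range(3):
    log.add(batch, "metadata")
    log.latest_batch_id()  # served from the cache, no FS round trip
assert log.latest_batch_id() == 2
assert fs.list_calls == 0  # without the cache, every check would hit the FS
```

The trade-off is exactly the one the ticket mentions: a little driver memory in exchange for lower per-batch latency, with the filesystem consulted only on cold start (e.g. after a driver restart).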






[jira] [Resolved] (SPARK-34383) Optimize WAL commit phase on SS

2021-03-22 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-34383.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31495
https://github.com/apache/spark/pull/31495

> Optimize WAL commit phase on SS
> ---
>
> Key: SPARK-34383
> URL: https://issues.apache.org/jira/browse/SPARK-34383
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.2.0
>
>
> I found there're unnecessary access / expensive operation of file system in 
> WAL commit phase of SS.
> They can be optimized via caching (using driver memory a bit) & replacing FS 
> operation. This brings reduced latency per batch, especially checkpoint 
> against object store.






[jira] [Assigned] (SPARK-34383) Optimize WAL commit phase on SS

2021-03-22 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi reassigned SPARK-34383:
-

Assignee: Jungtaek Lim

> Optimize WAL commit phase on SS
> ---
>
> Key: SPARK-34383
> URL: https://issues.apache.org/jira/browse/SPARK-34383
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> I found there're unnecessary access / expensive operation of file system in 
> WAL commit phase of SS.
> They can be optimized via caching (using driver memory a bit) & replacing FS 
> operation. This brings reduced latency per batch, especially checkpoint 
> against object store.






[jira] [Created] (SPARK-34580) Provide the relationship between batch ID and SQL executions (and/or Jobs) in SS UI page

2021-03-01 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-34580:
-

 Summary: Provide the relationship between batch ID and SQL 
executions (and/or Jobs) in SS UI page
 Key: SPARK-34580
 URL: https://issues.apache.org/jira/browse/SPARK-34580
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.1.1
Reporter: Gabor Somogyi


The current SS UI page focuses on showing trends across batches, which is great 
for judging whether the streaming query is running healthily and for spotting 
oddness in a specific batch.

One thing that is still annoying is that all you can get from here is the batch 
ID (a number), which means you have to find the related SQL executions and Jobs 
manually using that batch ID. They are most likely among the recent runs of SQL 
executions/Jobs, so you may not need to search through many pages, but having to 
find them yourself is still tedious.

It would be nice if we could provide the relationship between batch ID and SQL 
executions (and probably Jobs as well, if there is enough space), with links to 
those pages, just as we link to job pages from the SQL execution page.






[jira] [Commented] (SPARK-34580) Provide the relationship between batch ID and SQL executions (and/or Jobs) in SS UI page

2021-03-01 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292787#comment-17292787
 ] 

Gabor Somogyi commented on SPARK-34580:
---

cc [~kabhwan] [~hyukjin.kwon] [~zsxwing] [~viirya]
It will take some time to come up with a solution that makes sense, but I've started.


> Provide the relationship between batch ID and SQL executions (and/or Jobs) in 
> SS UI page
> 
>
> Key: SPARK-34580
> URL: https://issues.apache.org/jira/browse/SPARK-34580
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Gabor Somogyi
>Priority: Major
>
> The current SS UI page focuses on showing trends across batches, which is 
> great for judging whether the streaming query is running healthily and for 
> spotting oddness in a specific batch.
> One thing that is still annoying is that all you can get from here is the 
> batch ID (a number), which means you have to find the related SQL executions 
> and Jobs manually using that batch ID. They are most likely among the recent 
> runs of SQL executions/Jobs, so you may not need to search through many 
> pages, but having to find them yourself is still tedious.
> It would be nice if we could provide the relationship between batch ID and SQL 
> executions (and probably Jobs as well, if there is enough space), with links 
> to those pages, just as we link to job pages from the SQL execution page.






[jira] [Updated] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-24 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-34497:
--
Description: 
Some of the built-in JDBC connection providers change the JVM security context 
to do the authentication, which is fine. The problematic part is that executors 
can be reused by another query. The following situation leads to incorrect 
behaviour:
 * Query1 opens a JDBC connection and changes the JVM security context in Executor1
 * Query2 tries to open a JDBC connection but realizes there is already an 
entry for that DB type in Executor1
 * Query2 does not change the JVM security context and uses Query1's keytab and 
principal
 * Query2 fails with an authentication error
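To make the failure mode concrete, here is a minimal sketch. It is Python for brevity (Spark's actual code is Scala), and `security_context`, `open_connection`, and `close_connection` are hypothetical stand-ins for the JVM-wide JAAS configuration and the provider API, not real Spark functions:

```python
security_context = {}  # stands in for the process-wide JAAS configuration

def open_connection(db_type, keytab, principal):
    # The bug: put-if-absent with no cleanup, so the first query's
    # credentials entry for this DB type wins forever on a reused executor.
    if db_type not in security_context:
        security_context[db_type] = (keytab, principal)
    return security_context[db_type]  # credentials actually used for auth

def close_connection(db_type):
    # The missing piece: drop the entry when the connection closes,
    # so the next query installs its own credentials.
    security_context.pop(db_type, None)

# Query1 on Executor1 installs its credentials.
assert open_connection("postgres", "q1.keytab", "q1@REALM") == ("q1.keytab", "q1@REALM")
# Executor reused: Query2 silently authenticates with Query1's credentials.
assert open_connection("postgres", "q2.keytab", "q2@REALM") == ("q1.keytab", "q1@REALM")
# With cleanup in place, Query2 gets its own entry.
close_connection("postgres")
assert open_connection("postgres", "q2.keytab", "q2@REALM") == ("q2.keytab", "q2@REALM")
```

The second assertion is exactly the scenario in the bullet list: Query2 never touches the security context and ends up presenting Query1's keytab and principal.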

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.2
>Reporter: Gabor Somogyi
>Priority: Major
>
> Some of the built-in JDBC connection providers change the JVM security 
> context to do the authentication, which is fine. The problematic part is that 
> executors can be reused by another query. The following situation leads to 
> incorrect behaviour:
>  * Query1 opens a JDBC connection and changes the JVM security context in Executor1
>  * Query2 tries to open a JDBC connection but realizes there is already an 
> entry for that DB type in Executor1
>  * Query2 does not change the JVM security context and uses Query1's keytab 
> and principal
>  * Query2 fails with an authentication error






[jira] [Commented] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-24 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289815#comment-17289815
 ] 

Gabor Somogyi commented on SPARK-34497:
---

Thanks for the suggestion, filed.

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.2
>Reporter: Gabor Somogyi
>Priority: Major
>
> Some of the built-in JDBC connection providers change the JVM security 
> context to do the authentication, which is fine. The problematic part is that 
> executors can be reused by another query. The following situation leads to 
> incorrect behaviour:
>  * Query1 opens a JDBC connection and changes the JVM security context in Executor1
>  * Query2 tries to open a JDBC connection but realizes there is already an 
> entry for that DB type in Executor1
>  * Query2 does not change the JVM security context and uses Query1's keytab 
> and principal
>  * Query2 fails with an authentication error






[jira] [Commented] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-22 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288387#comment-17288387
 ] 

Gabor Somogyi commented on SPARK-34497:
---

Working on this.

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Created] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-22 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-34497:
-

 Summary: JDBC connection provider is not removing kerberos 
credentials from JVM security context
 Key: SPARK-34497
 URL: https://issues.apache.org/jira/browse/SPARK-34497
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2, 3.1.0
Reporter: Gabor Somogyi









[jira] [Closed] (SPARK-12312) Support JDBC Kerberos w/ keytab

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-12312.
-

> Support JDBC Kerberos w/ keytab
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.1.0
>
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.






[jira] [Resolved] (SPARK-12312) Support JDBC Kerberos w/ keytab

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-12312.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

> Support JDBC Kerberos w/ keytab
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.1.0
>
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.






[jira] [Closed] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-31857.
-

> Support Azure SQLDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31857
> URL: https://issues.apache.org/jira/browse/SPARK-31857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Resolved] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-31884.
---
Resolution: Won't Do

> Support MongoDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31884
> URL: https://issues.apache.org/jira/browse/SPARK-31884
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Closed] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-31884.
-

> Support MongoDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31884
> URL: https://issues.apache.org/jira/browse/SPARK-31884
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Commented] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282280#comment-17282280
 ] 

Gabor Somogyi commented on SPARK-31884:
---

Since the API is implemented, it's now possible to add the mentioned feature as 
an external plugin, so I'm closing this Jira.
If committers or PMC members think this should be added as a built-in provider, 
feel free to re-open it.


> Support MongoDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31884
> URL: https://issues.apache.org/jira/browse/SPARK-31884
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Resolved] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-31857.
---
Resolution: Won't Do

> Support Azure SQLDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31857
> URL: https://issues.apache.org/jira/browse/SPARK-31857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Closed] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-31815.
-

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282279#comment-17282279
 ] 

Gabor Somogyi commented on SPARK-31857:
---

Since the API is implemented, it's now possible to add the mentioned feature as 
an external plugin, so I'm closing this Jira.
If committers or PMC members think this should be added as a built-in provider, 
feel free to re-open it.


> Support Azure SQLDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31857
> URL: https://issues.apache.org/jira/browse/SPARK-31857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Resolved] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-31815.
---
Resolution: Won't Do

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282277#comment-17282277
 ] 

Gabor Somogyi commented on SPARK-31815:
---

Since the API is implemented, it's now possible to add the mentioned feature as 
an external plugin, so I'm closing this Jira.
If committers or PMC members think this should be added as a built-in provider, 
feel free to re-open it.


> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-01-25 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271308#comment-17271308
 ] 

Gabor Somogyi commented on SPARK-34198:
---

+1 on this; I started reviewing it back when someone wanted to add it as an 
embedded dependency.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them need to keep 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large-state usage. But Spark SS 
> still lacks a built-in state store that meets this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, 
> our plan is to add this StateStore as an external module first.
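The quoted description can be sketched as a tiny versioned key-value store. This is an illustrative Python sketch with hypothetical names, not Spark's `StateStore` API: the contract is per-batch `begin`/`get`/`put`/`commit`, and the only difference between an HDFS-backed store and a RocksDB-backed one is whether the map below lives on the heap or on local disk.

```python
class InMemoryStateStore:
    """Hypothetical sketch of the StateStore contract: versioned get/put with
    an atomic commit per micro-batch. An heap-backed store keeps the whole map
    in memory; a RocksDB-backed store would keep the same contract but persist
    rows to local disk, so state size is no longer bounded by executor memory."""
    def __init__(self):
        self._committed = {}   # last committed version of the state
        self._pending = None   # working copy for the in-flight batch

    def begin(self):
        self._pending = dict(self._committed)  # start from the last version

    def get(self, key):
        return self._pending.get(key)

    def put(self, key, value):
        self._pending[key] = value

    def commit(self):
        # Atomically publish the new version; an aborted batch simply
        # discards _pending and leaves _committed untouched.
        self._committed, self._pending = self._pending, None


# Streaming-aggregation-style usage: count events per key across two batches.
store = InMemoryStateStore()
for batch in (["a", "b", "a"], ["b", "a"]):
    store.begin()
    for key in batch:
        store.put(key, (store.get(key) or 0) + 1)
    store.commit()
assert store._committed == {"a": 3, "b": 2}
```

A RocksDB-backed variant would replace the dict operations with RocksDB reads/writes plus checkpointing, which is why the proposal keeps it behind the same interface as an external module.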






[jira] [Created] (SPARK-34090) HadoopDelegationTokenManager.isServiceEnabled used in KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka stream processing in case of delegati

2021-01-12 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-34090:
-

 Summary: HadoopDelegationTokenManager.isServiceEnabled used in 
KafkaTokenUtil.needTokenUpdate needs to be cached because it slows down Kafka 
stream processing in case of delegation token
 Key: SPARK-34090
 URL: https://issues.apache.org/jira/browse/SPARK-34090
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.1.1
Reporter: Gabor Somogyi
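The fix this ticket asks for can be sketched as simple memoization. The sketch below is Python with toy stand-in classes (Spark's code is Scala, and `DelegationTokenManager` / `KafkaTokenChecker` here are hypothetical, not the real classes): the expensive configuration lookup behind `isServiceEnabled` is computed once and reused, so the per-record token check stops paying its cost.

```python
class DelegationTokenManager:
    """Toy stand-in: is_service_enabled is cheap here, but the real lookup
    inspects configuration on every call, too slow for a per-record hot path."""
    def __init__(self, conf):
        self.conf = conf
        self.lookups = 0

    def is_service_enabled(self, service):
        self.lookups += 1  # imagine an expensive config/classloading lookup
        return self.conf.get(f"spark.security.credentials.{service}.enabled",
                             "true") == "true"


class KafkaTokenChecker:
    """Sketch of the fix: compute the enabled flag once and cache it, so
    need_token_update() no longer pays the lookup cost per record."""
    def __init__(self, manager):
        self.manager = manager
        self._kafka_enabled = None  # cached after the first check

    def need_token_update(self):
        if self._kafka_enabled is None:
            self._kafka_enabled = self.manager.is_service_enabled("kafka")
        return self._kafka_enabled  # real code would also check token expiry


manager = DelegationTokenManager(
    {"spark.security.credentials.kafka.enabled": "true"})
checker = KafkaTokenChecker(manager)
for _ in range(1000):        # simulate a per-record hot path
    checker.need_token_update()
assert manager.lookups == 1  # the expensive lookup happened exactly once
```

Caching is safe here because the setting cannot change while the application runs, which is what makes the one-time computation sufficient.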









[jira] [Created] (SPARK-34032) Add Kafka delegation token truststore and keystore type configuration

2021-01-06 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-34032:
-

 Summary: Add Kafka delegation token truststore and keystore type 
configuration
 Key: SPARK-34032
 URL: https://issues.apache.org/jira/browse/SPARK-34032
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.1
Reporter: Gabor Somogyi









[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-19 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252153#comment-17252153
 ] 

Gabor Somogyi commented on SPARK-33635:
---

{quote}Remember, based on all my testing, and raw kafka reads on my system - 
the 3.0.1 spark is performing in line with expectations.{quote}
Good to hear. You don't have to hurry, since I'm on vacation for the rest of the 
year, unless a breaking issue appears in the upcoming Spark release.


> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in standalone mode - 5 nodes, 10 workers (2 workers per node, each 
> locked to a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partitions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all exhibit the same performance degradation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in reading data from Kafka on all of our systems 
> when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
> I have created a sample project to isolate the problem as much as possible; 
> it just reads all data from a Kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read]).
> With 2.4.5, across multiple runs, 
>  I get a stable read rate of 1,120,000 (1.12 mill) rows per second
> With 3.0.0 or 3.0.1, across multiple runs,
>  I get a stable read rate of 632,000 (0.632 mil) rows per second
> The represents a *44% loss in performance*. Which is, a lot.
> I have been working though the spark-sql-kafka-0-10 code base, but change for 
> spark 3 have been ongoing for over a year and its difficult to pin point an 
> exact change or reason for the degradation.
> I am happy to help fix this problem, but will need some assitance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: its not parsing csv - this is just 
> test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,






[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-13 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248550#comment-17248550
 ] 

Gabor Somogyi commented on SPARK-33635:
---

{quote}The collect in this test case is only 13 items of data after the group 
by - so I know thats not going to impact it.
 But I can modify it to just read and write to kafka.
{quote}
Yeah, we need to reduce the use case to the most minimal app so we measure only 
what we need. Aggregations and similar operations are not part of Kafka read and 
write performance.







[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-13 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248549#comment-17248549
 ] 

Gabor Somogyi commented on SPARK-33635:
---

I mixed it up with DStreams; in Structured Streaming and SQL there is no 
turn-off flag.







[jira] [Updated] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-12-11 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-28367:
--
Fix Version/s: 3.1.0

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Critical
> Fix For: 3.1.0
>
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in a livelock if metadata is not updated (for instance, when the broker 
> disappears at consumer creation).






[jira] [Closed] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-12-11 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-28367.
-







[jira] [Resolved] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-12-11 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-28367.
---
Resolution: Fixed







[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-12-11 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247813#comment-17247813
 ] 

Gabor Somogyi commented on SPARK-28367:
---

The issue is solved in the subtasks, so closing this.







[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read

2020-12-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246643#comment-17246643
 ] 

Gabor Somogyi edited comment on SPARK-33635 at 12/9/20, 4:39 PM:
-

{quote}I no longer believe this is a true regression in performance, I now 
think that 2.4.5 was "cheating".
{quote}
If by cheating you mean that Spark uses one consumer from multiple threads, the 
answer is no. A Kafka consumer must not be used from multiple threads.
 If that happens, Kafka detects it and throws an exception, which stops the 
query immediately.












[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246640#comment-17246640
 ] 

Gabor Somogyi commented on SPARK-33635:
---

I've changed the component to SQL because you're not executing a Structured 
Streaming query but a SQL batch.







[jira] [Updated] (SPARK-33635) Performance regression in Kafka read

2020-12-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-33635:
--
Component/s: (was: Structured Streaming)
 SQL







[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246633#comment-17246633
 ] 

Gabor Somogyi commented on SPARK-33635:
---

Since you're measuring speed: I've ported the Kafka source from DSv1 to DSv2. 
DSv1 is the default, but DSv2 can be tried out by setting 
"spark.sql.sources.useV1SourceList" accordingly. If you can try it out, I would 
appreciate it.
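A hedged sketch of how the switch could look on the command line. The config 
key comes from the comment above; the value shown (an empty list, i.e. no 
sources forced to V1) and the package coordinate are assumptions and may need 
adjusting for your Spark version; the script name is a placeholder:

```shell
# Assumption: removing "kafka" from the V1 source list makes Spark pick the
# DSv2 Kafka implementation; an empty list forces all built-in sources to V2.
spark-submit \
  --conf spark.sql.sources.useV1SourceList="" \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 \
  my_kafka_read_benchmark.py
```

Comparing the same benchmark run with and without this flag would isolate 
whether the DSv1/DSv2 code path matters for the measured read rate.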







[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246631#comment-17246631
 ] 

Gabor Somogyi commented on SPARK-33635:
---

BTW, I'm sure you know, but collect gathers all the data on the driver side, 
which is not recommended under any circumstances.







[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246625#comment-17246625
 ] 

Gabor Somogyi commented on SPARK-33635:
---

[~david.wyles] try to turn off Kafka consumer caching. Apart from that, there 
were no super significant changes which could cause this.

I've taken a look at your application and it does a groupBy and similar 
operations. These are not related to Kafka read performance, since the Spark SQL 
engine contains a huge amount of changes.
I suggest creating an application which just moves simple data from one topic 
into another, and please use the exact same broker version.
If it's still slow, we can measure further.


> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to 
> a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partiions, 1 hour queue life
> (this is just one of clusters we have, I have tested on all of them and 
> theyall exhibit the same performance degredation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in the reading of data from kafka on all of our 
> systems when migrating from spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1)
> I have created a sample project to isolate the problem as much as possible, 
> with just a read all data from a kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read] ).
> With 2.4.5, across multiple runs, 
>  I get a stable read rate of 1,120,000 (1.12 million) rows per second
> With 3.0.0 or 3.0.1, across multiple runs,
>  I get a stable read rate of 632,000 (0.632 million) rows per second
> This represents a *44% loss in performance*, which is a lot.
> I have been working through the spark-sql-kafka-0-10 code base, but changes 
> for Spark 3 have been ongoing for over a year and it's difficult to pinpoint 
> an exact change or reason for the degradation.
> I am happy to help fix this problem, but will need some assistance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: it's not parsing CSV - this is 
> just test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,






[jira] [Updated] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader

2020-12-02 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-32910:
--
Affects Version/s: (was: 3.2.0)
   3.1.0

> Remove UninterruptibleThread usage from KafkaOffsetReader
> -
>
> Key: SPARK-32910
> URL: https://issues.apache.org/jira/browse/SPARK-32910
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> We've talked about this here: 
> https://github.com/apache/spark/pull/29729#discussion_r488690731
> This jira stands only if the mentioned PR is merged.






[jira] [Updated] (SPARK-33633) Expose fine grained state information on SS UI

2020-12-02 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-33633:
--
Affects Version/s: (was: 3.2.0)
   3.1.0

> Expose fine grained state information on SS UI
> --
>
> Key: SPARK-33633
> URL: https://issues.apache.org/jira/browse/SPARK-33633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> SPARK-33223 provides aggregated information, but in order to find the 
> problematic parts, non-aggregated information must also be provided.
> Some investigation is needed to find out the best way to do that.






[jira] [Updated] (SPARK-33633) Expose fine grained state information on SS UI

2020-12-02 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-33633:
--
Description: 
SPARK-33223 provides aggregated information, but in order to find the 
problematic parts, non-aggregated information must also be provided.
Some investigation is needed to find out the best way to do that.

> Expose fine grained state information on SS UI
> --
>
> Key: SPARK-33633
> URL: https://issues.apache.org/jira/browse/SPARK-33633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> SPARK-33223 provides aggregated information, but in order to find the 
> problematic parts, non-aggregated information must also be provided.
> Some investigation is needed to find out the best way to do that.






[jira] [Created] (SPARK-33633) Expose fine grained state information on SS UI

2020-12-02 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33633:
-

 Summary: Expose fine grained state information on SS UI
 Key: SPARK-33633
 URL: https://issues.apache.org/jira/browse/SPARK-33633
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: Gabor Somogyi









[jira] [Commented] (SPARK-33629) spark.buffer.size not applied in driver from pyspark

2020-12-02 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242195#comment-17242195
 ] 

Gabor Somogyi commented on SPARK-33629:
---

I've started to work on this and am going to file a PR soon.

> spark.buffer.size not applied in driver from pyspark
> 
>
> Key: SPARK-33629
> URL: https://issues.apache.org/jira/browse/SPARK-33629
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> The problem has been discovered here: 
> [https://github.com/apache/spark/pull/30389#issuecomment-729524618]
>  






[jira] [Created] (SPARK-33629) spark.buffer.size not applied in driver from pyspark

2020-12-02 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33629:
-

 Summary: spark.buffer.size not applied in driver from pyspark
 Key: SPARK-33629
 URL: https://issues.apache.org/jira/browse/SPARK-33629
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Gabor Somogyi


The problem has been discovered here: 
[https://github.com/apache/spark/pull/30389#issuecomment-729524618]

 






[jira] [Updated] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader

2020-12-01 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-32910:
--
Affects Version/s: (was: 3.1.0)
   3.2.0

> Remove UninterruptibleThread usage from KafkaOffsetReader
> -
>
> Key: SPARK-32910
> URL: https://issues.apache.org/jira/browse/SPARK-32910
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> We've talked about this here: 
> https://github.com/apache/spark/pull/29729#discussion_r488690731
> This jira stands only if the mentioned PR is merged.






[jira] [Commented] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader

2020-12-01 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241403#comment-17241403
 ] 

Gabor Somogyi commented on SPARK-32910:
---

I think there will be no time to put this into 3.1, so I'm changing the target.

> Remove UninterruptibleThread usage from KafkaOffsetReader
> -
>
> Key: SPARK-32910
> URL: https://issues.apache.org/jira/browse/SPARK-32910
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> We've talked about this here: 
> https://github.com/apache/spark/pull/29729#discussion_r488690731
> This jira stands only if the mentioned PR is merged.






[jira] [Created] (SPARK-33491) Update Structured Streaming UI documentation page

2020-11-19 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33491:
-

 Summary: Update Structured Streaming UI documentation page
 Key: SPARK-33491
 URL: https://issues.apache.org/jira/browse/SPARK-33491
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming, Web UI
Affects Versions: 3.1.0
Reporter: Gabor Somogyi









[jira] [Commented] (SPARK-33143) Make SocketAuthServer socket timeout configurable

2020-11-16 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232824#comment-17232824
 ] 

Gabor Somogyi commented on SPARK-33143:
---

[~mszurap] the OS and network guys are still working on it, but one thing 
seems sure:

It has nothing to do with the RDD size. It's reproducible w/ relatively small 
RDDs.

> Make SocketAuthServer socket timeout configurable
> -
>
> Key: SPARK-33143
> URL: https://issues.apache.org/jira/browse/SPARK-33143
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miklos Szurap
>Priority: Major
>
> In SPARK-21551 the socket timeout for the Pyspark applications has been 
> increased from 3 to 15 seconds. However it is still hardcoded.
> In certain situations even the 15 seconds is not enough, so it should be made 
> configurable. 
> This is requested after seeing it in real-life workload failures.
> Also it has been suggested and requested in an earlier comment in 
> [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498]
> In 
> Spark 2.4 it is under
> [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899]
> in Spark 3.x the code has been moved to
> [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51]
> {code}
> serverSocket.setSoTimeout(15000)
> {code}
> Please include this in both 2.4 and 3.x branches.
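The effect of the hardcoded accept timeout is easy to reproduce outside Spark. This Python sketch (an illustration only, not Spark's actual Scala code) shows a server socket whose accept timeout is a parameter rather than a constant, which is the configurability the issue asks for:

```python
import socket

def accept_with_timeout(timeout_s):
    """Listen on an ephemeral port and wait for one client, up to timeout_s seconds."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        server.settimeout(timeout_s)  # analogous to serverSocket.setSoTimeout(15000)
        conn, _ = server.accept()
        conn.close()
        return "connected"
    except socket.timeout:
        return "timed out"
    finally:
        server.close()

# With no client connecting, a short timeout fails fast instead of hanging:
accept_with_timeout(0.2)  # -> "timed out"
```

When the peer is merely slow (e.g. a Python worker that takes 20 seconds to start), a fixed 15-second value turns that slowness into a hard failure; a configurable value lets operators adapt without patching Spark.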






[jira] [Commented] (SPARK-33143) Make SocketAuthServer socket timeout configurable

2020-11-13 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231270#comment-17231270
 ] 

Gabor Somogyi commented on SPARK-33143:
---

[~hyukjin.kwon] thanks for the confirmation, I've started to craft a PR.

As for the separate thread, I will come back when we've found the root cause.

> Make SocketAuthServer socket timeout configurable
> -
>
> Key: SPARK-33143
> URL: https://issues.apache.org/jira/browse/SPARK-33143
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miklos Szurap
>Priority: Major
>
> In SPARK-21551 the socket timeout for the Pyspark applications has been 
> increased from 3 to 15 seconds. However it is still hardcoded.
> In certain situations even the 15 seconds is not enough, so it should be made 
> configurable. 
> This is requested after seeing it in real-life workload failures.
> Also it has been suggested and requested in an earlier comment in 
> [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498]
> In 
> Spark 2.4 it is under
> [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899]
> in Spark 3.x the code has been moved to
> [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51]
> {code}
> serverSocket.setSoTimeout(15000)
> {code}
> Please include this in both 2.4 and 3.x branches.






[jira] [Commented] (SPARK-33143) Make SocketAuthServer socket timeout configurable

2020-11-13 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231244#comment-17231244
 ] 

Gabor Somogyi commented on SPARK-33143:
---

I've had a look at this issue; under some circumstances, for heavy users, one 
of the following calls took more than 15 seconds, ending up in a timeout:
 * getaddrinfo
 * socket
 * settimeout

More investigation is needed into the root cause of this issue (I'm on it).
 There are 2 suspects:
 * DNS is involved in getaddrinfo and is not responding or is super slow
 * The OS is super slow somehow

Either way, it is super hard to provide a workaround with a hardcoded timeout. 
I tend to believe it would be good to make it configurable; otherwise such 
intermittent issues could make a temporary workaround extremely hard.

[~hyukjin.kwon] WDYT?
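To narrow down which of the suspect calls is the slow one, each operation can be timed individually. A minimal sketch (the 15-second threshold mirrors the hardcoded timeout; the host name is just an example):

```python
import socket
import time

def timed_call(fn):
    """Run fn() and return (elapsed_seconds, result-or-None)."""
    start = time.monotonic()
    try:
        result = fn()
    except OSError:
        result = None  # e.g. a DNS failure inside getaddrinfo
    return time.monotonic() - start, result

# Time a DNS lookup the same way SocketAuthServer's client side would trigger one:
elapsed, _ = timed_call(lambda: socket.getaddrinfo("localhost", 443))
slow = elapsed > 15  # exceeding the hardcoded 15-second timeout
```

Wrapping `socket.socket` and `settimeout` the same way would show whether the OS, rather than DNS, is the bottleneck.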

> Make SocketAuthServer socket timeout configurable
> -
>
> Key: SPARK-33143
> URL: https://issues.apache.org/jira/browse/SPARK-33143
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miklos Szurap
>Priority: Major
>
> In SPARK-21551 the socket timeout for the Pyspark applications has been 
> increased from 3 to 15 seconds. However it is still hardcoded.
> In certain situations even the 15 seconds is not enough, so it should be made 
> configurable. 
> This is requested after seeing it in real-life workload failures.
> Also it has been suggested and requested in an earlier comment in 
> [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498]
> In 
> Spark 2.4 it is under
> [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899]
> in Spark 3.x the code has been moved to
> [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51]
> {code}
> serverSocket.setSoTimeout(15000)
> {code}
> Please include this in both 2.4 and 3.x branches.






[jira] [Updated] (SPARK-33287) Expose state custom metrics information on SS UI

2020-10-29 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-33287:
--
Affects Version/s: (was: 3.1.0)
   3.0.1

> Expose state custom metrics information on SS UI
> 
>
> Key: SPARK-33287
> URL: https://issues.apache.org/jira/browse/SPARK-33287
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Priority: Major
>
> Since not all custom metrics hold useful information, it would be good to 
> add the possibility to exclude them.






[jira] [Commented] (SPARK-33273) Fix Flaky Test: ThriftServerQueryTestSuite. subquery_scalar_subquery_scalar_subquery_select_sql

2020-10-29 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222896#comment-17222896
 ] 

Gabor Somogyi commented on SPARK-33273:
---

I've just faced this too: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130409/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/

> Fix Flaky Test: ThriftServerQueryTestSuite. 
> subquery_scalar_subquery_scalar_subquery_select_sql
> ---
>
> Key: SPARK-33273
> URL: https://issues.apache.org/jira/browse/SPARK-33273
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130369/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/
> {code}
> [info] - subquery/scalar-subquery/scalar-subquery-select.sql *** FAILED *** 
> (3 seconds, 877 milliseconds)
> [info]   Expected "[1]0   2017-05-04 01:01:0...", but got "[]0
> 2017-05-04 01:01:0..." Result did not match for query #3
> [info]   SELECT (SELECT min(t3d) FROM t3) min_t3d,
> [info]  (SELECT max(t2h) FROM t2) max_t2h
> [info]   FROM   t1
> [info]   WHERE  t1a = 'val1c' (ThriftServerQueryTestSuite.scala:197)
> {code}






[jira] [Updated] (SPARK-33287) Expose state custom metrics information on SS UI

2020-10-29 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-33287:
--
Description: Since not all custom metrics hold useful information, it would 
be good to add the possibility to exclude them.

> Expose state custom metrics information on SS UI
> 
>
> Key: SPARK-33287
> URL: https://issues.apache.org/jira/browse/SPARK-33287
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Since not all custom metrics hold useful information, it would be good to 
> add the possibility to exclude them.






[jira] [Created] (SPARK-33287) Expose state custom metrics information on SS UI

2020-10-29 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33287:
-

 Summary: Expose state custom metrics information on SS UI
 Key: SPARK-33287
 URL: https://issues.apache.org/jira/browse/SPARK-33287
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming, Web UI
Affects Versions: 3.1.0
Reporter: Gabor Somogyi









[jira] [Commented] (SPARK-33222) Expose missing information (graphs) on SS UI

2020-10-22 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218974#comment-17218974
 ] 

Gabor Somogyi commented on SPARK-33222:
---

I've started to implement it.

> Expose missing information (graphs) on SS UI
> 
>
> Key: SPARK-33222
> URL: https://issues.apache.org/jira/browse/SPARK-33222
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Priority: Major
>
> There are a couple of things which are not yet shown on the Structured 
> Streaming UI.
> I'm creating subtasks to add them.






[jira] [Created] (SPARK-33224) Expose watermark information on SS UI

2020-10-22 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33224:
-

 Summary: Expose watermark information on SS UI
 Key: SPARK-33224
 URL: https://issues.apache.org/jira/browse/SPARK-33224
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming, Web UI
Affects Versions: 3.0.1
Reporter: Gabor Somogyi









[jira] [Created] (SPARK-33223) Expose state information on SS UI

2020-10-22 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33223:
-

 Summary: Expose state information on SS UI
 Key: SPARK-33223
 URL: https://issues.apache.org/jira/browse/SPARK-33223
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming, Web UI
Affects Versions: 3.0.1
Reporter: Gabor Somogyi









[jira] [Created] (SPARK-33222) Expose missing information (graphs) on SS UI

2020-10-22 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33222:
-

 Summary: Expose missing information (graphs) on SS UI
 Key: SPARK-33222
 URL: https://issues.apache.org/jira/browse/SPARK-33222
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming, Web UI
Affects Versions: 3.0.1
Reporter: Gabor Somogyi


There are a couple of things which are not yet shown on the Structured 
Streaming UI.
I'm creating subtasks to add them.






[jira] [Commented] (SPARK-25547) Pluggable jdbc connection factory

2020-10-13 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213152#comment-17213152
 ] 

Gabor Somogyi commented on SPARK-25547:
---

[~fsauer65] JDBC connection provider API is added here: 
https://github.com/apache/spark/blob/dc697a8b598aea922ee6620d87f3ace2f7947231/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcConnectionProvider.scala#L36
Do you think we can close this jira?


> Pluggable jdbc connection factory
> -
>
> Key: SPARK-25547
> URL: https://issues.apache.org/jira/browse/SPARK-25547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Frank Sauer
>Priority: Major
>
> The ability to provide a custom connectionFactoryProvider via JDBCOptions so 
> that JdbcUtils.createConnectionFactory can produce a custom connection 
> factory would be very useful. In our case we needed to have the ability to 
> load balance connections to an AWS Aurora Postgres cluster by round-robining 
> through the endpoints of the read replicas, since their own load balancing was 
> insufficient. We got away with it by copying most of the Spark JDBC package, 
> providing this feature there, and changing the format from jdbc to our new 
> package. However, it would be nice if this were supported out of the box via 
> a new option in JDBCOptions providing the classname for a 
> ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR 
> which I have ready to go.
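The pattern being requested — a provider that decides whether it can handle a URL and hands out connections, so a custom one can round-robin across replicas — can be sketched generically. This Python sketch only illustrates the round-robin idea from the description; the class and method names are hypothetical and it is not Spark's actual JdbcConnectionProvider API:

```python
class RoundRobinEndpointProvider:
    """Cycle through read-replica endpoints so connections are spread evenly."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.next_index = 0

    def can_handle(self, url):
        # A pluggable factory is asked first whether it supports this URL.
        return url.startswith("jdbc:postgresql:")

    def connect(self, url):
        # A real provider would open a java.sql.Connection to the chosen endpoint;
        # here we just return the endpoint name to show the rotation.
        endpoint = self.endpoints[self.next_index % len(self.endpoints)]
        self.next_index += 1
        return endpoint

provider = RoundRobinEndpointProvider(["replica-1", "replica-2"])
picks = [provider.connect("jdbc:postgresql://cluster/db") for _ in range(3)]
# picks == ["replica-1", "replica-2", "replica-1"]
```

With such a hook exposed through an option, the Aurora use case in the description would not require forking the JDBC package.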






[jira] [Commented] (SPARK-32229) Application entry parsing fails because DriverWrapper registered instead of the normal driver

2020-10-12 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212327#comment-17212327
 ] 

Gabor Somogyi commented on SPARK-32229:
---

Started to work on this.

> Application entry parsing fails because DriverWrapper registered instead of 
> the normal driver
> -
>
> Key: SPARK-32229
> URL: https://issues.apache.org/jira/browse/SPARK-32229
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In some cases DriverWrapper is registered by DriverRegistry, which causes an 
> exception in PostgresConnectionProvider:
> https://github.com/apache/spark/blob/371b35d2e0ab08ebd853147c6673de3adfad0553/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala#L53



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1723#comment-1723
 ] 

Gabor Somogyi commented on SPARK-33102:
---

Filing a PR soon...
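For context, Spark's internal `Utils.stringToSeq` helper splits a comma-separated string, trims each element, and drops empties. An equivalent Python sketch of that behaviour (function name is illustrative):

```python
def string_to_seq(value):
    """Split a comma-separated string, trimming whitespace and dropping empties."""
    return [item.strip() for item in value.split(",") if item.strip()]

string_to_seq("a, b,,c ")  # -> ["a", "b", "c"]
```

Using one shared helper for all list-typed SQL parameters keeps whitespace and empty-entry handling consistent across configs.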

> Use stringToSeq on SQL list typed parameters
> 
>
> Key: SPARK-33102
> URL: https://issues.apache.org/jira/browse/SPARK-33102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Minor
>







[jira] [Created] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33102:
-

 Summary: Use stringToSeq on SQL list typed parameters
 Key: SPARK-33102
 URL: https://issues.apache.org/jira/browse/SPARK-33102
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gabor Somogyi









[jira] [Commented] (SPARK-32047) Add provider disable possibility just like in delegation token provider

2020-10-02 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206208#comment-17206208
 ] 

Gabor Somogyi commented on SPARK-32047:
---

I intend to file a PR next week...

> Add provider disable possibility just like in delegation token provider
> ---
>
> Key: SPARK-32047
> URL: https://issues.apache.org/jira/browse/SPARK-32047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There is an enable flag in the delegation token provider area: 
> "spark.security.credentials.%s.enabled".
> It would be good to add something similar to the JDBC secure connection 
> provider area, because this would make embedded providers interchangeable 
> (an embedded provider can be turned off and another provider w/ a different 
> name can be registered). This makes sense only if we create an API for the 
> secure JDBC connection provider.
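The delegation-token pattern referenced above boils down to a per-provider boolean flag looked up by name, defaulting to enabled. A minimal sketch of that lookup (a plain dict stands in for SparkConf; this mirrors the flag naming from the description, not an existing JDBC-provider API):

```python
def provider_enabled(conf, provider_name, default=True):
    """Check the per-provider enable flag, defaulting to enabled when unset."""
    key = "spark.security.credentials.%s.enabled" % provider_name
    value = conf.get(key)
    return default if value is None else value.strip().lower() == "true"

conf = {"spark.security.credentials.postgres.enabled": "false"}
provider_enabled(conf, "postgres")  # -> False (explicitly disabled)
provider_enabled(conf, "kafka")     # -> True (flag unset, default applies)
```

Disabling an embedded provider this way frees its name so a replacement provider can be registered under it.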






[jira] [Updated] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-09-17 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-28367:
--
Description: 
Spark uses an old and deprecated API named poll(long) which never returns and 
stays in live lock if metadata is not updated (for instance when broker 
disappears at consumer creation).


  was:
Spark uses an old and deprecated API named poll(long) which never returns and 
stays in live lock if metadata is not updated (for instance when broker 
disappears at consumer creation).

I've created a small standalone application to test it and the alternatives: 
https://github.com/gaborgsomogyi/kafka-get-assignment



> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in live lock if metadata is not updated (for instance when broker 
> disappears at consumer creation).
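The difference between the deprecated poll(long) and a Duration-based poll can be illustrated with a bounded wait loop. This Python sketch merely simulates the semantics (it is not the Kafka client API): an unbounded wait hangs forever when metadata never arrives, while a bounded poll returns empty and lets the caller retry or fail:

```python
import time

def poll_bounded(fetch, timeout_s):
    """Keep trying fetch() until data arrives or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        records = fetch()
        if records:
            return records
        time.sleep(0.01)
    return []  # bounded: the caller can retry or fail instead of hanging forever

# A fetch that never yields data simulates a broker that disappeared at
# consumer creation, so metadata is never updated:
poll_bounded(lambda: None, timeout_s=0.1)  # -> []
```

With poll(long) semantics there is no such deadline for the metadata wait, which is exactly the live lock described above.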






[jira] [Commented] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader

2020-09-17 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197617#comment-17197617
 ] 

Gabor Somogyi commented on SPARK-32910:
---

Started to work on this.

> Remove UninterruptibleThread usage from KafkaOffsetReader
> -
>
> Key: SPARK-32910
> URL: https://issues.apache.org/jira/browse/SPARK-32910
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> We've talked about this here: 
> https://github.com/apache/spark/pull/29729#discussion_r488690731
> This jira stands only if the mentioned PR is merged.






[jira] [Created] (SPARK-32910) Remove UninterruptibleThread usage from KafkaOffsetReader

2020-09-17 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-32910:
-

 Summary: Remove UninterruptibleThread usage from KafkaOffsetReader
 Key: SPARK-32910
 URL: https://issues.apache.org/jira/browse/SPARK-32910
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Gabor Somogyi


We've talked about this here: 
https://github.com/apache/spark/pull/29729#discussion_r488690731
This jira stands only if the mentioned PR is merged.






[jira] [Commented] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver

2020-09-11 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194269#comment-17194269
 ] 

Gabor Somogyi commented on SPARK-32032:
---

[~Bartalos] the query is not progressing.
I've just finished preparing the PR and am going to file it soon...

> Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
> --
>
> Key: SPARK-32032
> URL: https://issues.apache.org/jira/browse/SPARK-32032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver

2020-09-08 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191999#comment-17191999
 ] 

Gabor Somogyi commented on SPARK-32032:
---

As a temporary result of the effort I've created a document where I've 
summarised everything: 
https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing

Feel free to comment and help the effort to fix this nasty issue.
[~zsxwing], I'm pretty sure you're interested in the details. This change 
touches key parts of the Kafka connector.


> Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
> --
>
> Key: SPARK-32032
> URL: https://issues.apache.org/jira/browse/SPARK-32032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-32520) Flaky Test: KafkaSourceStressSuite.stress test with multiple topics and partitions

2020-08-03 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170073#comment-17170073
 ] 

Gabor Somogyi commented on SPARK-32520:
---

Commented on https://issues.apache.org/jira/browse/SPARK-32519. If it comes up 
often and Jungtaek can't take a look, then I suggest a rollback; I can pick it 
up 3 weeks later (vacation).

> Flaky Test: KafkaSourceStressSuite.stress test with multiple topics and 
> partitions
> --
>
> Key: SPARK-32520
> URL: https://issues.apache.org/jira/browse/SPARK-32520
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {{KafkaSourceStressSuite.stress test with multiple topics and partitions}} 
> seems flaky in GitHub Actions build:
> https://github.com/apache/spark/pull/29335/checks?check_run_id=940205463
> {code}
> KafkaSourceStressSuite:
> - stress test with multiple topics and partitions *** FAILED *** (2 minutes, 
> 7 seconds)
>   Timed out waiting for stream: The code passed to failAfter did not complete 
> within 30 seconds.
>   java.lang.Thread.getStackTrace(Thread.java:1559)
>   org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:234)
>   org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest.failAfterImpl(KafkaMicroBatchSourceSuite.scala:53)
>   org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
>   org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest.failAfter(KafkaMicroBatchSourceSuite.scala:53)
>   
> org.apache.spark.sql.streaming.StreamTest.$anonfun$testStream$7(StreamTest.scala:471)
>   
> org.apache.spark.sql.streaming.StreamTest.$anonfun$testStream$7$adapted(StreamTest.scala:470)
>   scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
>   Caused by:  null
>   
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2156)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution.awaitOffset(StreamExecution.scala:483)
>   
> org.apache.spark.sql.streaming.StreamTest.$anonfun$testStream$8(StreamTest.scala:472)
>   
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   
> org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
>   
> org.scalatest.concurrent.TimeLimits.failAfterImpl(TimeLimits.scala:239)
>   
> org.scalatest.concurrent.TimeLimits.failAfterImpl$(TimeLimits.scala:233)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest.failAfterImpl(KafkaMicroBatchSourceSuite.scala:53)
>   
> org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:230)
>   
> org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:229)
>   == Progress ==
>  AssertOnQuery(, )
>  AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = empty Range 0 until 0, message = )
>  CheckAnswer:
>  StopStream
>  AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range 0 until 8, message = )
>  
> StartStream(ProcessingTimeTrigger(0),org.apache.spark.util.SystemClock@7dce5824,Map(),null)
>  CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8]
>  CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8]
>  StopStream
>  
> StartStream(ProcessingTimeTrigger(0),org.apache.spark.util.SystemClock@7255955e,Map(),null)
>  AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range 8 until 9, message = Add topic stress7)
>  AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range 9 until 10, message = )
>  AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range 10 until 15, message = Add partition)
>  AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
> stress5, stress3), data = empty Range 15 until 15, message = Add topic 
> stress9)
>  AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
> stress3), data = Range 15 until 16, message = Delete topic stress5)
>  AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
> stress3, stress10), data = Range 16 until 23, message = Add topic stress11)
>  CheckAnswer: 
> [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23]
>  StopStream
>  AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
> 

[jira] [Commented] (SPARK-32519) test of org.apache.spark.sql.kafka010.KafkaSourceStressSuite failed for aarch64

2020-08-03 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169850#comment-17169850
 ] 

Gabor Somogyi commented on SPARK-32519:
---

I'm on a 3-week vacation but had a quick look at it.
This 30 seconds means the timeout is in place:
{code:java}
The code passed to failAfter did not complete within 30 seconds
{code}
Kafka is known to be flaky in general, which is the case here as well.
I've seen this issue before, so I don't think the change itself caused it.
The change may have made it more frequent, but I can hardly believe it is the 
root cause.
As a side note, a timeout is a timeout on all platforms, but if I understand 
correctly this happens only on aarch64, right?
cc [~kabhwan]
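The failAfter guard seen in the stack trace can be sketched generically: run the test body in a worker thread and report a timeout if it does not finish within the limit. This is an illustrative Python sketch of that pattern, not ScalaTest's actual implementation (`fail_after` and `TimeLimitExceeded` are hypothetical names):

```python
import threading

class TimeLimitExceeded(Exception):
    pass

def fail_after(seconds, body):
    # Run body() in a daemon worker thread; if it does not finish
    # within `seconds`, raise a timeout error, like ScalaTest's
    # failAfter does. Otherwise return the body's result or re-raise
    # its own failure.
    result, error = [], []

    def run():
        try:
            result.append(body())
        except Exception as e:  # capture the body's own failure
            error.append(e)

    t = threading.Thread(target=run, daemon=True)
    t.start()
    t.join(seconds)
    if t.is_alive():
        raise TimeLimitExceeded(
            f"The code passed to fail_after did not complete within "
            f"{seconds} seconds")
    if error:
        raise error[0]
    return result[0]
```

With such a guard, a stalled stream query fails the test with a timeout message instead of hanging the whole suite, which is exactly the failure mode reported here.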

> test of org.apache.spark.sql.kafka010.KafkaSourceStressSuite failed for 
> aarch64
> ---
>
> Key: SPARK-32519
> URL: https://issues.apache.org/jira/browse/SPARK-32519
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> The aarch64 maven job failed after the commit 
> https://github.com/apache/spark/commit/813532d10310027fee9e12680792cee2e1c2b7c7
>merged, see the log 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/353/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/
> I ran the test in my aarch64 instance; if I revert the commit 
> 813532d10310027fee9e12680792cee2e1c2b7c7, the test is ok. 






[jira] [Commented] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver

2020-07-29 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167214#comment-17167214
 ] 

Gabor Somogyi commented on SPARK-32032:
---

I'm working on the solution.

> Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
> --
>
> Key: SPARK-32032
> URL: https://issues.apache.org/jira/browse/SPARK-32032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver

2020-07-29 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167213#comment-17167213
 ] 

Gabor Somogyi commented on SPARK-32032:
---

I've renamed the jira because the solution is not simply use a different API.

> Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
> --
>
> Key: SPARK-32032
> URL: https://issues.apache.org/jira/browse/SPARK-32032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Updated] (SPARK-32032) Eliminate deprecated poll(long) API calls to avoid infinite wait in driver

2020-07-29 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-32032:
--
Summary: Eliminate deprecated poll(long) API calls to avoid infinite wait 
in driver  (was: Use new poll API in Kafka connector diver side to avoid 
infinite wait)

> Eliminate deprecated poll(long) API calls to avoid infinite wait in driver
> --
>
> Key: SPARK-32032
> URL: https://issues.apache.org/jira/browse/SPARK-32032
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Created] (SPARK-32482) Eliminate deprecated poll(long) API calls to avoid infinite wait in tests

2020-07-29 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-32482:
-

 Summary: Eliminate deprecated poll(long) API calls to avoid 
infinite wait in tests
 Key: SPARK-32482
 URL: https://issues.apache.org/jira/browse/SPARK-32482
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming, Tests
Affects Versions: 3.1.0
Reporter: Gabor Somogyi









[jira] [Created] (SPARK-32468) Fix timeout config issue in Kafka connector tests

2020-07-28 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-32468:
-

 Summary: Fix timeout config issue in Kafka connector tests
 Key: SPARK-32468
 URL: https://issues.apache.org/jira/browse/SPARK-32468
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming, Tests
Affects Versions: 3.1.0
Reporter: Gabor Somogyi


While implementing SPARK-32032 I found a bug in Kafka: 
https://issues.apache.org/jira/browse/KAFKA-10318. It will cause issues only 
later, once it's fixed upstream, but it would be good to address it now because 
SPARK-32032 would like to bring in AdminClient, where the code blows up with the 
mentioned ConfigException. Addressing it now would reduce the code changes in 
that jira.





