[jira] [Commented] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession

2022-01-14 Thread Denis Krivenko (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476442#comment-17476442 ]

Denis Krivenko commented on SPARK-32165:


The issue is still reproducible on Spark 3.2.0.
[~Ngone51] could you please give us more details on why your PRs were not 
merged and were closed automatically?

I think the Priority could be changed to Critical, since its definition 
("Crashes, loss of data, severe memory leak") applies here: this is exactly 
what happens when running Spark Thrift Server.

> SessionState leaks SparkListener with multiple SparkSession
> ---
>
> Key: SPARK-32165
> URL: https://issues.apache.org/jira/browse/SPARK-32165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Copied from 
> [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
> I'd like to point out that this PR 
> (https://github.com/apache/spark/pull/28128) doesn't fix the memory leak 
> completely. Once {{SessionState}} is touched, it adds two more listeners 
> to the SparkContext, namely {{SQLAppStatusListener}} and 
> {{ExecutionListenerBus}}.
> It can be reproduced easily as
> {code:java}
>   test("SPARK-31354: SparkContext only register one SparkSession 
> ApplicationEnd listener") {
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("test-app-SPARK-31354-1")
> val context = new SparkContext(conf)
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postFirstCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postSecondCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> assert(postFirstCreation == postSecondCreation)
>   }
> {code}
> The assertion above fails because each new {{SessionState}} registers an 
> additional {{SQLAppStatusListener}} and {{ExecutionListenerBus}} on the same 
> SparkContext.






[jira] [Commented] (SPARK-35262) Memory leak when dataset is being persisted

2022-01-10 Thread Denis Krivenko (Jira)


[ https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472248#comment-17472248 ]

Denis Krivenko commented on SPARK-35262:


[~iamelin] Could you please check whether the issue still exists in 3.2.0?

> Memory leak when dataset is being persisted
> ---
>
> Key: SPARK-35262
> URL: https://issues.apache.org/jira/browse/SPARK-35262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Igor Amelin
>Priority: Major
>
> If a Java or Scala application with a SparkSession runs for a long time and 
> persists a lot of datasets, it can crash because of a memory leak.
>  I've noticed the following. When we persist a dataset, the SparkSession used 
> to load that dataset is cloned in CacheManager, and this clone is added as a 
> listener to `listenersPlusTimers` in `ListenerBus`. But this clone is never 
> removed from the list of listeners afterwards, e.g. when the dataset is 
> unpersisted. If we persist a lot of datasets, the SparkSession is cloned and 
> added to `ListenerBus` many times. This leads to a memory leak, since the 
> `listenersPlusTimers` list becomes very large.
> I've found out that the SparkSession is cloned in CacheManager when the 
> parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and 
> `spark.sql.adaptive.enabled` are true. The first one is true by default, and 
> this default behavior leads to the problem. When auto bucketed scan is 
> disabled, the SparkSession isn't cloned, there are no duplicates in 
> ListenerBus, and the memory leak doesn't occur.
> Here is a small Java application to reproduce the memory leak: 
> [https://github.com/iamelin/spark-memory-leak]
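> A condensed sketch of the same pattern in Scala (hypothetical, not taken from 
> the linked repository; note that {{listenerBus}} is private[spark], so the 
> listener counting only compiles from code placed under the org.apache.spark 
> package):
> {code:scala}
> package org.apache.spark // only so the private[spark] listenerBus is visible
> 
> import org.apache.spark.sql.SparkSession
> 
> object PersistLeakSketch {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[2]")
>       .appName("persist-leak-sketch")
>       .getOrCreate()
>     import spark.implicits._
> 
>     for (_ <- 1 to 100) {
>       val ds = Seq(1, 2, 3).toDS()
>       ds.persist()
>       ds.count()     // materializes the cache; CacheManager clones the session
>       ds.unpersist() // the cloned session's listener is not removed here
>     }
> 
>     // With auto bucketed scan enabled (the default), this count keeps growing.
>     println(spark.sparkContext.listenerBus.listeners.size())
>     spark.stop()
>   }
> }
> {code}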






[jira] [Updated] (SPARK-37856) Executor pods keep existing if driver container was restarted

2022-01-10 Thread Denis Krivenko (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-37856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Krivenko updated SPARK-37856:
---
Environment: 
Kubernetes 1.20 | Spark 3.1.2 | Hadoop 3.2.0 | Java 11 | Scala 2.12

Kubernetes 1.20 | Spark 3.2.0 | Hadoop 3.3.1 | Java 11 | Scala 2.12

  was:
* Kubernetes 1.20
 * Spark 3.1.2
 * Hadoop 3.2.0
 * Java 11
 * Scala 2.12

and
 * Kubernetes 1.20
 * Spark 3.2.0
 * Hadoop 3.3.1
 * Java 11
 * Scala 2.12


> Executor pods keep existing if driver container was restarted
> -
>
> Key: SPARK-37856
> URL: https://issues.apache.org/jira/browse/SPARK-37856
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.2, 3.2.0
> Environment: Kubernetes 1.20 | Spark 3.1.2 | Hadoop 3.2.0 | Java 11 | 
> Scala 2.12
> Kubernetes 1.20 | Spark 3.2.0 | Hadoop 3.3.1 | Java 11 | Scala 2.12
>Reporter: Denis Krivenko
>Priority: Minor
>
> I run Spark Thrift Server on a Kubernetes cluster, so the driver pod runs 
> continuously and it creates and manages executor pods. From time to time an 
> OOM issue occurs on the driver pod or on an executor pod.
> When it happens on
>  * an executor - the executor pod is deleted and the driver creates a new 
> executor pod instead. It works as expected.
>  * the driver - Kubernetes restarts the driver container and the driver 
> creates new executor pods. All previous executors stop, but their pods still 
> exist, in *Error* state for Spark 3.1.2 or in *Completed* state for Spark 3.2.0.
> The behavior can be reproduced by restarting a pod container with the command
> {code:java}
> kubectl exec POD_NAME -c CONTAINER_NAME -- /sbin/killall5{code}
> The property _spark.kubernetes.executor.deleteOnTermination_ is set to *true* 
> by default.
> If I delete the driver pod, all executor pods (in any state) are also deleted 
> completely.
> +Pod list+
> {code:java}
> NAME                                           READY   STATUS      RESTARTS   AGE
> spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
> spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
> spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
> spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
> spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
> spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
> spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
> spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
> spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
>  {code}
> +Driver pod+
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-thrift-server-85cf5d689b-vvrwd
>   uid: b69a7c68-a767-4e3b-939c-061347b1c25e
> spec:
>   ...
> status:
>   containerStatuses:
>   - containerID: containerd://7206acf424aa30b6f8533c0e32c99ebfdc5ee80648e76289f6bd2f87460ddcd3
>     image: xxx/spark:3.2.0
>     lastState:
>       terminated:
>         containerID: containerd://fe3cacb8e6470ac37dcd50d525ae3d54c8b6bfef3558325bc22e7b40daab1703
>         exitCode: 143
>         finishedAt: "2022-01-09T16:09:50Z"
>         reason: OOMKilled
>         startedAt: "2022-01-07T00:32:21Z"
>     name: spark-thrift-server
>     ready: true
>     restartCount: 1
>     started: true
>     state:
>       running:
>         startedAt: "2022-01-09T16:09:51Z" {code}
> +Executor pod+
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-thrift-server-1a9aee7e31f36eea-exec-17
>   ownerReferences:
>   - apiVersion: v1
>     controller: true
>     kind: Pod
>     name: spark-thrift-server-85cf5d689b-vvrwd
>     uid: b69a7c68-a767-4e3b-939c-061347b1c25e
> spec:
>   ...
> status:
>   containerStatuses:
>   - containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
>     image: xxx/spark:3.2.0
>     lastState: {}
>     name: spark-kubernetes-executor
>     ready: false
>     restartCount: 0
>     started: false
>     state:
>       terminated:
>         containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
>         exitCode: 0
>         finishedAt: "2022-01-09T16:08:57Z"
>         reason: Completed
>         startedAt: "2022-01-09T01:39:15Z" {code}






[jira] [Created] (SPARK-37856) Executor pods keep existing if driver container was restarted

2022-01-10 Thread Denis Krivenko (Jira)
Denis Krivenko created SPARK-37856:
--

 Summary: Executor pods keep existing if driver container was 
restarted
 Key: SPARK-37856
 URL: https://issues.apache.org/jira/browse/SPARK-37856
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.2.0, 3.1.2
 Environment: * Kubernetes 1.20
 * Spark 3.1.2
 * Hadoop 3.2.0
 * Java 11
 * Scala 2.12

and
 * Kubernetes 1.20
 * Spark 3.2.0
 * Hadoop 3.3.1
 * Java 11
 * Scala 2.12
Reporter: Denis Krivenko


I run Spark Thrift Server on a Kubernetes cluster, so the driver pod runs 
continuously and it creates and manages executor pods. From time to time an 
OOM issue occurs on the driver pod or on an executor pod.

When it happens on
 * an executor - the executor pod is deleted and the driver creates a new 
executor pod instead. It works as expected.
 * the driver - Kubernetes restarts the driver container and the driver 
creates new executor pods. All previous executors stop, but their pods still 
exist, in *Error* state for Spark 3.1.2 or in *Completed* state for Spark 3.2.0.

The behavior can be reproduced by restarting a pod container with the command
{code:java}
kubectl exec POD_NAME -c CONTAINER_NAME -- /sbin/killall5{code}
The property _spark.kubernetes.executor.deleteOnTermination_ is set to *true* 
by default (see the snippet below).
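A hedged sketch of how that property would be set explicitly from code (the 
value shown is simply the default; the property name is taken from the text 
above):
{code:scala}
import org.apache.spark.SparkConf

// Keep the default behavior: delete executor pods once they terminate.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.deleteOnTermination", "true")
{code}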

If I delete the driver pod, all executor pods (in any state) are also deleted 
completely.

+Pod list+
{code:java}
NAME                                           READY   STATUS      RESTARTS   AGE
spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
 {code}
+Driver pod+
{code:java}
apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server-85cf5d689b-vvrwd
  uid: b69a7c68-a767-4e3b-939c-061347b1c25e
spec:
  ...
status:
  containerStatuses:
  - containerID: containerd://7206acf424aa30b6f8533c0e32c99ebfdc5ee80648e76289f6bd2f87460ddcd3
    image: xxx/spark:3.2.0
    lastState:
      terminated:
        containerID: containerd://fe3cacb8e6470ac37dcd50d525ae3d54c8b6bfef3558325bc22e7b40daab1703
        exitCode: 143
        finishedAt: "2022-01-09T16:09:50Z"
        reason: OOMKilled
        startedAt: "2022-01-07T00:32:21Z"
    name: spark-thrift-server
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2022-01-09T16:09:51Z" {code}
+Executor pod+
{code:java}
apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server-1a9aee7e31f36eea-exec-17
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: spark-thrift-server-85cf5d689b-vvrwd
    uid: b69a7c68-a767-4e3b-939c-061347b1c25e
spec:
  ...
status:
  containerStatuses:
  - containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
    image: xxx/spark:3.2.0
    lastState: {}
    name: spark-kubernetes-executor
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
        exitCode: 0
        finishedAt: "2022-01-09T16:08:57Z"
        reason: Completed
        startedAt: "2022-01-09T01:39:15Z" {code}






[jira] [Created] (SPARK-37132) Incorrect Spark 3.2.0 package names with included Hadoop binaries

2021-10-27 Thread Denis Krivenko (Jira)
Denis Krivenko created SPARK-37132:
--

 Summary: Incorrect Spark 3.2.0 package names with included Hadoop 
binaries
 Key: SPARK-37132
 URL: https://issues.apache.org/jira/browse/SPARK-37132
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 3.2.0
Reporter: Denis Krivenko


*Spark 3.2.0+Hadoop* packages contain Hadoop 3.3 binaries, however the file 
names still refer to Hadoop 3.2, i.e. _spark-3.2.0-bin-*hadoop3.2*.tgz_

[https://dlcdn.apache.org/spark/spark-3.2.0/]

[https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz]

[https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2-scala2.13.tgz]

 






[jira] [Commented] (SPARK-36398) Redact sensitive information in Spark Thrift Server log

2021-08-14 Thread Denis Krivenko (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17399169#comment-17399169 ]

Denis Krivenko commented on SPARK-36398:


Should be fixed by [PR#33743|https://github.com/apache/spark/pull/33743]
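A hedged sketch of the gist of such a fix ({{Utils.redact}} and 
{{stringRedactionPattern}} are existing Spark APIs, but the exact shape of the 
change in the PR may differ): redact the statement before it reaches the log:
{code:scala}
import org.apache.spark.util.{Utils => SparkUtils}

  override def runInternal(): Unit = {
    setState(OperationState.PENDING)
    // Apply spark.sql.redaction.string.regex before logging the statement
    val redactedStatement =
      SparkUtils.redact(sqlContext.conf.stringRedactionPattern, statement)
    logInfo(s"Submitting query '$redactedStatement' with $statementId")
{code}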

> Redact sensitive information in Spark Thrift Server log
> ---
>
> Key: SPARK-36398
> URL: https://issues.apache.org/jira/browse/SPARK-36398
> Project: Spark
>  Issue Type: Bug
>  Components: Security, SQL
>Affects Versions: 3.1.2
>Reporter: Denis Krivenko
>Priority: Major
>
> Spark Thrift Server logs the query without redacting sensitive information in 
> [org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.scala|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L188]
> {code:scala}
>   override def runInternal(): Unit = {
>     setState(OperationState.PENDING)
>     logInfo(s"Submitting query '$statement' with $statementId")
> {code}
> Logs
> {code:sh}
> 21/08/03 20:49:46 INFO SparkExecuteStatementOperation: Submitting query 'CREATE OR REPLACE TEMPORARY VIEW test_view
> USING org.apache.spark.sql.jdbc
> OPTIONS (
> url="jdbc:mysql://example.com:3306",
> driver="com.mysql.jdbc.Driver",
> dbtable="example.test",
> user="my_username",
> password="my_password"
> )' with 37e5d2cb-aa96-407e-b589-7cb212324100
> 21/08/03 20:49:46 INFO SparkExecuteStatementOperation: Running query with 37e5d2cb-aa96-407e-b589-7cb212324100
> {code}






[jira] [Updated] (SPARK-36510) Missing spark.redaction.string.regex property

2021-08-13 Thread Denis Krivenko (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-36510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Krivenko updated SPARK-36510:
---
Description: The property *spark.redaction.string.regex* is missing from the 
[Runtime 
Environment|https://spark.apache.org/docs/3.1.2/configuration.html#runtime-environment]
 properties table, but the *spark.sql.redaction.string.regex* description 
refers to it as its default value.

  was: The property *spark.redaction.string.regex* is missing in [Runtime 
Environment|https://spark.apache.org/docs/3.1.2/configuration.html#runtime-environment]
 properties table but referred by spark.sql.redaction.string.regex 
description as its default value

> Missing spark.redaction.string.regex property
> -
>
> Key: SPARK-36510
> URL: https://issues.apache.org/jira/browse/SPARK-36510
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 3.1.2
>Reporter: Denis Krivenko
>Priority: Trivial
>
> The property *spark.redaction.string.regex* is missing from the [Runtime 
> Environment|https://spark.apache.org/docs/3.1.2/configuration.html#runtime-environment]
>  properties table, but the *spark.sql.redaction.string.regex* description 
> refers to it as its default value.






[jira] [Created] (SPARK-36510) Missing spark.redaction.string.regex property

2021-08-13 Thread Denis Krivenko (Jira)
Denis Krivenko created SPARK-36510:
--

 Summary: Missing spark.redaction.string.regex property
 Key: SPARK-36510
 URL: https://issues.apache.org/jira/browse/SPARK-36510
 Project: Spark
  Issue Type: Documentation
  Components: docs
Affects Versions: 3.1.2
Reporter: Denis Krivenko


The property *spark.redaction.string.regex* is missing from the [Runtime 
Environment|https://spark.apache.org/docs/3.1.2/configuration.html#runtime-environment]
 properties table, but the *spark.sql.redaction.string.regex* description 
refers to it as its default value.






[jira] [Updated] (SPARK-36472) Improve SQL syntax for MERGE

2021-08-11 Thread Denis Krivenko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Krivenko updated SPARK-36472:
---
Description: 
Existing SQL syntax for *MERGE* (see Delta Lake examples 
[here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge]
 and 
[here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into])
 could be improved by adding an alternative for {{<merge_condition>}}

*Main assumption*
 In common cases target and source tables have the same column names used in 
{{<merge_condition>}} as merge keys, for example:
{code:sql}
ON target.key1 = source.key1 AND target.key2 = source.key2{code}
It would be more convenient to use a syntax similar to:
{code:sql}
ON COLUMNS (key1, key2)
-- or
ON MATCHING (key1, key2)
{code}
The same approach is used for 
[JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html] 
where {{join_criteria}} syntax is
{code:sql}
ON boolean_expression | USING ( column_name [ , ... ] )
{code}
*Improvement proposal*
 Syntax
{code:sql}
MERGE INTO target_table_identifier [AS target_alias]
USING source_table_identifier [<time_travel_version>] [AS source_alias]
ON { <merge_condition> | COLUMNS ( column_name [ , ... ] ) }
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
{code}
Example
{code:sql}
MERGE INTO target
USING source
ON COLUMNS (key1, key2)
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
{code}

  was:
Existing SQL syntax for *MERGE* (see Delta Lake examples 
[here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge]
 and 
[here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into])
 could be improved by adding an alternative for {{<merge_condition>}}

*Main assumption*
 In common cases target and source tables have the same column names used in 
{{<merge_condition>}} as merge keys, for example:
{code:sql}
ON target.key1 = source.key1 AND target.key2 = source.key2{code}
It would be more convenient to use a syntax similar to:
{code:sql}
ON COLUMNS (key1, key2)
{code}
The same approach is used for 
[JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html] 
where {{join_criteria}} syntax is
{code:sql}
ON boolean_expression | USING ( column_name [ , ... ] )
{code}
*Improvement proposal*
 Syntax
{code:sql}
MERGE INTO target_table_identifier [AS target_alias]
USING source_table_identifier [<time_travel_version>] [AS source_alias]
ON { <merge_condition> | COLUMNS ( column_name [ , ... ] ) }
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
{code}
Example
{code:sql}
MERGE INTO target
USING source
ON COLUMNS (key1, key2)
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
{code}


> Improve SQL syntax for MERGE
> 
>
> Key: SPARK-36472
> URL: https://issues.apache.org/jira/browse/SPARK-36472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Denis Krivenko
>Priority: Trivial
>
> Existing SQL syntax for *MERGE* (see Delta Lake examples 
> [here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge]
>  and 
> [here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into])
>  could be improved by adding an alternative for {{<merge_condition>}}
> *Main assumption*
>  In common cases target and source tables have the same column names used in 
> {{<merge_condition>}} as merge keys, for example:
> {code:sql}
> ON target.key1 = source.key1 AND target.key2 = source.key2{code}
> It would be more convenient to use a syntax similar to:
> {code:sql}
> ON COLUMNS (key1, key2)
> -- or
> ON MATCHING (key1, key2)
> {code}
> The same approach is used for 
> [JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html]
>  where {{join_criteria}} syntax is
> {code:sql}
> ON boolean_expression | USING ( column_name [ , ... ] )
> {code}
> *Improvement proposal*
>  Syntax
> {code:sql}
> MERGE INTO target_table_identifier [AS target_alias]
> USING source_table_identifier [<time_travel_version>] [AS source_alias]
> ON { <merge_condition> | COLUMNS ( column_name [ , ... ] ) }
> [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
> [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
> [ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
> {code}
> Example
> {code:sql}
> MERGE INTO target
> USING source
> ON COLUMNS (key1, key2)
> WHEN MATCHED THEN
> UPDATE SET *
> WHEN NOT MATCHED THEN
> INSERT *
> {code}






[jira] [Created] (SPARK-36472) Improve SQL syntax for MERGE

2021-08-10 Thread Denis Krivenko (Jira)
Denis Krivenko created SPARK-36472:
--

 Summary: Improve SQL syntax for MERGE
 Key: SPARK-36472
 URL: https://issues.apache.org/jira/browse/SPARK-36472
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: Denis Krivenko


Existing SQL syntax for *MERGE* (see Delta Lake examples 
[here|https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge]
 and 
[here|https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into])
 could be improved by adding an alternative for {{<merge_condition>}}

*Main assumption*
 In common cases target and source tables have the same column names used in 
{{<merge_condition>}} as merge keys, for example:
{code:sql}
ON target.key1 = source.key1 AND target.key2 = source.key2{code}
It would be more convenient to use a syntax similar to:
{code:sql}
ON COLUMNS (key1, key2)
{code}
The same approach is used for 
[JOIN|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html] 
where {{join_criteria}} syntax is
{code:sql}
ON boolean_expression | USING ( column_name [ , ... ] )
{code}
*Improvement proposal*
 Syntax
{code:sql}
MERGE INTO target_table_identifier [AS target_alias]
USING source_table_identifier [<time_travel_version>] [AS source_alias]
ON { <merge_condition> | COLUMNS ( column_name [ , ... ] ) }
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
{code}
Example
{code:sql}
MERGE INTO target
USING source
ON COLUMNS (key1, key2)
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
{code}






[jira] [Updated] (SPARK-36400) Redact sensitive information in Spark Thrift Server UI

2021-08-03 Thread Denis Krivenko (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-36400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Krivenko updated SPARK-36400:
---
Attachment: SQL Statistics.png

> Redact sensitive information in Spark Thrift Server UI
> --
>
> Key: SPARK-36400
> URL: https://issues.apache.org/jira/browse/SPARK-36400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 3.1.2
>Reporter: Denis Krivenko
>Priority: Major
> Attachments: SQL Statistics.png
>
>
> Spark UI displays sensitive information on the "JDBC/ODBC Server" tab.
> The cause of the issue is in the 
> [org.apache.spark.sql.hive.thriftserver.ui.SqlStatsPagedTable|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L166]
>  class, 
> [here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L266-L268]:
> {code:scala}
>   <td>
>     <span class="description-input">
>       {info.statement}
>     </span>
>   </td>
> {code}






[jira] [Updated] (SPARK-36400) Redact sensitive information in Spark Thrift Server UI

2021-08-03 Thread Denis Krivenko (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-36400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Krivenko updated SPARK-36400:
---
Description: 
Spark UI displays sensitive information on the "JDBC/ODBC Server" tab.

The cause of the issue is in the 
[org.apache.spark.sql.hive.thriftserver.ui.SqlStatsPagedTable|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L166]
 class, 
[here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L266-L268]:
{code:scala}
  <td>
    <span class="description-input">
      {info.statement}
    </span>
  </td>
{code}

  was:
Spark UI displays sensitive information on "JDBC/ODBC Server" tab

!image-2021-08-04-01-02-27-593.png|width=594,height=272!

The reason of the issue is in 
[org.apache.spark.sql.hive.thriftserver.ui.SqlStatsPagedTable|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L166]
 class 
[here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L266-L268]
{code:scala}
  <td>
    <span class="description-input">
      {info.statement}
    </span>
  </td>
{code}


> Redact sensitive information in Spark Thrift Server UI
> --
>
> Key: SPARK-36400
> URL: https://issues.apache.org/jira/browse/SPARK-36400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 3.1.2
>Reporter: Denis Krivenko
>Priority: Major
> Attachments: SQL Statistics.png
>
>
> Spark UI displays sensitive information on the "JDBC/ODBC Server" tab.
> The cause of the issue is in the 
> [org.apache.spark.sql.hive.thriftserver.ui.SqlStatsPagedTable|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L166]
>  class, 
> [here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L266-L268]:
> {code:scala}
>   <td>
>     <span class="description-input">
>       {info.statement}
>     </span>
>   </td>
> {code}






[jira] [Created] (SPARK-36400) Redact sensitive information in Spark Thrift Server UI

2021-08-03 Thread Denis Krivenko (Jira)
Denis Krivenko created SPARK-36400:
--

 Summary: Redact sensitive information in Spark Thrift Server UI
 Key: SPARK-36400
 URL: https://issues.apache.org/jira/browse/SPARK-36400
 Project: Spark
  Issue Type: Bug
  Components: SQL, Web UI
Affects Versions: 3.1.2
Reporter: Denis Krivenko
 Attachments: SQL Statistics.png

Spark UI displays sensitive information on the "JDBC/ODBC Server" tab.

!image-2021-08-04-01-02-27-593.png|width=594,height=272!

The cause of the issue is in the 
[org.apache.spark.sql.hive.thriftserver.ui.SqlStatsPagedTable|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L166]
 class, 
[here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerPage.scala#L266-L268]:
{code:scala}
  <td>
    <span class="description-input">
      {info.statement}
    </span>
  </td>
{code}
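A hedged sketch of one possible fix ({{Utils.redact}} and 
{{SQLConf.get.stringRedactionPattern}} are existing Spark APIs, but this exact 
change is an illustration, not a merged patch): redact the statement with the 
configured pattern before rendering it:
{code:scala}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.util.{Utils => SparkUtils}

  <td>
    <span class="description-input">
      {SparkUtils.redact(SQLConf.get.stringRedactionPattern, info.statement)}
    </span>
  </td>
{code}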






[jira] [Created] (SPARK-36398) Redact sensitive information in Spark Thrift Server log

2021-08-03 Thread Denis Krivenko (Jira)
Denis Krivenko created SPARK-36398:
--

 Summary: Redact sensitive information in Spark Thrift Server log
 Key: SPARK-36398
 URL: https://issues.apache.org/jira/browse/SPARK-36398
 Project: Spark
  Issue Type: Bug
  Components: Security, SQL
Affects Versions: 3.1.2
Reporter: Denis Krivenko


Spark Thrift Server logs the query without redacting sensitive information in 
[org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.scala|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L188]
{code:scala}
  override def runInternal(): Unit = {
setState(OperationState.PENDING)
logInfo(s"Submitting query '$statement' with $statementId")
{code}
Logs
{code:sh}
21/08/03 20:49:46 INFO SparkExecuteStatementOperation: Submitting query 'CREATE OR REPLACE TEMPORARY VIEW test_view
USING org.apache.spark.sql.jdbc
OPTIONS (
url="jdbc:mysql://example.com:3306",
driver="com.mysql.jdbc.Driver",
dbtable="example.test",
user="my_username",
password="my_password"
)' with 37e5d2cb-aa96-407e-b589-7cb212324100
21/08/03 20:49:46 INFO SparkExecuteStatementOperation: Running query with 37e5d2cb-aa96-407e-b589-7cb212324100
{code}


