[jira] [Updated] (SPARK-31440) Improve SQL Rest API

2020-05-01 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31440:
---
Attachment: improved_version_May1th.json

> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: current_version.json, improved_version.json, 
> improved_version_May1th.json
>
>
> SQL REST API exposes query execution metrics as a public API. This Jira aims to 
> apply the following improvements to the SQL REST API, aligning it with the Spark UI.
> *Proposed Improvements:*
> 1- Support physical operations and group metrics per physical operation, aligning 
> with the Spark UI.
> 2- Support *wholeStageCodegenId* for physical operations.
> 3- Expose *nodeId*, which is useful for grouping metrics and for sorting physical 
> operations by execution order, so that operators used multiple times in the same 
> query execution (and their metrics) can be differentiated.
> 4- Filter out *empty* metrics, aligning with the Spark UI SQL tab, which does not 
> show empty metrics.
> 5- Remove line breaks (*\n*) from *metricValue*.
> 6- Make *planDescription* an *optional* HTTP parameter to avoid network cost, 
> especially for complex jobs that produce large plans (a usage sketch follows the 
> endpoint example below).
> 7- Expose the *metrics* attribute last, as *metricDetails*. This is especially 
> useful when the *metricDetails* array is large.
> 8- Reverse the order of *metricDetails* to match the Spark UI, i.e. the physical 
> operators' execution order.
> *Attachments:*
>  Please find both the *current* and *improved* versions of the results attached 
> for the following SQL REST endpoint:
> {code:java}
> curl -X GET 
> http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
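> A minimal sketch of how the proposed *optional* *planDescription* parameter (item 6 
> above) could be exercised once added. The parameter name and value below are part of 
> the proposal, not an existing API, and the application/execution ids are placeholders:
> {code:scala}
> // Fetch the SQL execution summary without the (potentially huge) plan description.
> val appId = "app-20200501000000-0000"          // placeholder
> val executionId = 1                            // placeholder
> val url = s"http://localhost:4040/api/v1/applications/$appId/sql/$executionId" +
>   "?details=true&planDescription=false"        // proposed optional parameter
> println(scala.io.Source.fromURL(url).mkString)
> {code}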
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31590) The filter used by Metadata-only queries should filter out all the unevaluable expr

2020-05-01 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-31590:
---
Summary: The filter used by Metadata-only queries should filter out all the 
unevaluable expr  (was: The filter used by Metadata-only queries should not 
have Unevaluable)

> The filter used by Metadata-only queries should filter out all the 
> unevaluable expr
> ---
>
> Key: SPARK-31590
> URL: https://issues.apache.org/jira/browse/SPARK-31590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Priority: Trivial
>
> With the metadata-only optimization introduced in SPARK-23877 enabled, some SQL 
> queries fail.
> Reproduction code:
> {code:scala}
> sql("set spark.sql.optimizer.metadataOnly=true")
> sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
> PARTITIONED BY (d ,h)")
> sql("""
> |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h)
> |SELECT 1,'2020-01-01','23'
> |UNION ALL
> |SELECT 2,'2020-01-02','01'
> |UNION ALL
> |SELECT 3,'2020-01-02','02'
> """.stripMargin)
> sql(
>   s"""
>  |SELECT d, MAX(h) AS h
>  |FROM test_tbl
>  |WHERE d= (
>  |  SELECT MAX(d) AS d
>  |  FROM test_tbl
>  |)
>  |GROUP BY d
> """.stripMargin).collect()
> {code}
> Exception:
> {code:java}
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#48 []
> ...
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180)
> {code}
> optimizedPlan:
> {code:java}
> Aggregate [d#245], [d#245, max(h#246) AS h#243]
> +- Project [d#245, h#246]
>+- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 []))
>   :  +- Aggregate [max(d#245) AS d#241]
>   : +- LocalRelation , [d#245]
>   +- Relation[a#244,d#245,h#246] parquet
> {code}
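> Until this is fixed, a minimal workaround sketch (same session as the reproduction 
> above) is to disable the metadata-only optimization so partition pruning does not 
> try to evaluate the scalar subquery:
> {code:scala}
> // Workaround: turn the optimization off for this session.
> sql("set spark.sql.optimizer.metadataOnly=false")
> {code}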



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31625) Unregister application from YARN resource manager outside the shutdown hook

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097793#comment-17097793
 ] 

Apache Spark commented on SPARK-31625:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/28435

> Unregister application from YARN resource manager outside the shutdown hook
> ---
>
> Key: SPARK-31625
> URL: https://issues.apache.org/jira/browse/SPARK-31625
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> Currently, an application is unregistered from YARN resource manager as a 
> shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
> timeouts, etc.), the application is not unregistered, resulting in YARN 
> resubmitting the application even if it succeeded.
> For example, you could see the following on the driver log:
> {code:java}
> 20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
> 20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> {code}
> On the YARN RM side:
> {code:java}
> 2020-04-30 06:21:25,083 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1588227360159_0001_01_01 Container Transitioned from RUNNING to 
> COMPLETED
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Updating application attempt appattempt_1588227360159_0001_01 with final 
> state: FAILED, and exit status: 0
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1588227360159_0001_01 State change from RUNNING to 
> FINAL_SAVING on event = CONTAINER_FINISHED
> {code}
> You see that the final state of the application becomes FAILED since the 
> container is finished before the application is unregistered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31625) Unregister application from YARN resource manager outside the shutdown hook

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31625:


Assignee: Apache Spark

> Unregister application from YARN resource manager outside the shutdown hook
> ---
>
> Key: SPARK-31625
> URL: https://issues.apache.org/jira/browse/SPARK-31625
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> Currently, an application is unregistered from YARN resource manager as a 
> shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
> timeouts, etc.), the application is not unregistered, resulting in YARN 
> resubmitting the application even if it succeeded.
> For example, you could see the following on the driver log:
> {code:java}
> 20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
> 20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> {code}
> On the YARN RM side:
> {code:java}
> 2020-04-30 06:21:25,083 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1588227360159_0001_01_01 Container Transitioned from RUNNING to 
> COMPLETED
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Updating application attempt appattempt_1588227360159_0001_01 with final 
> state: FAILED, and exit status: 0
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1588227360159_0001_01 State change from RUNNING to 
> FINAL_SAVING on event = CONTAINER_FINISHED
> {code}
> You see that the final state of the application becomes FAILED since the 
> container is finished before the application is unregistered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31625) Unregister application from YARN resource manager outside the shutdown hook

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31625:


Assignee: (was: Apache Spark)

> Unregister application from YARN resource manager outside the shutdown hook
> ---
>
> Key: SPARK-31625
> URL: https://issues.apache.org/jira/browse/SPARK-31625
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> Currently, an application is unregistered from YARN resource manager as a 
> shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
> timeouts, etc.), the application is not unregistered, resulting in YARN 
> resubmitting the application even if it succeeded.
> For example, you could see the following on the driver log:
> {code:java}
> 20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
> 20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> {code}
> On the YARN RM side:
> {code:java}
> 2020-04-30 06:21:25,083 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1588227360159_0001_01_01 Container Transitioned from RUNNING to 
> COMPLETED
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Updating application attempt appattempt_1588227360159_0001_01 with final 
> state: FAILED, and exit status: 0
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1588227360159_0001_01 State change from RUNNING to 
> FINAL_SAVING on event = CONTAINER_FINISHED
> {code}
> You see that the final state of the application becomes FAILED since the 
> container is finished before the application is unregistered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31625) Unregister application from YARN resource manager outside the shutdown hook

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097792#comment-17097792
 ] 

Apache Spark commented on SPARK-31625:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/28435

> Unregister application from YARN resource manager outside the shutdown hook
> ---
>
> Key: SPARK-31625
> URL: https://issues.apache.org/jira/browse/SPARK-31625
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> Currently, an application is unregistered from YARN resource manager as a 
> shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
> timeouts, etc.), the application is not unregistered, resulting in YARN 
> resubmitting the application even if it succeeded.
> For example, you could see the following on the driver log:
> {code:java}
> 20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
> 20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> {code}
> On the YARN RM side:
> {code:java}
> 2020-04-30 06:21:25,083 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1588227360159_0001_01_01 Container Transitioned from RUNNING to 
> COMPLETED
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Updating application attempt appattempt_1588227360159_0001_01 with final 
> state: FAILED, and exit status: 0
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1588227360159_0001_01 State change from RUNNING to 
> FINAL_SAVING on event = CONTAINER_FINISHED
> {code}
> You see that the final state of the application becomes FAILED since the 
> container is finished before the application is unregistered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31625) Unregister application from YARN resource manager outside the shutdown hook

2020-05-01 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-31625:
--
Description: 
Currently, an application is unregistered from YARN resource manager as a 
shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
timeouts, etc.), the application is not unregistered, resulting in YARN 
resubmitting the application even if it succeeded.

For example, you could see the following on the driver log:
{code:java}
20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at 
org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
{code}
On the YARN RM side:
{code:java}
2020-04-30 06:21:25,083 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1588227360159_0001_01_01 Container Transitioned from RUNNING to 
COMPLETED
2020-04-30 06:21:25,085 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Updating application attempt appattempt_1588227360159_0001_01 with final 
state: FAILED, and exit status: 0
2020-04-30 06:21:25,085 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1588227360159_0001_01 State change from RUNNING to FINAL_SAVING 
on event = CONTAINER_FINISHED
{code}
You see that the final state of the application becomes FAILED since the 
container is finished before the application is unregistered.

  was:
Currently, an application is unregistered from YARN resource manager as a 
shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
timeouts, etc.), the application is not unregistered, resulting in YARN 
resubmitting the application even if it succeeded.

For example, you could see the following on the driver log:
{code:java}
20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at 
org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
{code}
On the YARN RM side:
{code:java}
2020-04-30 06:21:25,083 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1588227360159_0001_01_01 Container Transitioned from RUNNING to 
COMPLETED
2020-04-30 06:21:25,085 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Updating application attempt appattempt_1588227360159_0001_01 with final 
state: FAILED, and exit status: 0
2020-04-30 06:21:25,085 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1588227360159_0001_01 State change from RUNNING to FINAL_SAVING 
on event = CONTAINER_FINISHED
{code}
You see the final state of the application becomes FAILED since container is 
finished before the application is unregistered.


> Unregister application from YARN resource manager outside the shutdown hook
> ---
>
> Key: SPARK-31625
> URL: https://issues.apache.org/jira/browse/SPARK-31625
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> Currently, an application is unregistered from YARN resource manager as a 
> shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
> timeouts, etc.), the application is not unregistered, resulting in YARN 
> resubmitting the application even if it succeeded.
> For example, you could see the following on the driver log:
> {code:java}
> 20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
> 20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apa

[jira] [Created] (SPARK-31625) Unregister application from YARN resource manager outside the shutdown hook

2020-05-01 Thread Terry Kim (Jira)
Terry Kim created SPARK-31625:
-

 Summary: Unregister application from YARN resource manager outside 
the shutdown hook
 Key: SPARK-31625
 URL: https://issues.apache.org/jira/browse/SPARK-31625
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.1.0
Reporter: Terry Kim


Currently, an application is unregistered from YARN resource manager as a 
shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
timeouts, etc.), the application is not unregistered, resulting in YARN 
resubmitting the application even if it succeeded.

For example, you could see the following on the driver log:
{code:java}
20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at 
org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
{code}
On the YARN RM side:
{code:java}
2020-04-30 06:21:25,083 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1588227360159_0001_01_01 Container Transitioned from RUNNING to 
COMPLETED
2020-04-30 06:21:25,085 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Updating application attempt appattempt_1588227360159_0001_01 with final 
state: FAILED, and exit status: 0
2020-04-30 06:21:25,085 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1588227360159_0001_01 State change from RUNNING to FINAL_SAVING 
on event = CONTAINER_FINISHED
{code}
You can see that the final state of the application becomes FAILED since the 
container finishes before the application is unregistered.
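For illustration only, a minimal, self-contained sketch of the idea using Hadoop's 
public AMRMClient API (this is not Spark's ApplicationMaster code; it just shows 
unregistering from the RM in the normal control flow rather than in a JVM shutdown 
hook):
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

object UnregisterOutsideShutdownHook {
  def main(args: Array[String]): Unit = {
    val amrmClient = AMRMClient.createAMRMClient[ContainerRequest]()
    amrmClient.init(new Configuration())
    amrmClient.start()
    amrmClient.registerApplicationMaster("localhost", 0, "")
    try {
      // ... run the user application here ...
    } finally {
      // Unregister as part of normal completion, not from a shutdown hook, so a
      // slow or timed-out hook cannot leave the attempt in FAILED state.
      amrmClient.unregisterApplicationMaster(
        FinalApplicationStatus.SUCCEEDED, "application finished", "")
      amrmClient.stop()
    }
  }
}
{code}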



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31624:


Assignee: Burak Yavuz  (was: Apache Spark)

> SHOW TBLPROPERTIES doesn't handle Session Catalog correctly
> ---
>
> Key: SPARK-31624
> URL: https://issues.apache.org/jira/browse/SPARK-31624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session 
> catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31624:


Assignee: Apache Spark  (was: Burak Yavuz)

> SHOW TBLPROPERTIES doesn't handle Session Catalog correctly
> ---
>
> Key: SPARK-31624
> URL: https://issues.apache.org/jira/browse/SPARK-31624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>Priority: Major
>
> SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session 
> catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097776#comment-17097776
 ] 

Apache Spark commented on SPARK-31624:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/28434

> SHOW TBLPROPERTIES doesn't handle Session Catalog correctly
> ---
>
> Key: SPARK-31624
> URL: https://issues.apache.org/jira/browse/SPARK-31624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session 
> catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31235) Separates different categories of applications

2020-05-01 Thread wangzhun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhun updated SPARK-31235:
-
Fix Version/s: 3.0.0

> Separates different categories of applications
> --
>
> Key: SPARK-31235
> URL: https://issues.apache.org/jira/browse/SPARK-31235
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: wangzhun
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, applications default to the SPARK application type. 
> In fact, different types of applications have different characteristics and are 
> suited to different scenarios, for example SPARK-SQL and SPARK-STREAMING.
> I recommend distinguishing them via the parameter `spark.yarn.applicationType`, 
> so that we can more easily manage and maintain the different types of 
> applications.
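> For illustration, a hypothetical submission-side sketch of how the proposed 
> parameter could be set (assuming it is read by the YARN client at submit time; 
> the application type string is just an example value):
> {code:scala}
> import org.apache.spark.SparkConf
> 
> // Proposed parameter from this ticket; the value is an illustrative category name.
> val conf = new SparkConf()
>   .setAppName("nightly-etl")
>   .set("spark.yarn.applicationType", "SPARK-SQL")
> {code}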



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly

2020-05-01 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-31624:
---

Assignee: Burak Yavuz

> SHOW TBLPROPERTIES doesn't handle Session Catalog correctly
> ---
>
> Key: SPARK-31624
> URL: https://issues.apache.org/jira/browse/SPARK-31624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session 
> catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly

2020-05-01 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-31624:
---

 Summary: SHOW TBLPROPERTIES doesn't handle Session Catalog 
correctly
 Key: SPARK-31624
 URL: https://issues.apache.org/jira/browse/SPARK-31624
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session 
catalog.
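For reference, a minimal illustration of the command in question (the table name and 
property are placeholders; by itself this does not exercise the V2 session-catalog 
code path that is broken here):
{code:scala}
spark.sql("CREATE TABLE t (id INT) USING parquet TBLPROPERTIES ('owner' = 'test')")
spark.sql("SHOW TBLPROPERTIES t").show(truncate = false)
{code}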



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097750#comment-17097750
 ] 

Apache Spark commented on SPARK-31030:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/28433

> Backward Compatibility for Parsing and Formatting Datetime
> --
>
> Key: SPARK-31030
> URL: https://issues.apache.org/jira/browse/SPARK-31030
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2020-03-04-10-54-05-208.png, 
> image-2020-03-04-10-54-13-238.png
>
>
> *Background*
> In Spark 2.4 and earlier, datetime parsing, formatting and conversion are 
> performed using the hybrid calendar ([Julian + 
> Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
> Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as 
> well as the one chosen by the ANSI SQL standard, Spark 3.0 switches to it by 
> using the Java 8 API classes (the java.time packages, which are based on [ISO 
> chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]).
> The switch was completed in SPARK-26651.
>  
> *Problem*
> Switching to the Java 8 datetime API breaks backward compatibility with Spark 
> 2.4 and earlier when parsing datetimes. Spark needs its own pattern definitions 
> for datetime parsing and formatting.
>  
> *Solution*
> To avoid unexpected result changes after the underlying datetime API switch, 
> we propose the following solution. 
>  * Introduce a fallback mechanism: when the Java 8-based parser fails, detect 
> the behavior difference by falling back to the legacy parser, and fail with a 
> user-friendly error message telling users what changed and how to fix the 
> pattern.
>  * Document Spark’s datetime patterns: Spark’s date-time formatter is decoupled 
> from the Java patterns. Spark’s patterns are mainly based on [Java 7’s 
> patterns|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
>  (for better backward compatibility), with customized logic to handle the 
> breaking changes between the [Java 
> 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] 
> and [Java 
> 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]
>  pattern strings. Below are the customized rules:
> ||Pattern||Java 7||Java 8||Example||Rule||
> |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (unlike y, u accepts a 
> negative value to represent BC, while y must be combined with G to do the 
> same)|!image-2020-03-04-10-54-05-208.png!|Substitute ‘u’ with ‘e’ and use the 
> Java 8 parser to parse the string. If it parses, return the result; otherwise, 
> fall back to ‘u’ and parse with the legacy Java 7 parser. If that succeeds, 
> throw an exception asking users to change the pattern string or turn on the 
> legacy mode; otherwise, return NULL as Spark 2.4 does.|
> |z|General time zone, which also accepts [RFC 822 time 
> zones|#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard 
> Time; PST|!image-2020-03-04-10-54-13-238.png!|The semantics of ‘z’ differ 
> between Java 7 and Java 8; Spark 3.0 follows the Java 8 semantics. Parse with 
> the Java 8 parser. If it parses, return the result; otherwise, parse with the 
> legacy Java 7 parser. If that succeeds, throw an exception asking users to 
> change the pattern string or turn on the legacy mode; otherwise, return NULL as 
> Spark 2.4 does.|
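> As a small, concrete illustration of the "legacy mode" referred to above (assuming 
> a Spark 3.0 session; spark.sql.legacy.timeParserPolicy is the switch Spark 3.0 
> exposes for it, and the literal/pattern below are arbitrary examples):
> {code:scala}
> // Use the Java 8 (proleptic Gregorian) parser and fail loudly on incompatibilities.
> spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
> spark.sql("select to_timestamp('2020-05-01 12:00:00', 'yyyy-MM-dd HH:mm:ss')").show()
> 
> // Fall back to the pre-3.0 (SimpleDateFormat-based) behavior.
> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
> spark.sql("select to_timestamp('2020-05-01 12:00:00', 'yyyy-MM-dd HH:mm:ss')").show()
> {code}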
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097749#comment-17097749
 ] 

Apache Spark commented on SPARK-31030:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/28433

> Backward Compatibility for Parsing and Formatting Datetime
> --
>
> Key: SPARK-31030
> URL: https://issues.apache.org/jira/browse/SPARK-31030
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2020-03-04-10-54-05-208.png, 
> image-2020-03-04-10-54-13-238.png
>
>
> *Background*
> In Spark 2.4 and earlier, datetime parsing, formatting and conversion are 
> performed using the hybrid calendar ([Julian + 
> Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
> Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as 
> well as the one chosen by the ANSI SQL standard, Spark 3.0 switches to it by 
> using the Java 8 API classes (the java.time packages, which are based on [ISO 
> chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]).
> The switch was completed in SPARK-26651.
>  
> *Problem*
> Switching to the Java 8 datetime API breaks backward compatibility with Spark 
> 2.4 and earlier when parsing datetimes. Spark needs its own pattern definitions 
> for datetime parsing and formatting.
>  
> *Solution*
> To avoid unexpected result changes after the underlying datetime API switch, 
> we propose the following solution. 
>  * Introduce a fallback mechanism: when the Java 8-based parser fails, detect 
> the behavior difference by falling back to the legacy parser, and fail with a 
> user-friendly error message telling users what changed and how to fix the 
> pattern.
>  * Document Spark’s datetime patterns: Spark’s date-time formatter is decoupled 
> from the Java patterns. Spark’s patterns are mainly based on [Java 7’s 
> patterns|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
>  (for better backward compatibility), with customized logic to handle the 
> breaking changes between the [Java 
> 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] 
> and [Java 
> 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]
>  pattern strings. Below are the customized rules:
> ||Pattern||Java 7||Java 8||Example||Rule||
> |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (unlike y, u accepts a 
> negative value to represent BC, while y must be combined with G to do the 
> same)|!image-2020-03-04-10-54-05-208.png!|Substitute ‘u’ with ‘e’ and use the 
> Java 8 parser to parse the string. If it parses, return the result; otherwise, 
> fall back to ‘u’ and parse with the legacy Java 7 parser. If that succeeds, 
> throw an exception asking users to change the pattern string or turn on the 
> legacy mode; otherwise, return NULL as Spark 2.4 does.|
> |z|General time zone, which also accepts [RFC 822 time 
> zones|#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard 
> Time; PST|!image-2020-03-04-10-54-13-238.png!|The semantics of ‘z’ differ 
> between Java 7 and Java 8; Spark 3.0 follows the Java 8 semantics. Parse with 
> the Java 8 parser. If it parses, return the result; otherwise, parse with the 
> legacy Java 7 parser. If that succeeds, throw an exception asking users to 
> change the pattern string or turn on the legacy mode; otherwise, return NULL as 
> Spark 2.4 does.|
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21117:


Assignee: (was: Apache Spark)

> Built-in SQL Function Support - WIDTH_BUCKET
> 
>
> Key: SPARK-21117
> URL: https://issues.apache.org/jira/browse/SPARK-21117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For a given expression, the {{WIDTH_BUCKET}} function returns the bucket 
> number into which the value of this expression would fall after being 
> evaluated.
> {code:sql}
> WIDTH_BUCKET (expr , min_value , max_value , num_buckets)
> {code}
> Ref: 
> https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717
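> Since the function does not exist in Spark yet (that is what this ticket tracks), 
> here is a plain-Scala sketch of the usual WIDTH_BUCKET semantics, not a Spark API:
> {code:scala}
> // Return the 1-based bucket that `expr` falls into when [min, max) is divided into
> // `numBuckets` equal-width buckets; out-of-range values map to 0 or numBuckets + 1.
> def widthBucket(expr: Double, minValue: Double, maxValue: Double, numBuckets: Int): Long = {
>   require(numBuckets > 0 && maxValue > minValue)
>   if (expr < minValue) 0L
>   else if (expr >= maxValue) numBuckets + 1L
>   else (((expr - minValue) / (maxValue - minValue)) * numBuckets).toLong + 1L
> }
> 
> widthBucket(5.35, 0.024, 10.06, 5)  // == 3
> {code}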



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET

2020-05-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21117:

Target Version/s: 3.1.0

> Built-in SQL Function Support - WIDTH_BUCKET
> 
>
> Key: SPARK-21117
> URL: https://issues.apache.org/jira/browse/SPARK-21117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For a given expression, the {{WIDTH_BUCKET}} function returns the bucket 
> number into which the value of this expression would fall after being 
> evaluated.
> {code:sql}
> WIDTH_BUCKET (expr , min_value , max_value , num_buckets)
> {code}
> Ref: 
> https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21117:


Assignee: Apache Spark

> Built-in SQL Function Support - WIDTH_BUCKET
> 
>
> Key: SPARK-21117
> URL: https://issues.apache.org/jira/browse/SPARK-21117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> For a given expression, the {{WIDTH_BUCKET}} function returns the bucket 
> number into which the value of this expression would fall after being 
> evaluated.
> {code:sql}
> WIDTH_BUCKET (expr , min_value , max_value , num_buckets)
> {code}
> Ref: 
> https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET

2020-05-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21117:

Labels:   (was: bulk-closed)

> Built-in SQL Function Support - WIDTH_BUCKET
> 
>
> Key: SPARK-21117
> URL: https://issues.apache.org/jira/browse/SPARK-21117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For a given expression, the {{WIDTH_BUCKET}} function returns the bucket 
> number into which the value of this expression would fall after being 
> evaluated.
> {code:sql}
> WIDTH_BUCKET (expr , min_value , max_value , num_buckets)
> {code}
> Ref: 
> https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET

2020-05-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-21117:
-

> Built-in SQL Function Support - WIDTH_BUCKET
> 
>
> Key: SPARK-21117
> URL: https://issues.apache.org/jira/browse/SPARK-21117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For a given expression, the {{WIDTH_BUCKET}} function returns the bucket 
> number into which the value of this expression would fall after being 
> evaluated.
> {code:sql}
> WIDTH_BUCKET (expr , min_value , max_value , num_buckets)
> {code}
> Ref: 
> https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31582) Being able to not populate Hadoop classpath

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097580#comment-17097580
 ] 

Apache Spark commented on SPARK-31582:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/28376

> Being able to not populate Hadoop classpath
> ---
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> The Spark YARN client populates the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, for a 
> Spark build with embedded Hadoop, this can result in jar conflicts because the 
> Spark distribution may contain different versions of the Hadoop jars.
> We are adding a new YARN configuration to skip populating the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`.
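> For illustration, a submission-side sketch of how such a switch would typically be 
> used (assuming the new flag ends up named spark.yarn.populateHadoopClasspath, per 
> the linked PR; treat the property name as an assumption):
> {code:scala}
> import org.apache.spark.SparkConf
> 
> // Rely on the Hadoop jars bundled with the Spark distribution instead of the
> // cluster-provided yarn/mapreduce application classpath.
> val conf = new SparkConf()
>   .set("spark.yarn.populateHadoopClasspath", "false")
> {code}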



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31582) Being able to not populate Hadoop classpath

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097576#comment-17097576
 ] 

Apache Spark commented on SPARK-31582:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/28376

> Being able to not populate Hadoop classpath
> ---
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> The Spark YARN client populates the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, for a 
> Spark build with embedded Hadoop, this can result in jar conflicts because the 
> Spark distribution may contain different versions of the Hadoop jars.
> We are adding a new YARN configuration to skip populating the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29701) Different answers when empty input given in GROUPING SETS

2020-05-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097521#comment-17097521
 ] 

Dongjoon Hyun commented on SPARK-29701:
---

For the record, please see the discussion on the following PR. Although this is a 
correctness issue, the existing behavior of Apache Spark 2.4 is also reasonable and 
matches Oracle/SQL Server, so we keep that behavior consistently in Apache Spark 
3.0+. This issue has been moved from the PostgreSQL compatibility umbrella Jira into 
the Spark versioning umbrella Jira (SPARK-31085) to give better context.
 - [https://github.com/apache/spark/pull/27233]

{code:java}
To put a conclusion: I think this PR does fix a "correctness" issue according 
to the SQL standard. But as @tgravescs said in #27233 (comment) , the current 
behavior looks reasonable as well, and is the same with Oracle/SQL Server.

This is a very corner case, and most likely people don't care.
{code}

> Different answers when empty input given in GROUPING SETS
> -
>
> Key: SPARK-29701
> URL: https://issues.apache.org/jira/browse/SPARK-29701
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Critical
>  Labels: correctness
>
> A query below with an empty input seems to have different answers between 
> PgSQL and Spark;
> {code:java}
> postgres=# create table gstest_empty (a integer, b integer, v integer);
> CREATE TABLE
> postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping 
> sets ((a,b),());
>  a | b | sum | count 
> ---+---+-----+-------
>    |   |     |     0
> (1 row)
> {code}
> {code:java}
> scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by 
> grouping sets ((a,b),())""").show
> +---+---+------+--------+
> |  a|  b|sum(v)|count(1)|
> +---+---+------+--------+
> +---+---+------+--------+
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29701) Different answers when empty input given in GROUPING SETS

2020-05-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29701:
--
Parent: (was: SPARK-27764)
Issue Type: Bug  (was: Sub-task)

> Different answers when empty input given in GROUPING SETS
> -
>
> Key: SPARK-29701
> URL: https://issues.apache.org/jira/browse/SPARK-29701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Critical
>  Labels: correctness
>
> A query below with an empty input seems to have different answers between 
> PgSQL and Spark;
> {code:java}
> postgres=# create table gstest_empty (a integer, b integer, v integer);
> CREATE TABLE
> postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping 
> sets ((a,b),());
>  a | b | sum | count 
> ---+---+-----+-------
>    |   |     |     0
> (1 row)
> {code}
> {code:java}
> scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by 
> grouping sets ((a,b),())""").show
> +---+---+------+--------+
> |  a|  b|sum(v)|count(1)|
> +---+---+------+--------+
> +---+---+------+--------+
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29701) Different answers when empty input given in GROUPING SETS

2020-05-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29701:
--
Parent: SPARK-31085
Issue Type: Sub-task  (was: Bug)

> Different answers when empty input given in GROUPING SETS
> -
>
> Key: SPARK-29701
> URL: https://issues.apache.org/jira/browse/SPARK-29701
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Critical
>  Labels: correctness
>
> A query below with an empty input seems to have different answers between 
> PgSQL and Spark;
> {code:java}
> postgres=# create table gstest_empty (a integer, b integer, v integer);
> CREATE TABLE
> postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping 
> sets ((a,b),());
>  a | b | sum | count 
> ---+---+-----+-------
>    |   |     |     0
> (1 row)
> {code}
> {code:java}
> scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by 
> grouping sets ((a,b),())""").show
> +---+---+------+--------+
> |  a|  b|sum(v)|count(1)|
> +---+---+------+--------+
> +---+---+------+--------+
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if not resolved

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31523:


Assignee: (was: Apache Spark)

> LogicalPlan doCanonicalize should throw exception if not resolved
> -
>
> Key: SPARK-31523
> URL: https://issues.apache.org/jira/browse/SPARK-31523
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if not resolved

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31523:


Assignee: Apache Spark

> LogicalPlan doCanonicalize should throw exception if not resolved
> -
>
> Key: SPARK-31523
> URL: https://issues.apache.org/jira/browse/SPARK-31523
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31493) Optimize InSet to In according partition size at InSubqueryExec

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31493:


Assignee: (was: Apache Spark)

> Optimize InSet to In according partition size at InSubqueryExec
> ---
>
> Key: SPARK-31493
> URL: https://issues.apache.org/jira/browse/SPARK-31493
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if not resolved

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097506#comment-17097506
 ] 

Apache Spark commented on SPARK-31523:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/28304

> LogicalPlan doCanonicalize should throw exception if not resolved
> -
>
> Key: SPARK-31523
> URL: https://issues.apache.org/jira/browse/SPARK-31523
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31493) Optimize InSet to In according partition size at InSubqueryExec

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31493:


Assignee: Apache Spark

> Optimize InSet to In according partition size at InSubqueryExec
> ---
>
> Key: SPARK-31493
> URL: https://issues.apache.org/jira/browse/SPARK-31493
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31493) Optimize InSet to In according partition size at InSubqueryExec

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097505#comment-17097505
 ] 

Apache Spark commented on SPARK-31493:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/28269

> Optimize InSet to In according partition size at InSubqueryExec
> ---
>
> Key: SPARK-31493
> URL: https://issues.apache.org/jira/browse/SPARK-31493
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31551:


Assignee: (was: Apache Spark)

> createSparkUser lost user's non-Hadoop credentials
> --
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
> {code:java}
>    def createSparkUser(): UserGroupInformation = {
> val user = Utils.getCurrentUserName()
> logDebug("creating UGI for user: " + user)
> val ugi = UserGroupInformation.createRemoteUser(user)
> transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
> ugi
>   }
>   def transferCredentials(source: UserGroupInformation, dest: 
> UserGroupInformation): Unit = {
> dest.addCredentials(source.getCredentials())
>   }
>   def getCurrentUserName(): String = {
> Option(System.getenv("SPARK_USER"))
>   .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
>   }
> {code}
> The *transferCredentials* function can only transfer Hadoop credentials such as 
> delegation tokens.
>  However, other credentials stored in UGI.subject.getPrivateCredentials will be 
> lost here, such as:
>  # Non-Hadoop credentials:
>  For example, [Kafka creds 
> |https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
>  # Newly supported or third-party Hadoop credentials:
>  For example, to support OAuth/JWT token authentication on Hadoop, we need to store 
> the OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens are 
> not supposed to be managed by Hadoop Credentials (which currently only holds Hadoop 
> secret keys and delegation tokens).
> Another issue is that the *SPARK_USER* only gets the 
> UserGroupInformation.getCurrentUser().getShortUserName() of the user, which 
> may lost the user's fully qualified user name. We should better use the 
> *getUserName* to get fully qualified user name in our client side, which is 
> aligned to 
> *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.
> Related to https://issues.apache.org/jira/browse/SPARK-1051
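> A minimal illustrative sketch of the idea (not the actual fix; obtaining a UGI's 
> Subject via a doAs/getSubject round trip is just one possible way to do it):
> {code:scala}
> import java.security.{AccessController, PrivilegedAction}
> import javax.security.auth.Subject
> import org.apache.hadoop.security.UserGroupInformation
>
> // Illustrative only: copy the Hadoop Credentials *and* the Subject's credentials.
> def transferAllCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = {
>   dest.addCredentials(source.getCredentials())
>   def subjectOf(ugi: UserGroupInformation): Subject =
>     ugi.doAs(new PrivilegedAction[Subject] {
>       override def run(): Subject = Subject.getSubject(AccessController.getContext())
>     })
>   val (src, dst) = (subjectOf(source), subjectOf(dest))
>   if (src != null && dst != null) {
>     dst.getPrivateCredentials().addAll(src.getPrivateCredentials())
>     dst.getPublicCredentials().addAll(src.getPublicCredentials())
>   }
> }
> {code}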



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31487) Move slots check of barrier job from DAGScheduler to TaskSchedulerImpl

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31487:


Assignee: (was: Apache Spark)

> Move slots check of barrier job from DAGScheduler to TaskSchedulerImpl
> --
>
> Key: SPARK-31487
> URL: https://issues.apache.org/jira/browse/SPARK-31487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>
> Move the check to TaskSchedulerImpl to avoid re-submitting the same job 
> multiple times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31458) LOAD DATA support for builtin datasource tables

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31458:


Assignee: Apache Spark

> LOAD DATA support for builtin datasource tables
> ---
>
> Key: SPARK-31458
> URL: https://issues.apache.org/jira/browse/SPARK-31458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> LOAD DATA support for built-in datasource tables such as Parquet, ORC, etc.
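> As an illustration of the intended usage (a hedged sketch; the table and path 
> below are hypothetical, and today LOAD DATA is only accepted for Hive serde 
> tables):
> {code:scala}
> // Hypothetical names, only to show the statement this change would enable.
> spark.sql("CREATE TABLE pq_tbl (id INT, name STRING) USING parquet")
> spark.sql("LOAD DATA INPATH '/tmp/staging/part-00000.parquet' INTO TABLE pq_tbl")
> spark.sql("SELECT * FROM pq_tbl").show()
> {code}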



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29274) Should not coerce decimal type to double type when it's join column

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29274:


Assignee: Pengfei Chang  (was: Apache Spark)

> Should not coerce decimal type to double type when it's join column
> ---
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Pengfei Chang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4,
>   cast(v1 as double) = cast(v2 as double), v1 = v2 
> from (select cast('1001636981212' as decimal(21, 0)) as v1,
>   cast('1001636981213' as decimal(21, 0)) as v2) t;
> 1.00163697E20 1.00163697E20   truefalse
> {code}
>  
> It's a real case in our production:
> !image-2019-09-27-20-20-24-238.png|width=100%!
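> As a hedged workaround sketch (not part of this report): casting the string side 
> explicitly keeps the comparison in decimal(21,0) instead of double:
> {code:scala}
> // Reuses t1/t2 from the reproduction above.
> spark.sql(
>   """SELECT *
>     |FROM t1 JOIN t2
>     |  ON t1.incdata_id = CAST(t2.incdata_id AS DECIMAL(21, 0))""".stripMargin).explain()
> {code}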



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31377:


Assignee: (was: Apache Spark)

> Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
> --
>
> Key: SPARK-31377
> URL: https://issues.apache.org/jira/browse/SPARK-31377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Minor
>
> For some combinations of join algorithm and join types there are no unit 
> tests for the "number of output rows" metric.
> The missing unit tests include the following (see the sketch after this list).
>  * ShuffledHashJoin: leftOuter, RightOuter, LeftAnti, LeftSemi
>  * BroadcastNestedLoopJoin: RightOuter
>  * BroadcastHashJoin: LeftAnti
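> A hedged sketch of the shape such a test could take (not the final test code; it 
> assumes AQE is disabled and that the planner actually selects the join under 
> test, which may require configs such as spark.sql.join.preferSortMergeJoin=false 
> and spark.sql.autoBroadcastJoinThreshold=-1):
> {code:scala}
> import org.apache.spark.sql.execution.joins.ShuffledHashJoinExec
>
> // Left-semi join of range(10) with range(5) on id produces 5 rows.
> val df = spark.range(10).join(spark.range(5), Seq("id"), "left_semi")
> df.collect()
> val numOutputRows = df.queryExecution.executedPlan.collectFirst {
>   case j: ShuffledHashJoinExec => j.metrics("numOutputRows").value
> }
> assert(numOutputRows.contains(5L))
> {code}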



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31454) An optimized K-Means based on DenseMatrix and GEMM

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31454:


Assignee: (was: Apache Spark)

> An optimized K-Means based on DenseMatrix and GEMM
> --
>
> Key: SPARK-31454
> URL: https://issues.apache.org/jira/browse/SPARK-31454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Priority: Major
>  Labels: performance
>
> The main computations in K-Means are calculating distances between individual 
> points and center points. The current K-Means implementation is vector-based, 
> which can't take advantage of optimized native BLAS libraries.
> When the original points are represented as dense vectors, our approach is to 
> modify the original input data structures to a DenseMatrix-based one by 
> grouping several points together. The original distance calculations can be 
> translated into a Matrix multiplication then optimized native GEMM routines 
> (Intel MKL, OpenBLAS etc.) can be used. This approach can also work with 
> sparse vectors despite having larger memory consumption when translating 
> sparse vectors to dense matrix.
> Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the 
> training performance by *3.5x* with Intel MKL, looks very promising!
> To minimize end-user impact, the proposed changes use config parameters to 
> control whether to turn on this implementation, without modifying public interfaces. 
> Parameter rowsPerMatrix is used to control how many points are grouped 
> together to build a DenseMatrix. An example:
> $ spark-submit --master $SPARK_MASTER \
> --conf "spark.ml.kmeans.matrixImplementation.enabled=true" \
>     --conf "spark.ml.kmeans.matrixImplementation.rowsPerMatrix=5000" \
>     --class org.apache.spark.examples.ml.KMeansExample 
> Several code changes are made in "spark.ml" namespace as we think 
> "spark.mllib" is in maintenance mode, some are duplications from spark.mllib 
> for using private definitions in the same package: 
>  - Modified: KMeans.scala, DatasetUtils.scala
>  - Added: KMeansMatrixImpl.scala
>  - Duplications: DistanceMeasure.scala, LocalKMeans.scala
> If this general idea is accepted by the community, we are willing to contribute 
> our code upstream, polish the implementation according to feedback, and 
> produce benchmarks.
>  
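> For reference, the algebra behind the approach, as a hedged local sketch (not the 
> proposed implementation): ||x - c||^2 = ||x||^2 + ||c||^2 - 2*x.c, where the 
> cross terms of a block of points against all centers are a single matrix 
> multiplication (a GEMM under the hood):
> {code:scala}
> import org.apache.spark.ml.linalg.DenseMatrix
>
> // points: n x d, centers: k x d (both dense). Returns an n x k matrix of
> // squared distances; points * centers^T is the part that maps to GEMM.
> def pairwiseSquaredDistances(points: DenseMatrix, centers: DenseMatrix): Array[Array[Double]] = {
>   val cross = points.multiply(centers.transpose)
>   def rowNorms(m: DenseMatrix): Array[Double] =
>     Array.tabulate(m.numRows)(i => (0 until m.numCols).map(j => m(i, j) * m(i, j)).sum)
>   val (pn, cn) = (rowNorms(points), rowNorms(centers))
>   Array.tabulate(points.numRows, centers.numRows)((i, j) => pn(i) + cn(j) - 2.0 * cross(i, j))
> }
> {code}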



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31467) Fix test issue with table named `test` in hive/SQLQuerySuite

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31467:


Assignee: (was: Apache Spark)

> Fix test issue with table named `test` in hive/SQLQuerySuite
> 
>
> Key: SPARK-31467
> URL: https://issues.apache.org/jira/browse/SPARK-31467
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: feiwang
>Priority: Major
>
> If we add a unit test in hive/SQLQuerySuite that uses a table named `test`, we 
> may hit these exceptions.
> {code:java}
>  org.apache.spark.sql.AnalysisException: Inserting into an RDD-based table is 
> not allowed.;;
> [info] 'InsertIntoTable Project [_1#1403 AS key#1406, _2#1404 AS value#1407], 
> Map(name -> Some(n1)), true, false
> [info] +- Project [col1#3850]
> [info]+- LocalRelation [col1#3850]
> {code}
> {code:java}
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view 'test' already exists in database 'default';
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:226)
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216)
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216)
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:216)
> {code}
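> A hedged sketch of how a new test can avoid the clash (assuming the suite mixes 
> in SQLTestUtils, as hive/SQLQuerySuite does): use a unique table name and wrap it 
> in withTable so the table is dropped afterwards.
> {code:scala}
> import org.apache.spark.sql.Row
>
> // Inside a suite extending QueryTest with SQLTestUtils; the table name is
> // arbitrary, the point is that withTable drops it even if the test fails.
> test("SPARK-31467 example") {
>   withTable("test_spark_31467") {
>     sql("CREATE TABLE test_spark_31467 (key INT, value STRING) USING parquet")
>     sql("INSERT INTO test_spark_31467 VALUES (1, 'a')")
>     checkAnswer(sql("SELECT key FROM test_spark_31467"), Row(1))
>   }
> }
> {code}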



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29274) Should not coerce decimal type to double type when it's join column

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29274:


Assignee: Apache Spark  (was: Pengfei Chang)

> Should not coerce decimal type to double type when it's join column
> ---
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4,
>   cast(v1 as double) = cast(v2 as double), v1 = v2 
> from (select cast('1001636981212' as decimal(21, 0)) as v1,
>   cast('1001636981213' as decimal(21, 0)) as v2) t;
> 1.00163697E20 1.00163697E20   truefalse
> {code}
>  
> It's a real case in our production:
> !image-2019-09-27-20-20-24-238.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31431) CalendarInterval encoder support

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31431:


Assignee: Apache Spark

> CalendarInterval encoder support
> 
>
> Key: SPARK-31431
> URL: https://issues.apache.org/jira/browse/SPARK-31431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> CalendarInterval can be converted to/from the internal Spark SQL 
> representation when it is a member of a Scala product type (e.g. tuples or case 
> classes), but not as a top-level primitive type.
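> A hedged snippet (e.g. in spark-shell, where a `spark` session is in scope) to 
> see the asymmetry described above; the commented-out line is what this ticket 
> proposes to support:
> {code:scala}
> import org.apache.spark.unsafe.types.CalendarInterval
> import spark.implicits._
>
> // Per the description above: works when CalendarInterval is a field of a product type.
> val asTupleField = Seq(("i", new CalendarInterval(0, 1, 0L))).toDS()
>
> // Proposed by this ticket: CalendarInterval as the top-level (primitive) type.
> // val topLevel = Seq(new CalendarInterval(0, 1, 0L)).toDS()
> {code}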



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31440) Improve SQL Rest API

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31440:


Assignee: (was: Apache Spark)

> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: current_version.json, improved_version.json
>
>
> SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
> apply following improvements on SQL Rest API by aligning Spark-UI.
> *Proposed Improvements:*
> 1- Support Physical Operations and group metrics per physical operation by 
> aligning Spark UI.
> 2- Support *wholeStageCodegenId* for Physical Operations
> 3- *nodeId* can be useful for grouping metrics and sorting physical 
> operations (according to execution order) to differentiate same operators (if 
> used multiple times during the same query execution) and their metrics.
> 4- Filter *empty* metrics by aligning with Spark UI - SQL Tab. Currently, 
> Spark UI does not show empty metrics.
> 5- Remove line breakers(*\n*) from *metricValue*.
> 6- *planDescription* can be *optional* Http parameter to avoid network cost 
> where there is specially complex jobs creating big-plans.
> 7- *metrics* attribute needs to be exposed at the bottom order as 
> *metricDetails*. Specially, this can be useful for the user where 
> *metricDetails* array size is high. 
> 8- Reverse order on *metricDetails* aims to match with Spark UI by supporting 
> Physical Operators' execution order.
> *Attachments:*
>  Please find both *current* and *improved* versions of the results as 
> attached for following SQL Rest Endpoint:
> {code:java}
> curl -X GET 
> http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31459) When using the insert overwrite directory syntax, if the target path is an existing file, the final run result is incorrect

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31459:


Assignee: (was: Apache Spark)

> When using the insert overwrite directory syntax, if the target path is an 
> existing file, the final run result is incorrect
> ---
>
> Key: SPARK-31459
> URL: https://issues.apache.org/jira/browse/SPARK-31459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: spark2.4.5
>Reporter: mcdull_zhang
>Priority: Major
>  Labels: sql
>
> When using the insert overwrite directory syntax, if the target path is an 
> existing file, the final operation result is incorrect.
> At present, Spark will not delete the existing files. After the calculation 
> is completed, one of the result files will be renamed to the result path.
> This is different from Hive's behavior: Hive deletes the existing target 
> file.
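> A hedged sketch of the scenario (the path is hypothetical; the precondition is 
> that the target already exists as a regular file rather than a directory):
> {code:scala}
> // Precondition (outside Spark): /tmp/out already exists and is a plain file.
> spark.sql(
>   """INSERT OVERWRITE DIRECTORY '/tmp/out'
>     |USING parquet
>     |SELECT 1 AS id, 'a' AS name""".stripMargin)
> // Per this report, Spark does not delete the pre-existing file first (Hive
> // would), so the final contents of /tmp/out are not what one would expect.
> {code}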



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25065) Driver and executors pick the wrong logging configuration file.

2020-05-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097501#comment-17097501
 ] 

Apache Spark commented on SPARK-25065:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/27735

> Driver and executors pick the wrong logging configuration file.
> ---
>
> Key: SPARK-25065
> URL: https://issues.apache.org/jira/browse/SPARK-25065
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Currently, when running in kubernetes mode, it sets necessary configuration 
> properties by creating a spark.properties file and mounting a conf dir.
> The shipped Dockerfile does not copy conf to the image; this is on purpose 
> and well understood. However, one may want to have a custom 
> logging configuration file in the image's conf directory.
> It is not enough to copy it into the Spark conf dir 
> of the resulting image, as that directory is reset during the Kubernetes 
> conf-volume mount step.
>  
> To reproduce, add {code}-Dlog4j.debug{code} to 
> {code:java}spark.(executor|driver).extraJavaOptions{code}. This way, it was 
> found that the provided log4j file is not picked up, and the one coming from 
> the kubernetes-client jar is picked up by the driver process instead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31432) Make SPARK_JARS_DIR configurable

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31432:


Assignee: (was: Apache Spark)

> Make SPARK_JARS_DIR configurable
> 
>
> Key: SPARK-31432
> URL: https://issues.apache.org/jira/browse/SPARK-31432
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.1.0
>Reporter: Shingo Furuyama
>Priority: Minor
>
> In the scripts under bin/sbin, it would be better if we could specify 
> SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.
> Our use case:
>  We are trying to run Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an 
> incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we 
> tweak the jars with the Maven Shade Plugin.
>  The jars differ slightly from the jars in Spark 2.4.5, and we place them in a 
> directory different from the default. So it would be useful if we could set 
> SPARK_JARS_DIR for the bin/sbin scripts to point to that directory.
>  We can do that without this modification by deploying one Spark home per set 
> of jars, but that is somewhat redundant.
> Common use case:
>  I believe there are similar use cases. For example, deploying Spark builds for 
> Scala 2.11 and 2.12 on one machine and switching the jars location by setting 
> SPARK_JARS_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31551) createSparkUser lost user's non-Hadoop credentials

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31551:


Assignee: Apache Spark

> createSparkUser lost user's non-Hadoop credentials
> --
>
> Key: SPARK-31551
> URL: https://issues.apache.org/jira/browse/SPARK-31551
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Yuqi Wang
>Assignee: Apache Spark
>Priority: Major
>
> See current 
> *[createSparkUser|https://github.com/apache/spark/blob/263f04db865920d9c10251517b00a1b477b58ff1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76]*:
> {code:java}
> def createSparkUser(): UserGroupInformation = {
>   val user = Utils.getCurrentUserName()
>   logDebug("creating UGI for user: " + user)
>   val ugi = UserGroupInformation.createRemoteUser(user)
>   transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
>   ugi
> }
>
> def transferCredentials(source: UserGroupInformation, dest: UserGroupInformation): Unit = {
>   dest.addCredentials(source.getCredentials())
> }
>
> def getCurrentUserName(): String = {
>   Option(System.getenv("SPARK_USER"))
>     .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
> }
> {code}
> The *transferCredentials* function can only transfer Hadoop credentials such as 
> delegation tokens.
>  However, other credentials stored in UGI.subject.getPrivateCredentials will be 
> lost here, such as:
>  # Non-Hadoop credentials:
>  For example, [Kafka creds 
> |https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395]
>  # Newly supported or third-party Hadoop credentials:
>  For example, to support OAuth/JWT token authn on Hadoop, we need to store the 
> OAuth/JWT token in UGI.subject.getPrivateCredentials. However, these tokens 
> are not supposed to be managed by Hadoop Credentials (currently it only holds 
> Hadoop secret keys and delegation tokens).
> Another issue is that, apart from *SPARK_USER*, we only get 
> UserGroupInformation.getCurrentUser().getShortUserName(), which 
> may lose the user's fully qualified user name. We should instead use 
> *getUserName* to get the fully qualified user name on the client side, which is 
> aligned with 
> *[HADOOP_PROXY_USER|https://github.com/apache/hadoop/blob/30ef8d0f1a1463931fe581a46c739dad4c8260e4/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L716-L720]*.
> Related to https://issues.apache.org/jira/browse/SPARK-1051



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31440) Improve SQL Rest API

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31440:


Assignee: Apache Spark

> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Assignee: Apache Spark
>Priority: Major
> Attachments: current_version.json, improved_version.json
>
>
> SQL Rest API exposes query execution metrics as Public API. This Jira aims to 
> apply following improvements on SQL Rest API by aligning Spark-UI.
> *Proposed Improvements:*
> 1- Support Physical Operations and group metrics per physical operation by 
> aligning Spark UI.
> 2- Support *wholeStageCodegenId* for Physical Operations
> 3- *nodeId* can be useful for grouping metrics and sorting physical 
> operations (according to execution order) to differentiate same operators (if 
> used multiple times during the same query execution) and their metrics.
> 4- Filter *empty* metrics by aligning with Spark UI - SQL Tab. Currently, 
> Spark UI does not show empty metrics.
> 5- Remove line breakers(*\n*) from *metricValue*.
> 6- *planDescription* can be *optional* Http parameter to avoid network cost 
> where there is specially complex jobs creating big-plans.
> 7- *metrics* attribute needs to be exposed at the bottom order as 
> *metricDetails*. Specially, this can be useful for the user where 
> *metricDetails* array size is high. 
> 8- Reverse order on *metricDetails* aims to match with Spark UI by supporting 
> Physical Operators' execution order.
> *Attachments:*
>  Please find both *current* and *improved* versions of the results as 
> attached for following SQL Rest Endpoint:
> {code:java}
> curl -X GET 
> http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31432) Make SPARK_JARS_DIR configurable

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31432:


Assignee: Apache Spark

> Make SPARK_JARS_DIR configurable
> 
>
> Key: SPARK-31432
> URL: https://issues.apache.org/jira/browse/SPARK-31432
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.1.0
>Reporter: Shingo Furuyama
>Assignee: Apache Spark
>Priority: Minor
>
> In the scripts under bin/sbin, it would be better if we could specify 
> SPARK_JARS_DIR in the same way as SPARK_CONF_DIR.
> Our use case:
>  We are trying to run Spark 2.4.5 with YARN on HDP 2.6.4. Since there is an 
> incompatible conflict on commons-lang3 between Spark 2.4.5 and HDP 2.6.4, we 
> tweak the jars with the Maven Shade Plugin.
>  The jars differ slightly from the jars in Spark 2.4.5, and we place them in a 
> directory different from the default. So it would be useful if we could set 
> SPARK_JARS_DIR for the bin/sbin scripts to point to that directory.
>  We can do that without this modification by deploying one Spark home per set 
> of jars, but that is somewhat redundant.
> Common use case:
>  I believe there are similar use cases. For example, deploying Spark builds for 
> Scala 2.11 and 2.12 on one machine and switching the jars location by setting 
> SPARK_JARS_DIR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31431) CalendarInterval encoder support

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31431:


Assignee: (was: Apache Spark)

> CalendarInterval encoder support
> 
>
> Key: SPARK-31431
> URL: https://issues.apache.org/jira/browse/SPARK-31431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> CalendarInterval can be converted to/from the internal Spark SQL 
> representation when it is a member of a Scala product type (e.g. tuples or case 
> classes), but not as a top-level primitive type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31418:


Assignee: (was: Apache Spark)

> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> 
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for the task. In order to retry the task, Spark checks if there is 
> an idle blacklisted executor that can be killed and replaced to retry the task; 
> if not, it aborts the job without doing max retries.
> In the context of dynamic allocation this can be improved: instead of killing 
> the blacklisted idle executor (it's possible there is no idle blacklisted 
> executor), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below, although 
> this example should fail eventually; it just shows that it is not retried 
> spark.task.maxFailures times: 
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are 
> various other cases where this can fail as well.
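> A hedged sketch of a local setup for the reproduction above (exact values are 
> illustrative; dynamic allocation may additionally require shuffle tracking or an 
> external shuffle service depending on the deployment):
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
>
> // The relevant pieces are blacklisting plus dynamic allocation with minExecutors=1.
> val conf = new SparkConf()
>   .setAppName("blacklist-dynalloc-repro")
>   .set("spark.blacklist.enabled", "true")
>   .set("spark.dynamicAllocation.enabled", "true")
>   .set("spark.dynamicAllocation.minExecutors", "1")
>   .set("spark.task.maxFailures", "4")
> val sc = new SparkContext(conf)
>
> def test(a: Int) = { a.asInstanceOf[String] }   // same failing task as above
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect()
> {code}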



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31106) Support is_json function

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31106:


Assignee: Apache Spark

> Support is_json function
> 
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Assignee: Apache Spark
>Priority: Major
>
> This function will allow users to verify whether the given string is valid 
> JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. 
> `NULL` is returned for `NULL` input.
> DBMSs supporting this function include:
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB
>  * Amazon Redshift
>  * IBM Db2
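> A hedged sketch of the proposed usage (the function does not exist in Spark yet, 
> so the exact name and semantics are whatever this ticket settles on):
> {code:scala}
> // Proposed behaviour: true for valid JSON, false otherwise, NULL for NULL input.
> spark.sql("SELECT is_json('{\"a\": 1}'), is_json('not json'), is_json(NULL)").show()
> {code}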



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31458) LOAD DATA support for builtin datasource tables

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31458:


Assignee: (was: Apache Spark)

> LOAD DATA support for builtin datasource tables
> ---
>
> Key: SPARK-31458
> URL: https://issues.apache.org/jira/browse/SPARK-31458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> LOAD DATA support for built-in datasource tables such as Parquet, ORC, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31467) Fix test issue with table named `test` in hive/SQLQuerySuite

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31467:


Assignee: Apache Spark

> Fix test issue with table named `test` in hive/SQLQuerySuite
> 
>
> Key: SPARK-31467
> URL: https://issues.apache.org/jira/browse/SPARK-31467
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: feiwang
>Assignee: Apache Spark
>Priority: Major
>
> If we add a unit test in hive/SQLQuerySuite that uses a table named `test`, we 
> may hit these exceptions.
> {code:java}
>  org.apache.spark.sql.AnalysisException: Inserting into an RDD-based table is 
> not allowed.;;
> [info] 'InsertIntoTable Project [_1#1403 AS key#1406, _2#1404 AS value#1407], 
> Map(name -> Some(n1)), true, false
> [info] +- Project [col1#3850]
> [info]+- LocalRelation [col1#3850]
> {code}
> {code:java}
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view 'test' already exists in database 'default';
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:226)
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216)
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216)
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> [info]   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:216)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31487) Move slots check of barrier job from DAGScheduler to TaskSchedulerImpl

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31487:


Assignee: Apache Spark

> Move slots check of barrier job from DAGScheduler to TaskSchedulerImpl
> --
>
> Key: SPARK-31487
> URL: https://issues.apache.org/jira/browse/SPARK-31487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> Move the check to TaskSchedulerImpl to avoid re-submitting the same job 
> multiple times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29791) Add a spark config to allow user to use executor cores virtually.

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29791:


Assignee: Apache Spark

> Add a spark config to allow user to use executor cores virtually.
> -
>
> Key: SPARK-29791
> URL: https://issues.apache.org/jira/browse/SPARK-29791
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: zengrui
>Assignee: Apache Spark
>Priority: Minor
> Attachments: 0001-add-implementation-for-issue-SPARK-29791.patch
>
>
> We can configure the executor cores via "spark.executor.cores". For example, if 
> we configure 8 cores for an executor, then the driver can only schedule 8 tasks 
> to this executor concurrently. In fact, in most cases a task does not fully 
> occupy a core; much of its time is spent on disk IO or network IO. So we 
> could make the driver schedule more than 8 tasks to this executor concurrently 
> (by virtualizing the cores the executor reports to the driver to 16, 32 or 
> more), which would make the whole job execute more quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31418:


Assignee: Apache Spark

> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> 
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Venkata krishnan Sowrirajan
>Assignee: Apache Spark
>Priority: Major
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for the task. In order to retry the task, Spark checks if there is 
> an idle blacklisted executor that can be killed and replaced to retry the task; 
> if not, it aborts the job without doing max retries.
> In the context of dynamic allocation this can be improved: instead of killing 
> the blacklisted idle executor (it's possible there is no idle blacklisted 
> executor), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below, although 
> this example should fail eventually; it just shows that it is not retried 
> spark.task.maxFailures times: 
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are 
> various other cases where this can fail as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31483) pyspark shell IPython launch throws ".../pyspark/bin/load-spark-env.sh: No such file or directory"

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31483:


Assignee: (was: Apache Spark)

> pyspark shell IPython launch throws ".../pyspark/bin/load-spark-env.sh: No 
> such file or directory"
> --
>
> Key: SPARK-31483
> URL: https://issues.apache.org/jira/browse/SPARK-31483
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 2.4.5
> Environment: $ uname -a
> Darwin mengyu-C02Z7885LVDQ 19.3.0 Darwin Kernel Version 19.3.0: Thu Jan  9 
> 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64 x86_64
> $ python -V
> Python 3.7.7
> $ ipython -V
> 7.13.0
>  
>Reporter: Zhang
>Priority: Major
>
> I'm trying to launch the pyspark shell with the IPython interface via
> {{PYSPARK_DRIVER_PYTHON=ipython pyspark}}
> However it hits ".../pyspark/bin/load-spark-env.sh: No such file or directory"
> {{(py3-spark) mengyu@mengyu-C02Z7885LVDQ:~/workspace/tmp$ 
> PYSPARK_DRIVER_PYTHON=ipython pyspark}}
> {{/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 24: 
> /Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh:
>  No such file or directory}}
> {{/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 77: 
> /Users/mengyu/workspace/tmp//Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/spark-submit:
>  No such file or directory}}
> {{/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 77: exec: 
> /Users/mengyu/workspace/tmp//Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/spark-submit:
>  cannot execute: No such file or directory}}
>  
> It is strange because the path 
> "{{/Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh}}"
>  exists.
>  
> {{$ file 
> /Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh}}{{/Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh:
>  Bourne-Again shell script text executable, ASCII text}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31478) Executors Stop() method is not executed when they are killed

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31478:


Assignee: (was: Apache Spark)

> Executors Stop() method is not executed when they are killed
> 
>
> Key: SPARK-31478
> URL: https://issues.apache.org/jira/browse/SPARK-31478
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> In dynamic allocation, when executors are killed, the executors' stop() method 
> is never called, so executors never go down properly.
> In SPARK-29152, shutdown hook was added to stop the executors properly.
> Instead of forcing a shutdown hook we should ask executors to stop themselves 
> before killing them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31478) Executors Stop() method is not executed when they are killed

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31478:


Assignee: Apache Spark

> Executors Stop() method is not executed when they are killed
> 
>
> Key: SPARK-31478
> URL: https://issues.apache.org/jira/browse/SPARK-31478
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Assignee: Apache Spark
>Priority: Minor
>
> In dynamic allocation, when executors are killed, the executors' stop() method 
> is never called, so executors never go down properly.
> In SPARK-29152, shutdown hook was added to stop the executors properly.
> Instead of forcing a shutdown hook we should ask executors to stop themselves 
> before killing them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31106) Support is_json function

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31106:


Assignee: (was: Apache Spark)

> Support is_json function
> 
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will allow users to verify whether the given string is valid 
> JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. 
> `NULL` is returned for `NULL` input.
> DBMSs supporting this function include:
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB
>  * Amazon Redshift
>  * IBM Db2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31377:


Assignee: Apache Spark

> Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
> --
>
> Key: SPARK-31377
> URL: https://issues.apache.org/jira/browse/SPARK-31377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Assignee: Apache Spark
>Priority: Minor
>
> For some combinations of join algorithm and join types there are no unit 
> tests for the "number of output rows" metric.
> The missing unit tests include the following.
>  * ShuffledHashJoin: leftOuter, RightOuter, LeftAnti, LeftSemi
>  * BroadcastNestedLoopJoin: RightOuter
>  * BroadcastHashJoin: LeftAnti



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31454) An optimized K-Means based on DenseMatrix and GEMM

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31454:


Assignee: Apache Spark

> An optimized K-Means based on DenseMatrix and GEMM
> --
>
> Key: SPARK-31454
> URL: https://issues.apache.org/jira/browse/SPARK-31454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Assignee: Apache Spark
>Priority: Major
>  Labels: performance
>
> The main computations in K-Means are calculating distances between individual 
> points and center points. The current K-Means implementation is vector-based, 
> which can't take advantage of optimized native BLAS libraries.
> When the original points are represented as dense vectors, our approach is to 
> modify the original input data structures to a DenseMatrix-based one by 
> grouping several points together. The original distance calculations can be 
> translated into a Matrix multiplication then optimized native GEMM routines 
> (Intel MKL, OpenBLAS etc.) can be used. This approach can also work with 
> sparse vectors despite having larger memory consumption when translating 
> sparse vectors to dense matrix.
> Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the 
> training performance by *3.5x* with Intel MKL, looks very promising!
> To minimize end-user impact, the proposed changes use config parameters to 
> control whether to turn on this implementation, without modifying public interfaces. 
> Parameter rowsPerMatrix is used to control how many points are grouped 
> together to build a DenseMatrix. An example:
> $ spark-submit --master $SPARK_MASTER \
> --conf "spark.ml.kmeans.matrixImplementation.enabled=true" \
>     --conf "spark.ml.kmeans.matrixImplementation.rowsPerMatrix=5000" \
>     --class org.apache.spark.examples.ml.KMeansExample 
> Several code changes are made in "spark.ml" namespace as we think 
> "spark.mllib" is in maintenance mode, some are duplications from spark.mllib 
> for using private definitions in the same package: 
>  - Modified: KMeans.scala, DatasetUtils.scala
>  - Added: KMeansMatrixImpl.scala
>  - Duplications: DistanceMeasure.scala, LocalKMeans.scala
> If this general idea is accepted by the community, we are willing to contribute 
> our code upstream, polish the implementation according to feedback, and 
> produce benchmarks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31459) When using the insert overwrite directory syntax, if the target path is an existing file, the final run result is incorrect

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31459:


Assignee: Apache Spark

> When using the insert overwrite directory syntax, if the target path is an 
> existing file, the final run result is incorrect
> ---
>
> Key: SPARK-31459
> URL: https://issues.apache.org/jira/browse/SPARK-31459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: spark2.4.5
>Reporter: mcdull_zhang
>Assignee: Apache Spark
>Priority: Major
>  Labels: sql
>
> When using the insert overwrite directory syntax, if the target path is an 
> existing file, the final operation result is incorrect.
> At present, Spark will not delete the existing files. After the calculation 
> is completed, one of the result files will be renamed to the result path.
> This is different from Hive's behavior: Hive deletes the existing target 
> file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31483) pyspark shell IPython launch throws ".../pyspark/bin/load-spark-env.sh: No such file or directory"

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31483:


Assignee: Apache Spark

> pyspark shell IPython launch throws ".../pyspark/bin/load-spark-env.sh: No 
> such file or directory"
> --
>
> Key: SPARK-31483
> URL: https://issues.apache.org/jira/browse/SPARK-31483
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 2.4.5
> Environment: $ uname -a
> Darwin mengyu-C02Z7885LVDQ 19.3.0 Darwin Kernel Version 19.3.0: Thu Jan  9 
> 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64 x86_64
> $ python -V
> Python 3.7.7
> $ ipython -V
> 7.13.0
>  
>Reporter: Zhang
>Assignee: Apache Spark
>Priority: Major
>
> I'm trying to launch the pyspark shell with the IPython interface via
> {{PYSPARK_DRIVER_PYTHON=ipython pyspark}}
> However it hits ".../pyspark/bin/load-spark-env.sh: No such file or directory"
> {{(py3-spark) mengyu@mengyu-C02Z7885LVDQ:~/workspace/tmp$ 
> PYSPARK_DRIVER_PYTHON=ipython pyspark}}
> {{/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 24: 
> /Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh:
>  No such file or directory}}
> {{/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 77: 
> /Users/mengyu/workspace/tmp//Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/spark-submit:
>  No such file or directory}}
> {{/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 77: exec: 
> /Users/mengyu/workspace/tmp//Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/spark-submit:
>  cannot execute: No such file or directory}}
>  
> It is strange because the path 
> "{{/Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh}}"
>  exists.
>  
> {{$ file 
> /Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh}}{{/Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh:
>  Bourne-Again shell script text executable, ASCII text}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29791) Add a spark config to allow user to use executor cores virtually.

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29791:


Assignee: (was: Apache Spark)

> Add a spark config to allow user to use executor cores virtually.
> -
>
> Key: SPARK-29791
> URL: https://issues.apache.org/jira/browse/SPARK-29791
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: zengrui
>Priority: Minor
> Attachments: 0001-add-implementation-for-issue-SPARK-29791.patch
>
>
> We can configure the executor cores via "spark.executor.cores". For example, if 
> we configure 8 cores for an executor, then the driver can only schedule 8 tasks 
> to this executor concurrently. In fact, in most cases a task does not fully 
> occupy a core; much of its time is spent on disk IO or network IO. So we 
> could make the driver schedule more than 8 tasks to this executor concurrently 
> (by virtualizing the cores the executor reports to the driver to 16, 32 or 
> more), which would make the whole job execute more quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31056) Add CalendarIntervals division

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31056:


Assignee: (was: Apache Spark)

> Add CalendarIntervals division
> --
>
> Key: SPARK-31056
> URL: https://issues.apache.org/jira/browse/SPARK-31056
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> {{CalendarInterval}} should be allowed for division. The {{CalendarInterval}} 
> consists of three time components: {{months}}, {{days}} and {{microseconds}}. 
> Division can only be defined between intervals that each have a single 
> non-zero time component, and both intervals must have the same non-zero time 
> component; otherwise the division expression would be ambiguous.
> This makes it possible to evaluate the magnitude of a {{CalendarInterval}} in SQL 
> expressions:
> {code}
> Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 
> 13:30:25")))
>   .toDF("start", "end")
>   .withColumn("interval", $"end" - $"start")
>   .withColumn("interval [h]", $"interval" / lit("1 
> hour").cast(CalendarIntervalType))
>   .withColumn("rate [€/h]", lit(1.45))
>   .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
>   .show(false)
> +-------------------+-------------------+-----------------------------+------------+----------+----------+
> |start              |end                |interval                     |interval [h]|rate [€/h]|price [€] |
> +-------------------+-------------------+-----------------------------+------------+----------+----------+
> |2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25 seconds|1.5069      |1.45      |2.18506943|
> +-------------------+-------------------+-----------------------------+------------+----------+----------+
> {code}
> The currently available approach is
> {code}
> Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
>   .toDF("start", "end")
>   .withColumn("interval [s]", unix_timestamp($"end") - unix_timestamp($"start"))
>   .withColumn("interval [h]", $"interval [s]" / 3600)
>   .withColumn("rate [€/h]", lit(1.45))
>   .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
>   .show(false)
> {code}
> Going through {{unix_timestamp}} is a hack and it pollutes the SQL query with 
> unrelated semantics (unix timestamp is completely irrelevant for this 
> computation). It is merely there because there is currently no way to access 
> the length of a {{CalendarInterval}}. Dividing an interval by another 
> interval provides a means to measure the length in an arbitrary unit (minutes, 
> hours, quarter hours).
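
A minimal sketch of the division rule described above, using a plain case class rather than Spark's internal CalendarInterval; the names are illustrative only.

{code:scala}
// Division is only defined when both intervals have exactly one non-zero
// component and it is the same component; anything else is ambiguous.
case class Interval(months: Int, days: Int, micros: Long)

def divide(a: Interval, b: Interval): Option[Double] = (a, b) match {
  case (Interval(m1, 0, 0), Interval(m2, 0, 0)) if m2 != 0 => Some(m1.toDouble / m2)
  case (Interval(0, d1, 0), Interval(0, d2, 0)) if d2 != 0 => Some(d1.toDouble / d2)
  case (Interval(0, 0, u1), Interval(0, 0, u2)) if u2 != 0 => Some(u1.toDouble / u2)
  case _ => None  // mixed components or division by zero: undefined
}

// 1 hour 30 minutes 25 seconds divided by 1 hour is roughly 1.5069
val ratio = divide(Interval(0, 0, 5425000000L), Interval(0, 0, 3600000000L))
{code}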



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31056) Add CalendarIntervals division

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31056:


Assignee: Apache Spark

> Add CalendarIntervals division
> --
>
> Key: SPARK-31056
> URL: https://issues.apache.org/jira/browse/SPARK-31056
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Major
>
> {{CalendarInterval}} should be allowed for division. The {{CalendarInterval}} 
> consists of three time components: {{months}}, {{days}} and {{microseconds}}. 
> The division can only be defined between intervals that have a single 
> non-zero time component, while both intervals have the same non-zero time 
> component. Otherwise the division expression would be ambiguous.
> This allows to evaluate the magnitude of {{CalendarInterval}} in SQL 
> expressions:
> {code}
> Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
>   .toDF("start", "end")
>   .withColumn("interval", $"end" - $"start")
>   .withColumn("interval [h]", $"interval" / lit("1 hour").cast(CalendarIntervalType))
>   .withColumn("rate [€/h]", lit(1.45))
>   .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
>   .show(false)
> +-------------------+-------------------+-----------------------------+------------+----------+----------+
> |start              |end                |interval                     |interval [h]|rate [€/h]|price [€] |
> +-------------------+-------------------+-----------------------------+------------+----------+----------+
> |2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25 seconds|1.5069      |1.45      |2.18506943|
> +-------------------+-------------------+-----------------------------+------------+----------+----------+
> {code}
> The currently available approach is
> {code}
> Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
>   .toDF("start", "end")
>   .withColumn("interval [s]", unix_timestamp($"end") - unix_timestamp($"start"))
>   .withColumn("interval [h]", $"interval [s]" / 3600)
>   .withColumn("rate [€/h]", lit(1.45))
>   .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
>   .show(false)
> {code}
> Going through {{unix_timestamp}} is a hack and it pollutes the SQL query with 
> unrelated semantics (unix timestamp is completely irrelevant for this 
> computation). It is merely there because there is currently no way to access 
> the length of a {{CalendarInterval}}. Dividing an interval by another 
> interval provides a means to measure the length in an arbitrary unit (minutes, 
> hours, quarter hours).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30679) REPLACE TABLE can omit the USING clause

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30679:


Assignee: Wenchen Fan  (was: Apache Spark)

> REPLACE TABLE can omit the USING clause
> ---
>
> Key: SPARK-30679
> URL: https://issues.apache.org/jira/browse/SPARK-30679
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30366) Remove Redundant Information for InMemoryTableScan in SQL UI

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30366:


Assignee: (was: Apache Spark)

> Remove Redundant Information for InMemoryTableScan in SQL UI
> 
>
> Key: SPARK-30366
> URL: https://issues.apache.org/jira/browse/SPARK-30366
> Project: Spark
>  Issue Type: Epic
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Max Thompson
>Priority: Minor
>
> All the JIRAs within this epic are follow-ups for 
> https://issues.apache.org/jira/browse/SPARK-29431 
>  
>  This epic contains JIRAs for adding features to how InMemoryTableScan 
> operators and their children are displayed in the SQL tab of the Web UI, 
> aimed at removing redundant information that may confuse the user as to how 
> the query was actually executed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30366) Remove Redundant Information for InMemoryTableScan in SQL UI

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30366:


Assignee: Apache Spark

> Remove Redundant Information for InMemoryTableScan in SQL UI
> 
>
> Key: SPARK-30366
> URL: https://issues.apache.org/jira/browse/SPARK-30366
> Project: Spark
>  Issue Type: Epic
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Max Thompson
>Assignee: Apache Spark
>Priority: Minor
>
> All the JIRAs within this epic are follow-ups for 
> https://issues.apache.org/jira/browse/SPARK-29431 
>  
>  This epic contains JIRAs for adding features to how InMemoryTableScan 
> operators and their children are displayed in the SQL tab of the Web UI, 
> aimed at removing redundant information that may confuse the user as to how 
> the query was actually executed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28330) ANSI SQL: Top-level <result offset clause> in <query expression>

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28330:


Assignee: Apache Spark

> ANSI SQL: Top-level <result offset clause> in <query expression>
> -----------------------------------------------------------------
>
> Key: SPARK-28330
> URL: https://issues.apache.org/jira/browse/SPARK-28330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> h2. {{LIMIT}} and {{OFFSET}}
> LIMIT and OFFSET allow you to retrieve just a portion of the rows that are 
> generated by the rest of the query:
> {noformat}
> SELECT select_list
> FROM table_expression
> [ ORDER BY ... ]
> [ LIMIT { number | ALL } ] [ OFFSET number ]
> {noformat}
> If a limit count is given, no more than that many rows will be returned (but 
> possibly fewer, if the query itself yields fewer rows). LIMIT ALL is the same 
> as omitting the LIMIT clause, as is LIMIT with a NULL argument.
> OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 
> is the same as omitting the OFFSET clause, as is OFFSET with a NULL argument.
> If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting 
> to count the LIMIT rows that are returned.
> https://www.postgresql.org/docs/11/queries-limit.html
> *Feature ID*: F861
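
For illustration, here is how the combination would read once OFFSET is supported (plain LIMIT already works in Spark SQL); the table and column names are made up and an active SparkSession named spark is assumed.

{code:scala}
// Skip the first 20 rows of the ordered result, then return at most 10.
spark.sql(
  """SELECT name, salary
    |FROM employees
    |ORDER BY salary DESC
    |LIMIT 10 OFFSET 20
    |""".stripMargin)
{code}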



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30679) REPLACE TABLE can omit the USING clause

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30679:


Assignee: Apache Spark  (was: Wenchen Fan)

> REPLACE TABLE can omit the USING clause
> ---
>
> Key: SPARK-30679
> URL: https://issues.apache.org/jira/browse/SPARK-30679
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30600) Migrate ALTER VIEW SET/UNSET commands to the new resolution framework

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30600:


Assignee: Apache Spark

> Migrate ALTER VIEW SET/UNSET commands to the new resolution framework
> -
>
> Key: SPARK-30600
> URL: https://issues.apache.org/jira/browse/SPARK-30600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30828) Improve insertInto behaviour

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30828:


Assignee: Apache Spark

> Improve insertInto behaviour
> 
>
> Key: SPARK-30828
> URL: https://issues.apache.org/jira/browse/SPARK-30828
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: German Schiavon Matteo
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, when you call *_insertInto_* to add a DataFrame to an existing 
> table, the only safety check is that the number of columns matches; the 
> column order is not checked at all. The message shown when the number of 
> columns doesn't match is also not very helpful, especially when you have a 
> lot of columns:
> {code:java}
>  org.apache.spark.sql.AnalysisException: `default`.`table` requires that the 
> data to be inserted have the same number of columns as the target table: 
> target table has 2 column(s) but the inserted data has 1 column(s), including 
> 0 partition column(s) having constant value(s).; {code}
> I think a standard column check would be very helpful, just like in almost 
> all other cases in Spark:
>  
> {code:java}
> "cannot resolve 'p2' given input columns: [id, p1];"  
> {code}
>  
>  
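
A small sketch of the positional behaviour described above; the table and column names are made up, and a spark-shell style session (an active SparkSession named spark with implicits available) is assumed.

{code:scala}
import spark.implicits._

spark.sql("CREATE TABLE target (id INT, p1 INT) USING parquet")

// Column order differs from the table, but only the column count is checked.
val df = Seq((1, 2)).toDF("p1", "id")
df.write.insertInto("target")     // accepted: 2 columns vs 2 columns
spark.table("target").show()      // values land by position: id = 1, p1 = 2
{code}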



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30648) Support filters pushdown in JSON datasource

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30648:


Assignee: Apache Spark

> Support filters pushdown in JSON datasource
> ---
>
> Key: SPARK-30648
> URL: https://issues.apache.org/jira/browse/SPARK-30648
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> * Implement the `SupportsPushDownFilters` interface in `JsonScanBuilder`
>  * Apply filters in JacksonParser
>  * Change API JacksonParser - return Option[InternalRow] from 
> `convertObject()` for root JSON fields.
>  * Update JSONBenchmark
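
As a rough sketch of the first bullet, a ScanBuilder can accept pushed filters through the DataSource V2 SupportsPushDownFilters interface; the class below is illustrative and is not the actual JsonScanBuilder change.

{code:scala}
import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownFilters}
import org.apache.spark.sql.sources.{Filter, IsNotNull}
import org.apache.spark.sql.types.StructType

class ExampleScanBuilder extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // Keep the filters we can evaluate while parsing and hand the rest back,
  // so Spark still applies them after the scan. IsNotNull is just a
  // placeholder for "supported" here.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition(_.isInstanceOf[IsNotNull])
    pushed = supported
    unsupported
  }

  override def pushedFilters(): Array[Filter] = pushed

  // Stub scan; a real implementation would build a JSON scan using `pushed`.
  override def build(): Scan = new Scan {
    override def readSchema(): StructType = new StructType()
  }
}
{code}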



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30666) Reliable single-stage accumulators

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30666:


Assignee: Apache Spark

> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code produces identical 
> accumulator increments on success. Rerunning partitions for any reason should 
> always produce the same increments per partition on success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are only merged into the accumulator on driver side for the first time per 
> partition. This is useful for accumulators registered with 
> {{countFailedValues == false}}. Hence, the accumulator aggregates all 
> successful partitions only once.
> The implementations require extra memory that scales with the number of 
> partitions.
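
A minimal sketch of the idea (not the proposed implementation), written as an AccumulatorV2 that records increments per partition and counts each partition only once when merging on the driver; as noted above, the extra memory scales with the number of partitions.

{code:scala}
import scala.collection.mutable
import org.apache.spark.TaskContext
import org.apache.spark.util.AccumulatorV2

class OncePerPartitionSum extends AccumulatorV2[Long, Long] {
  private val byPartition = mutable.Map.empty[Int, Long]

  override def isZero: Boolean = byPartition.isEmpty
  override def copy(): OncePerPartitionSum = {
    val c = new OncePerPartitionSum
    c.byPartition ++= byPartition
    c
  }
  override def reset(): Unit = byPartition.clear()

  // On executors: attribute the increment to the running partition.
  override def add(v: Long): Unit = {
    val pid = Option(TaskContext.get()).map(_.partitionId()).getOrElse(-1)
    byPartition(pid) = byPartition.getOrElse(pid, 0L) + v
  }

  // On the driver: ignore increments for partitions already accounted for,
  // which is what keeps a rerun of the same partition from counting twice.
  override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
    case o: OncePerPartitionSum =>
      o.byPartition.foreach { case (pid, inc) =>
        if (!byPartition.contains(pid)) byPartition(pid) = inc
      }
    case _ => throw new UnsupportedOperationException("incompatible accumulator")
  }

  override def value: Long = byPartition.values.sum
}
{code}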



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30705) Improve CaseWhen sub-expression equality

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30705:


Assignee: (was: Apache Spark)

> Improve CaseWhen sub-expression equality
> 
>
> Key: SPARK-30705
> URL: https://issues.apache.org/jira/browse/SPARK-30705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Currently sub-expression elimination only covers the first condition 
> expression, but we can improve it to handle patterns like:
> {code:sql}
> CASE WHEN testUdf(a) > 3 THEN 4
> WHEN testUdf(a) = 3 THEN 3
> WHEN testUdf(a) = 2 THEN 2
> WHEN testUdf(a) = 1 THEN 1
> ELSE 0 END
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30688:


Assignee: (was: Apache Spark)

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.4, 3.0.0
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-----------------------------+
> |unix_timestamp(20201, ww)    |
> +-----------------------------+
> |                         null|
> +-----------------------------+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> +-----------------------------+
> |unix_timestamp(20202, ww)    |
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
>  
>  
> This seems to happen for leap years only. I dug deeper into it, and it seems 
> that Spark uses java.text.SimpleDateFormat and tries to parse the 
> expression here
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
>  but the parse fails: SimpleDateFormat is unable to parse the date and throws 
> an exception for the unparseable date, which Spark handles silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy 
> java.text.SimpleDateFormat but the Java date/time API; the date/time API 
> expects a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27951) ANSI SQL: NTH_VALUE function

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27951:


Assignee: (was: Apache Spark)

> ANSI SQL: NTH_VALUE function
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns 
> {{value}} evaluated at the row that is the {{nth}} row of 
> the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]
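
For reference, a PostgreSQL-style usage of the requested function; Spark does not provide nth_value yet, so this only shows what the call would look like once added, with made-up table and column names and an active SparkSession named spark assumed.

{code:scala}
spark.sql(
  """SELECT dept,
    |       nth_value(salary, 2) OVER (PARTITION BY dept ORDER BY salary DESC) AS second_highest
    |FROM employees
    |""".stripMargin)
{code}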



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30276) Support Filter expression allows simultaneous use of DISTINCT

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30276:


Assignee: (was: Apache Spark)

> Support Filter expression allows simultaneous use of DISTINCT
> -
>
> Key: SPARK-30276
> URL: https://issues.apache.org/jira/browse/SPARK-30276
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> SPARK-27986 only supports Filter expressions without DISTINCT.
> We need to support Filter expressions used together with DISTINCT.
> PostgreSQL supports this:
> {code:java}
> select ten, sum(distinct four) filter (where four > 10) from onek group by 
> ten;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27996) Spark UI redirect will be failed behind the https reverse proxy

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27996:


Assignee: Apache Spark

> Spark UI redirect will be failed behind the https reverse proxy
> ---
>
> Key: SPARK-27996
> URL: https://issues.apache.org/jira/browse/SPARK-27996
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> When the Spark live/history UI is proxied behind a reverse proxy, redirects 
> return the wrong scheme. For example:
> If the reverse proxy is SSL enabled, the request from the client to the 
> reverse proxy is HTTPS, but if Spark's UI itself is not SSL enabled, the 
> request from the reverse proxy to the Spark UI is plain HTTP. Spark treats 
> all requests as HTTP requests, so the redirect URL starts with "http" and the 
> redirect fails on the client side.
> Most reverse proxies add an additional header, "X-Forwarded-Proto", to tell 
> the backend server that the original client request was HTTPS, so Spark 
> should leverage this header to return the correct URL.
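
A sketch of the idea, not the actual Jetty handler change in Spark: trust the proxy-supplied X-Forwarded-Proto header when choosing the scheme for redirect URLs, and fall back to the scheme of the direct request.

{code:scala}
import javax.servlet.http.HttpServletRequest

def effectiveScheme(req: HttpServletRequest): String =
  Option(req.getHeader("X-Forwarded-Proto"))
    .map(_.trim.toLowerCase)
    .filter(s => s == "http" || s == "https")   // ignore malformed values
    .getOrElse(req.getScheme)                   // no proxy header: use the real scheme
{code}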



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30276) Support Filter expression allows simultaneous use of DISTINCT

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30276:


Assignee: Apache Spark

> Support Filter expression allows simultaneous use of DISTINCT
> -
>
> Key: SPARK-30276
> URL: https://issues.apache.org/jira/browse/SPARK-30276
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-27986 only supports Filter expressions without DISTINCT.
> We need to support Filter expressions used together with DISTINCT.
> PostgreSQL supports this:
> {code:java}
> select ten, sum(distinct four) filter (where four > 10) from onek group by 
> ten;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27996) Spark UI redirect will be failed behind the https reverse proxy

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27996:


Assignee: (was: Apache Spark)

> Spark UI redirect will be failed behind the https reverse proxy
> ---
>
> Key: SPARK-27996
> URL: https://issues.apache.org/jira/browse/SPARK-27996
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: Saisai Shao
>Priority: Minor
>
> When the Spark live/history UI is proxied behind a reverse proxy, redirects 
> return the wrong scheme. For example:
> If the reverse proxy is SSL enabled, the request from the client to the 
> reverse proxy is HTTPS, but if Spark's UI itself is not SSL enabled, the 
> request from the reverse proxy to the Spark UI is plain HTTP. Spark treats 
> all requests as HTTP requests, so the redirect URL starts with "http" and the 
> redirect fails on the client side.
> Most reverse proxies add an additional header, "X-Forwarded-Proto", to tell 
> the backend server that the original client request was HTTPS, so Spark 
> should leverage this header to return the correct URL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29157:


Assignee: (was: Apache Spark)

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -
>
> Key: SPARK-29157
> URL: https://issues.apache.org/jira/browse/SPARK-29157
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> After SPARK-28612 is committed, we need to add the corresponding PySpark API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30621) Dynamic Pruning thread propagates the localProperties to task

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30621:


Assignee: Apache Spark

> Dynamic Pruning thread propagates the localProperties to task
> -
>
> Key: SPARK-30621
> URL: https://issues.apache.org/jira/browse/SPARK-30621
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Assignee: Apache Spark
>Priority: Major
>
> Local properties set via SparkContext are not available as TaskContext 
> properties when executing parallel jobs and the thread pools have idle threads.
> Explanation:
> When executing parallel jobs via SubqueryBroadcastExec, the 
> {{relationFuture}} is evaluated on a separate thread. The threads inherit 
> the {{localProperties}} from SparkContext because they are its child threads.
> These threads are managed by the executionContext (thread pools). Each 
> thread pool keeps idle threads alive for a default {{keepAliveSeconds}} of 60 seconds.
> In scenarios where the thread pool has idle threads that are reused for a 
> subsequent query, the thread-local properties will not be inherited from the 
> SparkContext (thread properties are inherited only on thread creation), so the 
> threads end up with old or no properties set. This causes task-set properties 
> to be missing when they are transferred by the child thread.
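
A hedged sketch of the fix's idea: capture the relevant local properties on the submitting thread and re-install them on the pooled thread before triggering the job, so a reused idle thread does not run with stale properties. The helper name and property keys are illustrative; the real change lives inside SubqueryBroadcastExec.

{code:scala}
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.SparkContext

def runWithCallerProps[T](sc: SparkContext, keys: Seq[String])(body: => T)
                         (implicit ec: ExecutionContext): Future[T] = {
  // Captured on the submitting (caller) thread.
  val captured = keys.map(k => k -> sc.getLocalProperty(k))
  Future {
    // Re-installed on whatever pool thread picks up the work.
    captured.foreach { case (k, v) => if (v != null) sc.setLocalProperty(k, v) }
    body
  }
}

// e.g. runWithCallerProps(sc, Seq("spark.scheduler.pool", "spark.jobGroup.id")) { ... }
{code}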



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29157:


Assignee: Apache Spark

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -
>
> Key: SPARK-29157
> URL: https://issues.apache.org/jira/browse/SPARK-29157
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-28612 is committed, we need to add the corresponding PySpark API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30705) Improve CaseWhen sub-expression equality

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30705:


Assignee: Apache Spark

> Improve CaseWhen sub-expression equality
> 
>
> Key: SPARK-30705
> URL: https://issues.apache.org/jira/browse/SPARK-30705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently sub-expression elimination only covers the first condition 
> expression, but we can improve it to handle patterns like:
> {code:sql}
> CASE WHEN testUdf(a) > 3 THEN 4
> WHEN testUdf(a) = 3 THEN 3
> WHEN testUdf(a) = 2 THEN 2
> WHEN testUdf(a) = 1 THEN 1
> ELSE 0 END
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30600) Migrate ALTER VIEW SET/UNSET commands to the new resolution framework

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30600:


Assignee: (was: Apache Spark)

> Migrate ALTER VIEW SET/UNSET commands to the new resolution framework
> -
>
> Key: SPARK-30600
> URL: https://issues.apache.org/jira/browse/SPARK-30600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24884) Implement regexp_extract_all

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24884:


Assignee: (was: Apache Spark)

> Implement regexp_extract_all
> 
>
> Key: SPARK-24884
> URL: https://issues.apache.org/jira/browse/SPARK-24884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nick Nicolini
>Priority: Major
>
> I've recently hit many cases of regexp parsing where we need to match on 
> something that is always arbitrary in length; for example, a text block that 
> looks something like:
> {code:java}
> AAA:WORDS|
> BBB:TEXT|
> MSG:ASDF|
> MSG:QWER|
> ...
> MSG:ZXCV|{code}
> Where I need to pull out all values between "MSG:" and "|", which can occur 
> in each instance between 1 and n times. I cannot reliably use the existing 
> {{regexp_extract}} method since the number of occurrences is always 
> arbitrary, and while I can write a UDF to handle this it'd be great if this 
> was supported natively in Spark.
> Perhaps we can implement something like {{regexp_extract_all}} as 
> [Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
> [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
>  have?
>  
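
For illustration, the UDF workaround mentioned above might look like the sketch below (the column name and regex are just examples); the built-in regexp_extract_all proposed here would make it unnecessary.

{code:scala}
import org.apache.spark.sql.functions.udf

// Extract every value between "MSG:" and "|" from a text block.
val extractAllMsgs = udf { (s: String) =>
  if (s == null) Seq.empty[String]
  else "MSG:(.*?)\\|".r.findAllMatchIn(s).map(_.group(1)).toSeq
}

// df.withColumn("msgs", extractAllMsgs(col("raw")))  // e.g. ["ASDF", "QWER", "ZXCV"]
{code}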



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30821) Executor pods with multiple containers will not be rescheduled unless all containers fail

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30821:


Assignee: Apache Spark

> Executor pods with multiple containers will not be rescheduled unless all 
> containers fail
> -
>
> Key: SPARK-30821
> URL: https://issues.apache.org/jira/browse/SPARK-30821
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Assignee: Apache Spark
>Priority: Major
>
> Since the restart policy of launched pods is Never, additional handling is 
> required for pods that may have sidecar containers. The executor should be 
> considered failed if any containers have terminated and have a non-zero exit 
> code, but Spark currently only checks the pod phase. The pod phase will 
> remain "running" as long as _any_ pods are still running. Kubernetes sidecar 
> support in 1.18/1.19 does not address this situation, as sidecar containers 
> are excluded from pod phase calculation.
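
A sketch of the proposed check using the fabric8 model that Spark's Kubernetes backend builds on: treat the executor pod as failed if any container has terminated with a non-zero exit code, rather than relying on the pod phase alone. The getters are the fabric8 accessors as I understand them; the helper itself is illustrative.

{code:scala}
import scala.collection.JavaConverters._
import io.fabric8.kubernetes.api.model.Pod

def anyContainerFailed(pod: Pod): Boolean =
  Option(pod.getStatus).toSeq
    .flatMap(s => Option(s.getContainerStatuses).map(_.asScala).getOrElse(Nil))
    .exists { status =>
      Option(status.getState)
        .flatMap(state => Option(state.getTerminated))       // only terminated containers
        .exists(t => Option(t.getExitCode).exists(_.intValue() != 0))  // non-zero exit
    }
{code}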



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30664) Add more metrics to the all stages page

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30664:


Assignee: Apache Spark

> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Minor
> Attachments: image-2020-01-28-16-12-49-807.png, 
> image-2020-01-28-16-13-36-174.png, image-2020-01-28-16-15-20-258.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
>  - Peak Execution Memory
>  - Spill (Memory)
>  - Spill (Disk)
>  - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under 
> !image-2020-01-28-16-12-49-807.png! . Those new metrics on the all stages 
> page should also be made optional in the same way.
> !image-2020-01-28-16-13-36-174.png!
> Existing metrics like
>  - Input
>  - Output
>  - Shuffle Read
>  - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.
> The table extends as additional metrics get checked / unchecked:
> !image-2020-01-28-16-15-20-258.png!
> Sorting the table by these metrics makes it easy to find the stages with the 
> highest GC time or spilled bytes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30648) Support filters pushdown in JSON datasource

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30648:


Assignee: (was: Apache Spark)

> Support filters pushdown in JSON datasource
> ---
>
> Key: SPARK-30648
> URL: https://issues.apache.org/jira/browse/SPARK-30648
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> * Implement the `SupportsPushDownFilters` interface in `JsonScanBuilder`
>  * Apply filters in JacksonParser
>  * Change API JacksonParser - return Option[InternalRow] from 
> `convertObject()` for root JSON fields.
>  * Update JSONBenchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24884) Implement regexp_extract_all

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24884:


Assignee: Apache Spark

> Implement regexp_extract_all
> 
>
> Key: SPARK-24884
> URL: https://issues.apache.org/jira/browse/SPARK-24884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nick Nicolini
>Assignee: Apache Spark
>Priority: Major
>
> I've recently hit many cases of regexp parsing where we need to match on 
> something that is always arbitrary in length; for example, a text block that 
> looks something like:
> {code:java}
> AAA:WORDS|
> BBB:TEXT|
> MSG:ASDF|
> MSG:QWER|
> ...
> MSG:ZXCV|{code}
> Where I need to pull out all values between "MSG:" and "|", which can occur 
> in each instance between 1 and n times. I cannot reliably use the existing 
> {{regexp_extract}} method since the number of occurrences is always 
> arbitrary, and while I can write a UDF to handle this it'd be great if this 
> was supported natively in Spark.
> Perhaps we can implement something like {{regexp_extract_all}} as 
> [Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
> [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
>  have?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28325) Support ANSI SQL:SIMILAR TO ... ESCAPE syntax

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28325:


Assignee: Apache Spark

> Support ANSI SQL:SIMILAR TO ... ESCAPE syntax
> -
>
> Key: SPARK-28325
> URL: https://issues.apache.org/jira/browse/SPARK-28325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
>  ::=
>  
>  ::=
> [ NOT ] SIMILAR TO  [ ESCAPE  ]
>  ::=
> 
>  ::=
> 
> |   
>  ::=
> 
> |  
>  ::=
> 
> |  
> |  
> |  
> |  
>  ::=
>   [  ] 
>  ::=
>  [  ]
>  ::=
> 
>  ::=
> 
>  ::=
> 
> | 
> | 
> |   
>  ::=
> 
> | 
>  ::=
> !! See the Syntax Rules.
> 494 Foundation (SQL/Foundation)
> CD 9075-2:201?(E)
> 8.6 
>  ::=
> !! See the Syntax Rules.
>  ::=
> 
> |  ... 
> |   ... 
> |  ...
>  ... 
>  ::=
> 
>  ::=
> 
>  ::=
> 
> |   
> |  bracket>
>  ::=
> {code}
>  
>  Examples:
> {code}
> SELECT 'abc' RLIKE '%(b|d)%';      // false
> SELECT 'abc' SIMILAR TO '%(b|d)%'   // true
> SELECT 'abc' RLIKE '(b|c)%';  // false
> SELECT 'abc' SIMILAR TO '(b|c)%'; // false{code}
>  
> Currently, the following DBMSs support the syntax:
>  * 
> PostgreSQL:[https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP]
>  * Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/pattern-matching-conditions-similar-to.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28325) Support ANSI SQL:SIMILAR TO ... ESCAPE syntax

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28325:


Assignee: (was: Apache Spark)

> Support ANSI SQL:SIMILAR TO ... ESCAPE syntax
> -
>
> Key: SPARK-28325
> URL: https://issues.apache.org/jira/browse/SPARK-28325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
>  ::=
>  
>  ::=
> [ NOT ] SIMILAR TO  [ ESCAPE  ]
>  ::=
> 
>  ::=
> 
> |   
>  ::=
> 
> |  
>  ::=
> 
> |  
> |  
> |  
> |  
>  ::=
>   [  ] 
>  ::=
>  [  ]
>  ::=
> 
>  ::=
> 
>  ::=
> 
> | 
> | 
> |   
>  ::=
> 
> | 
>  ::=
> !! See the Syntax Rules.
> 494 Foundation (SQL/Foundation)
> CD 9075-2:201?(E)
> 8.6 
>  ::=
> !! See the Syntax Rules.
>  ::=
> 
> |  ... 
> |   ... 
> |  ...
>  ... 
>  ::=
> 
>  ::=
> 
>  ::=
> 
> |   
> |  bracket>
>  ::=
> {code}
>  
>  Examples:
> {code}
> SELECT 'abc' RLIKE '%(b|d)%';      // false
> SELECT 'abc' SIMILAR TO '%(b|d)%'   // true
> SELECT 'abc' RLIKE '(b|c)%';  // false
> SELECT 'abc' SIMILAR TO '(b|c)%'; // false{code}
>  
> Currently, the following DBMSs support the syntax:
>  * 
> PostgreSQL:[https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP]
>  * Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/pattern-matching-conditions-similar-to.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30666) Reliable single-stage accumulators

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30666:


Assignee: (was: Apache Spark)

> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code produces identical 
> accumulator increments on success. Rerunning partitions for any reason should 
> always produce the same increments per partition on success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are only merged into the accumulator on driver side for the first time per 
> partition. This is useful for accumulators registered with 
> {{countFailedValues == false}}. Hence, the accumulator aggregates all 
> successful partitions only once.
> The implementations require extra memory that scales with the number of 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30664) Add more metrics to the all stages page

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30664:


Assignee: (was: Apache Spark)

> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Minor
> Attachments: image-2020-01-28-16-12-49-807.png, 
> image-2020-01-28-16-13-36-174.png, image-2020-01-28-16-15-20-258.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
>  - Peak Execution Memory
>  - Spill (Memory)
>  - Spill (Disk)
>  - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under 
> !image-2020-01-28-16-12-49-807.png! . Those new metrics on the all stages 
> page should also be made optional in the same way.
> !image-2020-01-28-16-13-36-174.png!
> Existing metrics like
>  - Input
>  - Output
>  - Shuffle Read
>  - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.
> The table extends as additional metrics get checked / unchecked:
> !image-2020-01-28-16-15-20-258.png!
> Sorting the table by these metrics makes it easy to find the stages with the 
> highest GC time or spilled bytes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27951) ANSI SQL: NTH_VALUE function

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27951:


Assignee: Apache Spark

> ANSI SQL: NTH_VALUE function
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Assignee: Apache Spark
>Priority: Major
>
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns 
> {{value}} evaluated at the row that is the {{nth}} row of 
> the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30768) Constraints inferred from inequality attributes

2020-05-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30768:


Assignee: Apache Spark

> Constraints inferred from inequality attributes
> ---
>
> Key: SPARK-30768
> URL: https://issues.apache.org/jira/browse/SPARK-30768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> How to reproduce:
> {code:sql}
> create table SPARK_30768_1(c1 int, c2 int);
> create table SPARK_30768_2(c1 int, c2 int);
> {code}
> *Spark SQL*:
> {noformat}
> spark-sql> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on 
> (t1.c1 > t2.c1) where t1.c1 = 3;
> == Physical Plan ==
> *(3) Project [c1#5, c2#6]
> +- BroadcastNestedLoopJoin BuildRight, Inner, (c1#5 > c1#7)
>:- *(1) Project [c1#5, c2#6]
>:  +- *(1) Filter (isnotnull(c1#5) AND (c1#5 = 3))
>: +- *(1) ColumnarToRow
>:+- FileScan parquet default.spark_30768_1[c1#5,c2#6] Batched: 
> true, DataFilters: [isnotnull(c1#5), (c1#5 = 3)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,3)], 
> ReadSchema: struct
>+- BroadcastExchange IdentityBroadcastMode, [id=#60]
>   +- *(2) Project [c1#7]
>  +- *(2) Filter isnotnull(c1#7)
> +- *(2) ColumnarToRow
>+- FileScan parquet default.spark_30768_2[c1#7] Batched: true, 
> DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: 
> struct
> {noformat}
> *Hive* supports this feature:
> {noformat}
> hive> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on 
> (t1.c1 > t2.c1) where t1.c1 = 3;
> Warning: Map Join MAPJOIN[13][bigTable=?] in task 'Stage-3:MAPRED' is a cross 
> product
> OK
> STAGE DEPENDENCIES:
>   Stage-4 is a root stage
>   Stage-3 depends on stages: Stage-4
>   Stage-0 depends on stages: Stage-3
> STAGE PLANS:
>   Stage: Stage-4
> Map Reduce Local Work
>   Alias -> Map Local Tables:
> $hdt$_0:t1
>   Fetch Operator
> limit: -1
>   Alias -> Map Local Operator Tree:
> $hdt$_0:t1
>   TableScan
> alias: t1
> filterExpr: (c1 = 3) (type: boolean)
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
> stats: NONE
> Filter Operator
>   predicate: (c1 = 3) (type: boolean)
>   Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
>   Select Operator
> expressions: c2 (type: int)
> outputColumnNames: _col1
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
> HashTable Sink Operator
>   keys:
> 0
> 1
>   Stage: Stage-3
> Map Reduce
>   Map Operator Tree:
>   TableScan
> alias: t2
> filterExpr: (c1 < 3) (type: boolean)
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
> stats: NONE
> Filter Operator
>   predicate: (c1 < 3) (type: boolean)
>   Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
>   Select Operator
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
> Map Join Operator
>   condition map:
>Inner Join 0 to 1
>   keys:
> 0
> 1
>   outputColumnNames: _col1
>   Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL 
> Column stats: NONE
>   Select Operator
> expressions: 3 (type: int), _col1 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL 
> Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> PARTIAL Column stats: NONE
>   table:
>   input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>   serde: 
> org.apache.hadoop.hive.serde2.lazy

  1   2   >