[jira] [Comment Edited] (SPARK-33582) Hive partition pruning support not-equals

2020-11-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239919#comment-17239919
 ] 

Yuming Wang edited comment on SPARK-33582 at 11/28/20, 7:52 AM:


This change should come after SPARK-33581. I have prepared the PR: 
https://github.com/wangyum/spark/tree/SPARK-33582


was (Author: q79969786):
This changed should after SPARK-33581, I have prepared the pr: 
https://github.com/wangyum/spark/tree/SPARK-33582

> Hive partition pruning support not-equals
> -
>
> Key: SPARK-33582
> URL: https://issues.apache.org/jira/browse/SPARK-33582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207
> https://issues.apache.org/jira/browse/HIVE-2702






[jira] [Commented] (SPARK-33582) Hive partition pruning support not-equals

2020-11-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239919#comment-17239919
 ] 

Yuming Wang commented on SPARK-33582:
-

This change should come after SPARK-33581. I have prepared the PR: 
https://github.com/wangyum/spark/tree/SPARK-33582

> Hive partition pruning support not-equals
> -
>
> Key: SPARK-33582
> URL: https://issues.apache.org/jira/browse/SPARK-33582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207
> https://issues.apache.org/jira/browse/HIVE-2702






[jira] [Created] (SPARK-33582) Hive partition pruning support not-equals

2020-11-27 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33582:
---

 Summary: Hive partition pruning support not-equals
 Key: SPARK-33582
 URL: https://issues.apache.org/jira/browse/SPARK-33582
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang
Assignee: Yuming Wang


https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207

https://issues.apache.org/jira/browse/HIVE-2702
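As an illustration of what this sub-task targets (a hedged sketch based on the issue title and the linked Hive metastore test, not the final patch), a not-equals predicate on a partition column should become prunable by the Hive metastore instead of forcing a full partition listing. Table and column names below are made up:

{code:scala}
// spark-shell sketch; assumes a Hive-enabled session and an existing partitioned
// table `t` with a string partition column `pt` (both names are hypothetical).
spark.sql("SELECT * FROM t WHERE pt <> '20201127'").explain(true)
// Assumed current behaviour: the <> / != predicate is not converted into a
// metastore filter string, so every partition of `t` is fetched and filtered on
// the Spark side. With this change the metastore should prune the non-matching
// partitions directly, as in the linked TestHiveMetaStore case.
{code}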






[jira] [Commented] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239887#comment-17239887
 ] 

Apache Spark commented on SPARK-33581:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30525

> Refactor HivePartitionFilteringSuite
> 
>
> Key: SPARK-33581
> URL: https://issues.apache.org/jira/browse/SPARK-33581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Refactor HivePartitionFilteringSuite, to make it easy to maintain.






[jira] [Assigned] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33581:


Assignee: Apache Spark

> Refactor HivePartitionFilteringSuite
> 
>
> Key: SPARK-33581
> URL: https://issues.apache.org/jira/browse/SPARK-33581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Refactor HivePartitionFilteringSuite, to make it easy to maintain.






[jira] [Assigned] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33581:


Assignee: (was: Apache Spark)

> Refactor HivePartitionFilteringSuite
> 
>
> Key: SPARK-33581
> URL: https://issues.apache.org/jira/browse/SPARK-33581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Refactor HivePartitionFilteringSuite, to make it easy to maintain.






[jira] [Updated] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33581:

Description: Refactor HivePartitionFilteringSuite, to make it easy to 
maintain.  (was: Refactor HivePartitionFilteringSuite, to make it easy 
maintain.)

> Refactor HivePartitionFilteringSuite
> 
>
> Key: SPARK-33581
> URL: https://issues.apache.org/jira/browse/SPARK-33581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Refactor HivePartitionFilteringSuite, to make it easy to maintain.






[jira] [Created] (SPARK-33581) Refactor HivePartitionFilteringSuite

2020-11-27 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33581:
---

 Summary: Refactor HivePartitionFilteringSuite
 Key: SPARK-33581
 URL: https://issues.apache.org/jira/browse/SPARK-33581
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.1.0
Reporter: Yuming Wang


Refactor HivePartitionFilteringSuite, to make it easy to maintain.






[jira] [Commented] (SPARK-33580) resolveDependencyPaths should use classifier attribute of artifact

2020-11-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239882#comment-17239882
 ] 

Apache Spark commented on SPARK-33580:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30524

> resolveDependencyPaths should use classifier attribute of artifact
> --
>
> Key: SPARK-33580
> URL: https://issues.apache.org/jira/browse/SPARK-33580
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> `resolveDependencyPaths` now takes artifact type to decide to add "-tests" 
> postfix. However, the path pattern of ivy in `resolveMavenCoordinates` is 
> "[organization]_[artifact]-[revision](-[classifier]).[ext]". We should use 
> classifier instead of type to construct file path.






[jira] [Assigned] (SPARK-33580) resolveDependencyPaths should use classifier attribute of artifact

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33580:


Assignee: L. C. Hsieh  (was: Apache Spark)

> resolveDependencyPaths should use classifier attribute of artifact
> --
>
> Key: SPARK-33580
> URL: https://issues.apache.org/jira/browse/SPARK-33580
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> `resolveDependencyPaths` now takes artifact type to decide to add "-tests" 
> postfix. However, the path pattern of ivy in `resolveMavenCoordinates` is 
> "[organization]_[artifact]-[revision](-[classifier]).[ext]". We should use 
> classifier instead of type to construct file path.






[jira] [Assigned] (SPARK-33580) resolveDependencyPaths should use classifier attribute of artifact

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33580:


Assignee: Apache Spark  (was: L. C. Hsieh)

> resolveDependencyPaths should use classifier attribute of artifact
> --
>
> Key: SPARK-33580
> URL: https://issues.apache.org/jira/browse/SPARK-33580
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> `resolveDependencyPaths` now takes artifact type to decide to add "-tests" 
> postfix. However, the path pattern of ivy in `resolveMavenCoordinates` is 
> "[organization]_[artifact]-[revision](-[classifier]).[ext]". We should use 
> classifier instead of type to construct file path.






[jira] [Commented] (SPARK-33580) resolveDependencyPaths should use classifier attribute of artifact

2020-11-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239881#comment-17239881
 ] 

Apache Spark commented on SPARK-33580:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30524

> resolveDependencyPaths should use classifier attribute of artifact
> --
>
> Key: SPARK-33580
> URL: https://issues.apache.org/jira/browse/SPARK-33580
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> `resolveDependencyPaths` now takes artifact type to decide to add "-tests" 
> postfix. However, the path pattern of ivy in `resolveMavenCoordinates` is 
> "[organization]_[artifact]-[revision](-[classifier]).[ext]". We should use 
> classifier instead of type to construct file path.






[jira] [Updated] (SPARK-33580) resolveDependencyPaths should use classifier attribute of artifact

2020-11-27 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-33580:

Description: `resolveDependencyPaths` now takes artifact type to decide to 
add "-tests" postfix. However, the path pattern of ivy in 
`resolveMavenCoordinates` is 
"[organization]_[artifact]-[revision](-[classifier]).[ext]". We should use 
classifier instead of type to construct file path.  (was: 
`resolveDependencyPaths` now takes artifact type to decide to add -tests 
postfix. However, the path pattern of ivy in `resolveMavenCoordinates` is 
"[organization]_[artifact]-[revision](-[classifier]).[ext]". We should use 
classifier instead of type to construct file path.)

> resolveDependencyPaths should use classifier attribute of artifact
> --
>
> Key: SPARK-33580
> URL: https://issues.apache.org/jira/browse/SPARK-33580
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> `resolveDependencyPaths` now takes artifact type to decide to add "-tests" 
> postfix. However, the path pattern of ivy in `resolveMavenCoordinates` is 
> "[organization]_[artifact]-[revision](-[classifier]).[ext]". We should use 
> classifier instead of type to construct file path.






[jira] [Created] (SPARK-33580) resolveDependencyPaths should use classifier attribute of artifact

2020-11-27 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33580:
---

 Summary: resolveDependencyPaths should use classifier attribute of 
artifact
 Key: SPARK-33580
 URL: https://issues.apache.org/jira/browse/SPARK-33580
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


`resolveDependencyPaths` currently uses the artifact type to decide whether to add the "-tests" postfix. However, the Ivy path pattern in `resolveMavenCoordinates` is "[organization]_[artifact]-[revision](-[classifier]).[ext]", so we should use the classifier instead of the type to construct the file path.
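A small sketch of the mismatch described above (illustrative only; `ArtifactInfo` and `ivyCachePath` are made-up names and the cache root is hypothetical), runnable in a Scala REPL:

{code:scala}
// The Ivy retrieve pattern quoted above only ever emits the *classifier*
// (e.g. "tests"), never the artifact *type* (e.g. "test-jar"), in the file name.
case class ArtifactInfo(organization: String, name: String, revision: String,
                        classifier: Option[String], ext: String = "jar")

def ivyCachePath(root: String, a: ArtifactInfo): String = {
  val classifierPart = a.classifier.map("-" + _).getOrElse("")
  s"$root/jars/${a.organization}_${a.name}-${a.revision}$classifierPart.${a.ext}"
}

// A tests dependency is written as org.example_foo-1.0-tests.jar, so deriving the
// "-tests" suffix from the artifact type instead of the classifier can miss the file.
println(ivyCachePath("/tmp/ivy-output", ArtifactInfo("org.example", "foo", "1.0", Some("tests"))))
{code}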






[jira] [Commented] (SPARK-33579) Executors blank page behind proxy

2020-11-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239782#comment-17239782
 ] 

Apache Spark commented on SPARK-33579:
--

User 'pgillet' has created a pull request for this issue:
https://github.com/apache/spark/pull/30523

> Executors blank page behind proxy
> -
>
> Key: SPARK-33579
> URL: https://issues.apache.org/jira/browse/SPARK-33579
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1 on Kubernetes
>Reporter: Pascal GILLET
>Priority: Minor
>  Labels: core, ui
>
> When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), 
> executors page is blank.
> In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  
> should avoid the use of location.origin when constructing URLs for internal 
> API calls within the JavaScript.
>  Instead, we should use {{apiRoot}} global variable.
> On one hand, it would allow to build relative URLs. On the other hand, 
> {{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be 
> set to change the root path of the Web UI.
> If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, 
> and we end up with an executors blank page.
>  I encounter this bug when accessing the Web UI behind a proxy (in my case a 
> Kubernetes Ingress).
>  
> See also 
> [https://github.com/jupyterhub/jupyter-server-proxy/issues/57#issuecomment-699163115]






[jira] [Assigned] (SPARK-33579) Executors blank page behind proxy

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33579:


Assignee: (was: Apache Spark)

> Executors blank page behind proxy
> -
>
> Key: SPARK-33579
> URL: https://issues.apache.org/jira/browse/SPARK-33579
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1 on Kubernetes
>Reporter: Pascal GILLET
>Priority: Minor
>  Labels: core, ui
>
> When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), 
> executors page is blank.
> In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  
> should avoid the use of location.origin when constructing URLs for internal 
> API calls within the JavaScript.
>  Instead, we should use {{apiRoot}} global variable.
> On one hand, it would allow to build relative URLs. On the other hand, 
> {{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be 
> set to change the root path of the Web UI.
> If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, 
> and we end up with an executors blank page.
>  I encounter this bug when accessing the Web UI behind a proxy (in my case a 
> Kubernetes Ingress).
>  
> See also 
> [https://github.com/jupyterhub/jupyter-server-proxy/issues/57#issuecomment-699163115]






[jira] [Commented] (SPARK-33579) Executors blank page behind proxy

2020-11-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239780#comment-17239780
 ] 

Apache Spark commented on SPARK-33579:
--

User 'pgillet' has created a pull request for this issue:
https://github.com/apache/spark/pull/30523

> Executors blank page behind proxy
> -
>
> Key: SPARK-33579
> URL: https://issues.apache.org/jira/browse/SPARK-33579
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1 on Kubernetes
>Reporter: Pascal GILLET
>Priority: Minor
>  Labels: core, ui
>
> When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), 
> executors page is blank.
> In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  
> should avoid the use of location.origin when constructing URLs for internal 
> API calls within the JavaScript.
>  Instead, we should use {{apiRoot}} global variable.
> On one hand, it would allow to build relative URLs. On the other hand, 
> {{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be 
> set to change the root path of the Web UI.
> If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, 
> and we end up with an executors blank page.
>  I encounter this bug when accessing the Web UI behind a proxy (in my case a 
> Kubernetes Ingress).
>  
> See also 
> [https://github.com/jupyterhub/jupyter-server-proxy/issues/57#issuecomment-699163115]






[jira] [Assigned] (SPARK-33579) Executors blank page behind proxy

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33579:


Assignee: Apache Spark

> Executors blank page behind proxy
> -
>
> Key: SPARK-33579
> URL: https://issues.apache.org/jira/browse/SPARK-33579
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1 on Kubernetes
>Reporter: Pascal GILLET
>Assignee: Apache Spark
>Priority: Minor
>  Labels: core, ui
>
> When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), 
> executors page is blank.
> In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  
> should avoid the use of location.origin when constructing URLs for internal 
> API calls within the JavaScript.
>  Instead, we should use {{apiRoot}} global variable.
> On one hand, it would allow to build relative URLs. On the other hand, 
> {{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be 
> set to change the root path of the Web UI.
> If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, 
> and we end up with an executors blank page.
>  I encounter this bug when accessing the Web UI behind a proxy (in my case a 
> Kubernetes Ingress).
>  
> See also 
> [https://github.com/jupyterhub/jupyter-server-proxy/issues/57#issuecomment-699163115]






[jira] [Updated] (SPARK-33579) Executors blank page behind proxy

2020-11-27 Thread Pascal GILLET (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal GILLET updated SPARK-33579:
--
Description: 
When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), the executors page is blank.

In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we should avoid using location.origin when constructing URLs for internal API calls within the JavaScript. Instead, we should use the {{apiRoot}} global variable.

On the one hand, this would allow building relative URLs. On the other hand, {{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}}, which can be set to change the root path of the Web UI.

If {{spark.ui.proxyBase}} is actually set, the original URLs become incorrect, and we end up with a blank executors page. I encountered this bug when accessing the Web UI behind a proxy (in my case a Kubernetes Ingress).

See also 
[https://github.com/jupyterhub/jupyter-server-proxy/issues/57#issuecomment-699163115]

  was:
When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), executors 
page is blank.

In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  should 
avoid the use of location.origin when constructing URLs for internal API calls 
within the JavaScript.
 Instead, we should use {{apiRoot}} global variable.

On one hand, it would allow to build relative URLs. On the other hand, 
{{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be set 
to change the root path of the Web UI.

If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, and 
we end up with an executors blank page.
 I encounter this bug when accessing the Web UI behind a proxy (in my case a 
Kubernetes Ingress).


> Executors blank page behind proxy
> -
>
> Key: SPARK-33579
> URL: https://issues.apache.org/jira/browse/SPARK-33579
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1 on Kubernetes
>Reporter: Pascal GILLET
>Priority: Minor
>  Labels: core, ui
>
> When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), 
> executors page is blank.
> In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  
> should avoid the use of location.origin when constructing URLs for internal 
> API calls within the JavaScript.
>  Instead, we should use {{apiRoot}} global variable.
> On one hand, it would allow to build relative URLs. On the other hand, 
> {{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be 
> set to change the root path of the Web UI.
> If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, 
> and we end up with an executors blank page.
>  I encounter this bug when accessing the Web UI behind a proxy (in my case a 
> Kubernetes Ingress).
>  
> See also 
> [https://github.com/jupyterhub/jupyter-server-proxy/issues/57#issuecomment-699163115]






[jira] [Updated] (SPARK-33579) Executors blank page behind proxy

2020-11-27 Thread Pascal GILLET (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pascal GILLET updated SPARK-33579:
--
Description: 
When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), executors 
page is blank.

In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  should 
avoid the use of location.origin when constructing URLs for internal API calls 
within the JavaScript.
 Instead, we should use {{apiRoot}} global variable.

On one hand, it would allow to build relative URLs. On the other hand, 
{{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be set 
to change the root path of the Web UI.

If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, and 
we end up with an executors blank page.
 I encounter this bug when accessing the Web UI behind a proxy (in my case a 
Kubernetes Ingress).

  was:
When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), executors 
page is blank.

In{{ /core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  should 
avoid the use of location.origin when constructing URLs for internal API calls 
within the JavaScript.
Instead, we should use {{apiRoot}} global variable.

On one hand, it would allow to build relative URLs. On the other hand, 
{{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be set 
to change the root path of the Web UI.

If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, and 
we end up with an executors blank page.
I encounter this bug when accessing the Web UI behind a proxy (in my case a 
Kubernetes Ingress).


> Executors blank page behind proxy
> -
>
> Key: SPARK-33579
> URL: https://issues.apache.org/jira/browse/SPARK-33579
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1 on Kubernetes
>Reporter: Pascal GILLET
>Priority: Minor
>  Labels: core, ui
>
> When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), 
> executors page is blank.
> In {{/core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  
> should avoid the use of location.origin when constructing URLs for internal 
> API calls within the JavaScript.
>  Instead, we should use {{apiRoot}} global variable.
> On one hand, it would allow to build relative URLs. On the other hand, 
> {{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be 
> set to change the root path of the Web UI.
> If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, 
> and we end up with an executors blank page.
>  I encounter this bug when accessing the Web UI behind a proxy (in my case a 
> Kubernetes Ingress).






[jira] [Created] (SPARK-33579) Executors blank page behind proxy

2020-11-27 Thread Pascal GILLET (Jira)
Pascal GILLET created SPARK-33579:
-

 Summary: Executors blank page behind proxy
 Key: SPARK-33579
 URL: https://issues.apache.org/jira/browse/SPARK-33579
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1
 Environment: Spark 3.0.1 on Kubernetes
Reporter: Pascal GILLET


When accessing the Web UI behind a proxy (e.g. a Kubernetes ingress), executors 
page is blank.

In{{ /core/src/main/resources/org/apache/spark/ui/static/utils.js}}, we  should 
avoid the use of location.origin when constructing URLs for internal API calls 
within the JavaScript.
Instead, we should use {{apiRoot}} global variable.

On one hand, it would allow to build relative URLs. On the other hand, 
{{apiRoot}} reflects the Spark property {{spark.ui.proxyBase}} which can be set 
to change the root path of the Web UI.

If {{spark.ui.proxyBase}} is actually set, original URLs become incorrect, and 
we end up with an executors blank page.
I encounter this bug when accessing the Web UI behind a proxy (in my case a 
Kubernetes Ingress).
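For context, a minimal sketch of the setup the report describes (the /spark prefix and the application code are made up): with {{spark.ui.proxyBase}} matching the ingress path, server-rendered links work, but front-end code that builds absolute URLs from location.origin bypasses the prefix.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: Web UI served behind a Kubernetes ingress that forwards /spark/*.
val spark = SparkSession.builder()
  .appName("ui-behind-proxy")
  .config("spark.ui.proxyBase", "/spark") // should match the ingress path prefix
  .getOrCreate()

// Server-side pages now emit links under /spark/..., but a script doing
// location.origin + "/api/v1/..." skips the prefix, gets a 404 from the proxy,
// and the executors page renders blank.
{code}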






[jira] [Updated] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-11-27 Thread Simon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon updated SPARK-33571:
--
Description: 
The handling of old dates written with older Spark versions (<2.4.6) using the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working correctly.

From what I understand it should work like this:
 * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
 * Only applies when reading or writing parquet files
 * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInRead`
 * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1 (with, for example, `df.show()`) as they did in Spark 2.4.5
 * When reading parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above-mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1 (with, for example, `df.show()`) than they did in Spark 2.4.5
 * When writing parquet files with Spark > 3.0.0 which contain dates or timestamps before the above-mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for `datetimeRebaseModeInWrite`

First of all, I'm not 100% sure all of this is correct; I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there, and looking at the Spark code.

From our testing we're seeing several issues:
 * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` with timestamps before the above-mentioned moments in time, without `datetimeRebaseModeInRead` set, doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
 * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and contains fields of type `TimestampType` or `DateType` with dates or timestamps before the above-mentioned moments in time, with `datetimeRebaseModeInRead` set to `LEGACY`, results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.

I've made some scripts to help with testing/showing the behavior, using pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here: [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the outputs in a comment below as well.

  was:
The handling of old dates written with older Spark versions (<2.4.6) using the 
hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
correctly.

>From what I understand it should work like this:
 * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
1900-01-01T00:00:00Z
 * Only applies when reading or writing parquet files
 * When reading parquet files written with Spark < 2.4.6 which contain dates or 
timestamps before the above mentioned moments in time a `SparkUpgradeException` 
should be raised informing the user to choose either `LEGACY` or `CORRECTED` 
for the `datetimeRebaseModeInRead`
 * When reading parquet files written with Spark < 2.4.6 which contain dates or 
timestamps before the above mentioned moments in time and 
`datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
show the same values in Spark 3.0.1. with for example `df.show()` as they did 
in Spark 2.4.5
 * When reading parquet files written with Spark < 2.4.6 which contain dates or 
timestamps before the above mentioned moments in time and 
`datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
should show different values in Spark 3.0.1. with for example `df.show()` as 
they did in Spark 2.4.5
 * When writing parqet files with Spark > 3.0.0 which contain dates or 
timestamps before the above mentioned moment in time a `SparkUpgradeException` 
should be raised informing the user to choose either `LEGACY` or `CORRECTED` 
for the `datetimeRebaseModeInWrite`

First of all I'm not 100% sure all of this is correct. I've been unable to find 
any clear documentation on the expected behavior. The understanding I have was 
pieced together from the mailing list 

[jira] [Updated] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-11-27 Thread Simon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon updated SPARK-33571:
--
Component/s: PySpark

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
>  * When writing parqet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compares to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing/show the behavior, it uses 
> pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue.] I'll post the 
> outputs in a comment below as well.
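A condensed sketch of the checks described above (the full config key is assumed from Spark 3.0's SQL configs and the input path is hypothetical; both should be verified against your version):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rebase-mode-check").getOrCreate()
val path = "/data/written-by-spark-2.4.5" // hypothetical Parquet data with pre-1900 timestamps

// Expected per the description: with no mode set, this read should raise
// SparkUpgradeException rather than silently succeeding.
spark.read.parquet(path).show()

// LEGACY should rebase, i.e. show the same values as Spark 2.4.5 did.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
spark.read.parquet(path).show()

// CORRECTED should skip rebasing, i.e. show different values for dates before
// 1582-10-15 and timestamps before 1900-01-01T00:00:00Z.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.read.parquet(path).show()
{code}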






[jira] [Assigned] (SPARK-33141) capture SQL configs when creating permanent views

2020-11-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33141:
---

Assignee: Leanken.Lin

> capture SQL configs when creating permanent views
> -
>
> Key: SPARK-33141
> URL: https://issues.apache.org/jira/browse/SPARK-33141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
>
> TODO






[jira] [Resolved] (SPARK-33141) capture SQL configs when creating permanent views

2020-11-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33141.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30289
[https://github.com/apache/spark/pull/30289]

> capture SQL configs when creating permanent views
> -
>
> Key: SPARK-33141
> URL: https://issues.apache.org/jira/browse/SPARK-33141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> TODO






[jira] [Resolved] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid

2020-11-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33498.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30442
[https://github.com/apache/spark/pull/30442]

> Datetime parsing should fail if the input string can't be parsed, or the 
> pattern string is invalid
> --
>
> Key: SPARK-33498
> URL: https://issues.apache.org/jira/browse/SPARK-33498
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> Datetime parsing should fail if the input string can't be parsed, or the 
> pattern string is invalid, when ANSI mode is enabled. This patch should update 
> GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast
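A quick illustration of the intended behaviour (sketch only; assumes a spark-shell session, and the exact exception type depends on the implementation):

{code:scala}
// With ANSI mode on, an unparsable input or an invalid pattern should fail
// instead of returning NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT to_timestamp('not-a-timestamp', 'yyyy-MM-dd HH:mm:ss')").show()
// expected: the query throws during evaluation

// With ANSI mode off, the existing behaviour of returning NULL is kept.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT to_timestamp('not-a-timestamp', 'yyyy-MM-dd HH:mm:ss')").show()
// expected: a single NULL row
{code}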






[jira] [Assigned] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid

2020-11-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33498:
---

Assignee: Leanken.Lin

> Datetime parsing should fail if the input string can't be parsed, or the 
> pattern string is invalid
> --
>
> Key: SPARK-33498
> URL: https://issues.apache.org/jira/browse/SPARK-33498
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
>
> Datetime parsing should fail if the input string can't be parsed, or the 
> pattern string is invalid, when ANSI mode is enable. This patch should update 
> GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast






[jira] [Commented] (SPARK-33578) enableHiveSupport is invalid after sparkContext that without hive support created

2020-11-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239653#comment-17239653
 ] 

Apache Spark commented on SPARK-33578:
--

User 'yui2010' has created a pull request for this issue:
https://github.com/apache/spark/pull/30522

> enableHiveSupport is invalid after sparkContext that without hive support 
> created 
> --
>
> Key: SPARK-33578
> URL: https://issues.apache.org/jira/browse/SPARK-33578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: steven zhang
>Priority: Minor
> Fix For: 3.1.0
>
>
> reproduce as follow code:
>         SparkConf sparkConf = new SparkConf().setAppName("hello");
>         sparkConf.set("spark.master", "local");
>         JavaSparkContext jssc = new JavaSparkContext(sparkConf);
>         spark = SparkSession.builder()
>                 .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>                 .config("hive.exec.dynamici.partition", 
> true).config("hive.exec.dynamic.partition.mode", "nonstrict")
>                 .config("hive.metastore.uris", "thrift://hivemetastore:9083")
>                 .enableHiveSupport()
>                 .master("local")
>                 .getOrCreate();
>        spark.sql("select * from hudi_db.hudi_test_order").show();
>  
>  it will produce follow Exception    
> AssertionError: assertion failed: No plan for HiveTableRelation 
> [`hudi_db`.`hudi_test_order` … (at current master branch)  
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `hudi_db`.`hudi_test_order`;  (at spark v2.4.4)
>   
>  the reason is SparkContext#getOrCreate(SparkConf) will return activeContext 
> that include previous spark config if it has
> but the input SparkConf is the newest which include previous spark config and 
> options.
>   enableHiveSupport will set options (“spark.sql.catalogImplementation", 
> "hive”) when spark session created it will miss this conf
> SharedState load conf from sparkContext and will miss hive catalog






[jira] [Assigned] (SPARK-33578) enableHiveSupport is invalid after sparkContext that without hive support created

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33578:


Assignee: (was: Apache Spark)

> enableHiveSupport is invalid after sparkContext that without hive support 
> created 
> --
>
> Key: SPARK-33578
> URL: https://issues.apache.org/jira/browse/SPARK-33578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: steven zhang
>Priority: Minor
> Fix For: 3.1.0
>
>
> reproduce as follow code:
>         SparkConf sparkConf = new SparkConf().setAppName("hello");
>         sparkConf.set("spark.master", "local");
>         JavaSparkContext jssc = new JavaSparkContext(sparkConf);
>         spark = SparkSession.builder()
>                 .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>                 .config("hive.exec.dynamici.partition", 
> true).config("hive.exec.dynamic.partition.mode", "nonstrict")
>                 .config("hive.metastore.uris", "thrift://hivemetastore:9083")
>                 .enableHiveSupport()
>                 .master("local")
>                 .getOrCreate();
>        spark.sql("select * from hudi_db.hudi_test_order").show();
>  
>  it will produce follow Exception    
> AssertionError: assertion failed: No plan for HiveTableRelation 
> [`hudi_db`.`hudi_test_order` … (at current master branch)  
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `hudi_db`.`hudi_test_order`;  (at spark v2.4.4)
>   
>  the reason is SparkContext#getOrCreate(SparkConf) will return activeContext 
> that include previous spark config if it has
> but the input SparkConf is the newest which include previous spark config and 
> options.
>   enableHiveSupport will set options (“spark.sql.catalogImplementation", 
> "hive”) when spark session created it will miss this conf
> SharedState load conf from sparkContext and will miss hive catalog






[jira] [Assigned] (SPARK-33578) enableHiveSupport is invalid after sparkContext that without hive support created

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33578:


Assignee: Apache Spark

> enableHiveSupport is invalid after sparkContext that without hive support 
> created 
> --
>
> Key: SPARK-33578
> URL: https://issues.apache.org/jira/browse/SPARK-33578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: steven zhang
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> reproduce as follow code:
>         SparkConf sparkConf = new SparkConf().setAppName("hello");
>         sparkConf.set("spark.master", "local");
>         JavaSparkContext jssc = new JavaSparkContext(sparkConf);
>         spark = SparkSession.builder()
>                 .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>                 .config("hive.exec.dynamici.partition", 
> true).config("hive.exec.dynamic.partition.mode", "nonstrict")
>                 .config("hive.metastore.uris", "thrift://hivemetastore:9083")
>                 .enableHiveSupport()
>                 .master("local")
>                 .getOrCreate();
>        spark.sql("select * from hudi_db.hudi_test_order").show();
>  
>  it will produce follow Exception    
> AssertionError: assertion failed: No plan for HiveTableRelation 
> [`hudi_db`.`hudi_test_order` … (at current master branch)  
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `hudi_db`.`hudi_test_order`;  (at spark v2.4.4)
>   
>  the reason is SparkContext#getOrCreate(SparkConf) will return activeContext 
> that include previous spark config if it has
> but the input SparkConf is the newest which include previous spark config and 
> options.
>   enableHiveSupport will set options (“spark.sql.catalogImplementation", 
> "hive”) when spark session created it will miss this conf
> SharedState load conf from sparkContext and will miss hive catalog






[jira] [Updated] (SPARK-33578) enableHiveSupport is invalid after sparkContext that without hive support created

2020-11-27 Thread steven zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

steven zhang updated SPARK-33578:
-
Description: 
Reproduce with the following code:

        SparkConf sparkConf = new SparkConf().setAppName("hello");
        sparkConf.set("spark.master", "local");
        JavaSparkContext jssc = new JavaSparkContext(sparkConf);
        SparkSession spark = SparkSession.builder()
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("hive.exec.dynamici.partition", true)
                .config("hive.exec.dynamic.partition.mode", "nonstrict")
                .config("hive.metastore.uris", "thrift://hivemetastore:9083")
                .enableHiveSupport()
                .master("local")
                .getOrCreate();
        spark.sql("select * from hudi_db.hudi_test_order").show();

It produces the following exception:

AssertionError: assertion failed: No plan for HiveTableRelation [`hudi_db`.`hudi_test_order` … (on the current master branch)
org.apache.spark.sql.AnalysisException: Table or view not found: `hudi_db`.`hudi_test_order`;  (on Spark v2.4.4)

The reason is that SparkContext#getOrCreate(SparkConf) returns the active context, which carries only the previously applied Spark config, while the input SparkConf is the newer one that includes both the previous config and the new options. enableHiveSupport sets the option ("spark.sql.catalogImplementation", "hive") when the SparkSession is created, but this setting is missed: SharedState loads its conf from the already-created SparkContext and therefore misses the Hive catalog.

  was:
reproduce as follow code:

 

        SparkConf sparkConf = new SparkConf().setAppName("hello");

        sparkConf.set("spark.master", "local");

 

        JavaSparkContext jssc = new JavaSparkContext(sparkConf);

   

        spark = SparkSession.builder()

                .config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")

                .config("hive.exec.dynamici.partition", 
true).config("hive.exec.dynamic.partition.mode", "nonstrict")

                .config("hive.metastore.uris", "thrift://hivemetastore:9083")

                .enableHiveSupport() 

                .master("local")

                .getOrCreate();

 

       spark.sql("select * from hudi_db.hudi_test_order").show();

 

 it will produce follow Exception     AssertionError: assertion failed: No plan 
for HiveTableRelation [`hudi_db`.`hudi_test_order` … (at current master branch)

 

                                org.apache.spark.sql.AnalysisException: Table 
or view not found: `hudi_db`.`hudi_test_order`;  (at spark v2.4.4)

 

 

 The reason is SparkContext#getOrCreate(SparkConf) will return activeContext 
that include previous spark config if it has

 but the input SparkConf is the newest which include previous spark config and 
options.

 

 enableHiveSupport will set options (“spark.sql.catalogImplementation", "hive”) 
when spark session created it will miss this conf

 SharedState load conf from sparkContext and will miss hive catalog


> enableHiveSupport is invalid after sparkContext that without hive support 
> created 
> --
>
> Key: SPARK-33578
> URL: https://issues.apache.org/jira/browse/SPARK-33578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: steven zhang
>Priority: Minor
> Fix For: 3.1.0
>
>
> reproduce as follow code:
>         SparkConf sparkConf = new SparkConf().setAppName("hello");
>         sparkConf.set("spark.master", "local");
>         JavaSparkContext jssc = new JavaSparkContext(sparkConf);
>         spark = SparkSession.builder()
>                 .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>                 .config("hive.exec.dynamici.partition", 
> true).config("hive.exec.dynamic.partition.mode", "nonstrict")
>                 .config("hive.metastore.uris", "thrift://hivemetastore:9083")
>                 .enableHiveSupport()
>                 .master("local")
>                 .getOrCreate();
>        spark.sql("select * from hudi_db.hudi_test_order").show();
>  
>  it will produce follow Exception    
> AssertionError: assertion failed: No plan for HiveTableRelation 
> [`hudi_db`.`hudi_test_order` … (at current master branch)  
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `hudi_db`.`hudi_test_order`;  (at spark v2.4.4)
>   
>  the reason is SparkContext#getOrCreate(SparkConf) will return activeContext 
> that include previous spark config if it has
> but the input SparkConf is the newest which include previous spark config and 
> options.
>   enableHiveSupport will set options (“spark.sql.catalogImplementation", 
> "hive”) when spark session created it will miss this conf
> SharedState 

[jira] [Created] (SPARK-33578) enableHiveSupport is invalid after sparkContext that without hive support created

2020-11-27 Thread steven zhang (Jira)
steven zhang created SPARK-33578:


 Summary: enableHiveSupport is invalid after sparkContext that 
without hive support created 
 Key: SPARK-33578
 URL: https://issues.apache.org/jira/browse/SPARK-33578
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: steven zhang
 Fix For: 3.1.0


reproduce as follow code:

 

        SparkConf sparkConf = new SparkConf().setAppName("hello");

        sparkConf.set("spark.master", "local");

 

        JavaSparkContext jssc = new JavaSparkContext(sparkConf);

   

        spark = SparkSession.builder()

                .config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")

                .config("hive.exec.dynamici.partition", 
true).config("hive.exec.dynamic.partition.mode", "nonstrict")

                .config("hive.metastore.uris", "thrift://hivemetastore:9083")

                .enableHiveSupport() 

                .master("local")

                .getOrCreate();

 

       spark.sql("select * from hudi_db.hudi_test_order").show();

 

 it will produce follow Exception     AssertionError: assertion failed: No plan 
for HiveTableRelation [`hudi_db`.`hudi_test_order` … (at current master branch)

 

                                org.apache.spark.sql.AnalysisException: Table 
or view not found: `hudi_db`.`hudi_test_order`;  (at spark v2.4.4)

 

 

 The reason is SparkContext#getOrCreate(SparkConf) will return activeContext 
that include previous spark config if it has

 but the input SparkConf is the newest which include previous spark config and 
options.

 

 enableHiveSupport will set options (“spark.sql.catalogImplementation", "hive”) 
when spark session created it will miss this conf

 SharedState load conf from sparkContext and will miss hive catalog
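Given the analysis above, a hedged workaround sketch (not part of any fix in this ticket): create the Hive-enabled SparkSession first, so that spark.sql.catalogImplementation=hive is already present in the SparkContext's conf, and reuse its context instead of pre-creating one.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("hello")
  .config("hive.metastore.uris", "thrift://hivemetastore:9083")
  .enableHiveSupport() // sets spark.sql.catalogImplementation=hive before any context exists
  .getOrCreate()

// Reuse the session's context rather than building a JavaSparkContext up front,
// so SharedState sees the Hive catalog setting.
val sc = spark.sparkContext
spark.sql("select * from hudi_db.hudi_test_order").show()
{code}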






[jira] [Commented] (SPARK-33557) spark.storage.blockManagerSlaveTimeoutMs default value does not follow spark.network.timeout value when the latter was changed

2020-11-27 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239646#comment-17239646
 ] 

Yang Jie commented on SPARK-33557:
--

I'm not sure whether the configurations related to "spark.network.timeout" really meet the expected behavior. This needs to be investigated.
 

> spark.storage.blockManagerSlaveTimeoutMs default value does not follow 
> spark.network.timeout value when the latter was changed
> --
>
> Key: SPARK-33557
> URL: https://issues.apache.org/jira/browse/SPARK-33557
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ohad
>Priority: Minor
>
> According to the documentation "spark.network.timeout" is the default timeout 
> for "spark.storage.blockManagerSlaveTimeoutMs" which implies that when the 
> user sets "spark.network.timeout"  the effective value of 
> "spark.storage.blockManagerSlaveTimeoutMs" should also be changed if it was 
> not specifically changed.
> However this is not the case since the default value of 
> "spark.storage.blockManagerSlaveTimeoutMs" is always the default value of 
> "spark.network.timeout" (120s)
>  
> "spark.storage.blockManagerSlaveTimeoutMs" is defined in the package object 
> of "org.apache.spark.internal.config" as follows:
> {code:java}
> private[spark] val STORAGE_BLOCKMANAGER_SLAVE_TIMEOUT =
>   ConfigBuilder("spark.storage.blockManagerSlaveTimeoutMs")
> .version("0.7.0")
> .timeConf(TimeUnit.MILLISECONDS)
> .createWithDefaultString(Network.NETWORK_TIMEOUT.defaultValueString)
> {code}
> So it seems that its default value is indeed "fixed" to the default value of 
> "spark.network.timeout".
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33557) spark.storage.blockManagerSlaveTimeoutMs default value does not follow spark.network.timeout value when the latter was changed

2020-11-27 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239631#comment-17239631
 ] 

Yang Jie commented on SPARK-33557:
--

It seems that changing the value of "spark.network.timeout" doesn't really change 
the value of STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT; where such a relationship 
exists, it is maintained in code.

For example, the treatment of "spark.shuffle.io.connectionTimeout" is as 
follows:

 
{code:java}
/** Connect timeout in milliseconds. Default 120 secs. */
public int connectionTimeoutMs() {
  long defaultNetworkTimeoutS = JavaUtils.timeStringAsSec(
      conf.get("spark.network.timeout", "120s"));
  long defaultTimeoutMs = JavaUtils.timeStringAsSec(
      conf.get(SPARK_NETWORK_IO_CONNECTIONTIMEOUT_KEY, defaultNetworkTimeoutS + "s")) * 1000;
  return (int) defaultTimeoutMs;
}
{code}
 

 

But it seems that there is no similar treatment for 
STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT in HeartbeatReceiver and 
MesosCoarseGrainedSchedulerBackend:

 
{code:java}
private val executorTimeoutMs = 
sc.conf.get(config.STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT)
{code}
 

 
{code:java}
mesosExternalShuffleClient.get
  .registerDriverWithShuffleService(
    agent.hostname,
    externalShufflePort,
    sc.conf.get(config.STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT),
    sc.conf.get(config.EXECUTOR_HEARTBEAT_INTERVAL))
{code}
 

Maybe this needs to be fixed with code changes.
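
One possible shape of such a change, mirroring the connectionTimeoutMs() fallback above (a sketch only, with a hypothetical helper name, not the actual patch):

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.network.util.JavaUtils

object TimeoutFallbackSketch {
  // Hypothetical helper: resolve the block manager timeout the same way
  // connectionTimeoutMs() does, so an explicitly set spark.network.timeout is honored
  // whenever spark.storage.blockManagerSlaveTimeoutMs itself is not set.
  def blockManagerSlaveTimeoutMs(conf: SparkConf): Long = {
    val defaultNetworkTimeout = conf.get("spark.network.timeout", "120s")
    JavaUtils.timeStringAsMs(
      conf.get("spark.storage.blockManagerSlaveTimeoutMs", defaultNetworkTimeout))
  }
}
{code}

Callers such as HeartbeatReceiver could then resolve the timeout through a helper like this instead of relying on the config entry's static default.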

 

 

> spark.storage.blockManagerSlaveTimeoutMs default value does not follow 
> spark.network.timeout value when the latter was changed
> --
>
> Key: SPARK-33557
> URL: https://issues.apache.org/jira/browse/SPARK-33557
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ohad
>Priority: Minor
>
> According to the documentation "spark.network.timeout" is the default timeout 
> for "spark.storage.blockManagerSlaveTimeoutMs" which implies that when the 
> user sets "spark.network.timeout"  the effective value of 
> "spark.storage.blockManagerSlaveTimeoutMs" should also be changed if it was 
> not specifically changed.
> However this is not the case since the default value of 
> "spark.storage.blockManagerSlaveTimeoutMs" is always the default value of 
> "spark.network.timeout" (120s)
>  
> "spark.storage.blockManagerSlaveTimeoutMs" is defined in the package object 
> of "org.apache.spark.internal.config" as follows:
> {code:java}
> private[spark] val STORAGE_BLOCKMANAGER_SLAVE_TIMEOUT =
>   ConfigBuilder("spark.storage.blockManagerSlaveTimeoutMs")
> .version("0.7.0")
> .timeConf(TimeUnit.MILLISECONDS)
> .createWithDefaultString(Network.NETWORK_TIMEOUT.defaultValueString)
> {code}
> So it seems that its default value is indeed "fixed" to the default value of 
> "spark.network.timeout".
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28646) Allow usage of `count` only for parameterless aggregate function

2020-11-27 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239619#comment-17239619
 ] 

jiaan.geng commented on SPARK-28646:


I will take a look!

> Allow usage of `count` only for parameterless aggregate function
> 
>
> Key: SPARK-28646
> URL: https://issues.apache.org/jira/browse/SPARK-28646
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Currently, Spark allows calls to `count` even for non parameterless aggregate 
> function. For example, the following query actually works:
> {code:sql}SELECT count() OVER () FROM tenk1;{code}
> In PgSQL, on the other hand, the following error is thrown:
> {code:sql}ERROR:  count(*) must be used to call a parameterless aggregate 
> function{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28645) Throw an error on window redefinition

2020-11-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28645.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30512
[https://github.com/apache/spark/pull/30512]

> Throw an error on window redefinition
> -
>
> Key: SPARK-28645
> URL: https://issues.apache.org/jira/browse/SPARK-28645
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently in Spark one could redefine a window. For instance:
> {code:sql}select count(*) OVER w FROM tenk1 WINDOW w AS (ORDER BY unique1), w 
> AS (ORDER BY unique1);{code}
> The window `w` is defined two times. In PgSQL, on the other hand, an error 
> will be thrown:
> {code:sql}ERROR:  window "w" is already defined{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28645) Throw an error on window redefinition

2020-11-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28645:
---

Assignee: jiaan.geng

> Throw an error on window redefinition
> -
>
> Key: SPARK-28645
> URL: https://issues.apache.org/jira/browse/SPARK-28645
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Assignee: jiaan.geng
>Priority: Major
>
> Currently in Spark one could redefine a window. For instance:
> {code:sql}select count(*) OVER w FROM tenk1 WINDOW w AS (ORDER BY unique1), w 
> AS (ORDER BY unique1);{code}
> The window `w` is defined two times. In PgSQL, on the other hand, an error 
> will be thrown:
> {code:sql}ERROR:  window "w" is already defined{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView

2020-11-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33522:
---

Assignee: Terry Kim

> Improve exception messages while handling UnresolvedTableOrView
> ---
>
> Key: SPARK-33522
> URL: https://issues.apache.org/jira/browse/SPARK-33522
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>
> Improve exception messages while handling UnresolvedTableOrView.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView

2020-11-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33522.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30475
[https://github.com/apache/spark/pull/30475]

> Improve exception messages while handling UnresolvedTableOrView
> ---
>
> Key: SPARK-33522
> URL: https://issues.apache.org/jira/browse/SPARK-33522
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.1.0
>
>
> Improve exception messages while handling UnresolvedTableOrView.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.

2020-11-27 Thread Darshat (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darshat updated SPARK-33576:

Description: 
Hello,

We are using Databricks on Azure to process a large amount of ecommerce data. 
The Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12.

During processing, there is a groupby operation on the DataFrame that 
consistently gets an exception of this type:

 

PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'. Full traceback below:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 654, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 646, in process
    serializer.dump_stream(out_iter, outfile)
  File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in dump_stream
    timely_flush_timeout_ms=self.timely_flush_timeout_ms)
  File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in dump_stream
    for batch in iterator:
  File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in init_stream_yield_batches
    for series in iterator:
  File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in load_stream
    for batch in batches:
  File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in load_stream
    for batch in batches:
  File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 412, in __iter__
  File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Invalid IPC message: negative bodyLength

 

Code that causes this:

x = df.groupby('providerid').apply(domain_features)
display(x.info())

Dataframe size - 22 million rows, 31 columns
 One of the columns is a string ('providerid') on which we do a groupby 
followed by an apply  operation. There are 3 distinct provider ids in this set. 
While trying to enumerate/count the results, we get this exception.

We've put all possible checks in the code for null values, or corrupt data and 
we are not able to track this to application level code. I hope we can get some 
help troubleshooting this as this is a blocker for rolling out at scale.

The cluster has 8 nodes + driver, all 28GB RAM. I can provide any other 
settings that could be useful. 
 Hope to get some insights into the problem. 

Thanks,

Darshat Shah

  was:
Hello,

We are using Databricks on Azure to process large amount of ecommerce data. 
Databricks runtime is 7.3 which includes Apache spark 3.0.1 and Scala 2.12.

During processing, there is a groupby operation on the DataFrame that 
consistently gets an exception of this type:

 

{color:#FF}PythonException: An exception was thrown from a UDF: 'OSError: 
Invalid IPC message: negative bodyLength'. Full traceback below: Traceback 
(most recent call last): File "/databricks/spark/python/pyspark/worker.py", 
line 654, in main process() File "/databricks/spark/python/pyspark/worker.py", 
line 646, in process serializer.dump_stream(out_iter, outfile) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in 
dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in 
dump_stream for batch in iterator: File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in 
init_stream_yield_batches for series in iterator: File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in 
load_stream for batch in batches: File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in 
load_stream for batch in batches: File 
"/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in 
load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in __iter__ 
File "pyarrow/ipc.pxi", line 432, in 
pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", line 
99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative 
bodyLength{color}

 

Code that causes this:

{color:#57d9a3}## df has 22 million rows and 3 distinct provider ids. Domain 
features adds couple of computed columns to the dataframe{color}
{color:#FF}x = df.groupby('providerid').apply(domain_features){color}

{color:#FF}display(x.info()){color}

We've put all possible checks in the code for null values, or corrupt data and 
we are not able to track this to application level code. I hope we can get some 
help troubleshooting this as this is a blocker for rolling out at scale.

Dataframe size - 22 million rows, 31 columns
One of the columns is a string ('providerid') on which we do a groupby followed 
by an 

[jira] [Commented] (SPARK-33577) Add support for V1Table in stream writer table API

2020-11-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239577#comment-17239577
 ] 

Apache Spark commented on SPARK-33577:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/30521

> Add support for V1Table in stream writer table API
> --
>
> Key: SPARK-33577
> URL: https://issues.apache.org/jira/browse/SPARK-33577
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> After SPARK-32896, we have table API for stream writer but only support 
> DataSource v2 tables. Here we add the following enhancements:
>  * Create non-existing tables by default
>  * Support both managed and external V1Tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33577) Add support for V1Table in stream writer table API

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33577:


Assignee: Apache Spark

> Add support for V1Table in stream writer table API
> --
>
> Key: SPARK-33577
> URL: https://issues.apache.org/jira/browse/SPARK-33577
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-32896, we have table API for stream writer but only support 
> DataSource v2 tables. Here we add the following enhancements:
>  * Create non-existing tables by default
>  * Support both managed and external V1Tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33577) Add support for V1Table in stream writer table API

2020-11-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33577:


Assignee: (was: Apache Spark)

> Add support for V1Table in stream writer table API
> --
>
> Key: SPARK-33577
> URL: https://issues.apache.org/jira/browse/SPARK-33577
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> After SPARK-32896, we have table API for stream writer but only support 
> DataSource v2 tables. Here we add the following enhancements:
>  * Create non-existing tables by default
>  * Support both managed and external V1Tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33577) Add support for V1Table in stream writer table API

2020-11-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239576#comment-17239576
 ] 

Apache Spark commented on SPARK-33577:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/30521

> Add support for V1Table in stream writer table API
> --
>
> Key: SPARK-33577
> URL: https://issues.apache.org/jira/browse/SPARK-33577
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> After SPARK-32896, we have table API for stream writer but only support 
> DataSource v2 tables. Here we add the following enhancements:
>  * Create non-existing tables by default
>  * Support both managed and external V1Tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33577) Add support for V1Table in stream writer table API

2020-11-27 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-33577:

Description: 
After SPARK-32896, we have table API for stream writer but only support 
DataSource v2 tables. Here we add the following enhancements:
 * Create non-existing tables by default
 * Support both managed and external V1Tables

  was:
After SPARK-32896, we have table API for stream writer but only support 
DataSource v2 tables. Here we add the following supports:
 * Create non-existing tables by default
 * Support both managed and external V1Tables


> Add support for V1Table in stream writer table API
> --
>
> Key: SPARK-33577
> URL: https://issues.apache.org/jira/browse/SPARK-33577
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> After SPARK-32896, we have table API for stream writer but only support 
> DataSource v2 tables. Here we add the following enhancements:
>  * Create non-existing tables by default
>  * Support both managed and external V1Tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33577) Add support for V1Table in stream writer table API

2020-11-27 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-33577:
---

 Summary: Add support for V1Table in stream writer table API
 Key: SPARK-33577
 URL: https://issues.apache.org/jira/browse/SPARK-33577
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Yuanjian Li


After SPARK-32896, we have a table API for the stream writer, but it only supports 
DataSource v2 tables. Here we add the following enhancements (a usage sketch follows 
this list):
 * Create non-existing tables by default
 * Support both managed and external V1Tables
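
A minimal usage sketch of the stream writer table API referenced above (a sketch under assumptions: app, path, and table names are illustrative, and accepting V1 or not-yet-existing tables is the behavior this issue proposes, not what is guaranteed today):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("toTable-sketch").getOrCreate()

// Built-in "rate" source just to have a streaming DataFrame to write.
val events = spark.readStream.format("rate").load()

val query = events.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoints/events")
  .toTable("default.events_v1")  // with this change, a managed or external V1 table,
                                 // or a table that does not exist yet, should work here

query.awaitTermination()
{code}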



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33486) Collapse Partial and Final Aggregation into Complete Aggregation mode

2020-11-27 Thread Prakhar Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakhar Jain updated SPARK-33486:
-
Issue Type: Improvement  (was: Task)

> Collapse Partial and Final Aggregation into Complete Aggregation mode
> -
>
> Key: SPARK-33486
> URL: https://issues.apache.org/jira/browse/SPARK-33486
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.0.1
>Reporter: Prakhar Jain
>Priority: Major
>
> We should merge the Partial and Final Aggregation into one if there is no 
> exchange between them.
>  
> Example: select col1, max(col2) from t1 join t2 on col1 group by col1
> In this case, after the SortMergeJoin, Spark will do a PartialAggregation and 
> then a FinalAggregation. So it will create hash tables two times, which is not 
> required. If there is a lot of data after the join with many distinct col1 values, 
> then there is also a possibility of spill in HashAggregateExec. So the spill will 
> also happen twice, which can be avoided. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33486) Collapse Partial and Final Aggregation into Complete Aggregation mode

2020-11-27 Thread Prakhar Jain (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239574#comment-17239574
 ] 

Prakhar Jain commented on SPARK-33486:
--

[~dongjoon] Sure. Updating the Issue Type to Improvement.

> Collapse Partial and Final Aggregation into Complete Aggregation mode
> -
>
> Key: SPARK-33486
> URL: https://issues.apache.org/jira/browse/SPARK-33486
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.0.1
>Reporter: Prakhar Jain
>Priority: Major
>
> We should merge the Partial and Final Aggregation into one if there is no 
> exchange between them.
>  
> Example: select col1, max(col2) from t1 join t2 on col1 group by col1
> In this case, after the SortMergeJoin, Spark will do a PartialAggregation and 
> then a FinalAggregation. So it will create hash tables two times, which is not 
> required. If there is a lot of data after the join with many distinct col1 values, 
> then there is also a possibility of spill in HashAggregateExec. So the spill will 
> also happen twice, which can be avoided. 
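
To see the two-step aggregation described above, a minimal sketch (it assumes t1 and t2 already exist as Spark tables with columns col1 and col2, and spells out the join condition from the example explicitly):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("agg-plan-sketch").getOrCreate()

// EXPLAIN the example query from the description to inspect the plan shape.
spark.sql(
  """
    |SELECT t1.col1, max(t2.col2)
    |FROM t1 JOIN t2 ON t1.col1 = t2.col1
    |GROUP BY t1.col1
  """.stripMargin).explain()
{code}

Per the description, the plan contains a partial-mode HashAggregate and a final-mode HashAggregate stacked above the SortMergeJoin with no Exchange in between, which is the pattern the proposed change would collapse into a single complete-mode aggregate.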



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org