[jira] [Updated] (SPARK-33418) TaskSchedulerImpl: Check pending tasks in advance when resource offers

2020-11-10 Thread dingbei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dingbei updated SPARK-33418:

Description: 
It begins with the need to start a large number of Spark Streaming receivers. *The 
launch time gets very long once there are more than 300 receivers.* I will show the 
test data I collected and how I improved this.

*Tests preparation*

Every executor has two cores (one for the receiver and the other to process each 
batch of data). I observed the launch time of all receivers through the Spark web UI 
(the duration from when the first receiver started to when the last one started).
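
The exact submission settings are not given in the report, so the SparkConf sketch below is only an assumption that mirrors the described setup (2 cores per executor, one executor per receiver); the application name and property values are placeholders.

{code:scala}
import org.apache.spark.SparkConf

// Assumed configuration only: the report states 2 cores per executor and one
// receiver per executor, so these properties mirror that setup for the 500-receiver test.
object ReceiverTestConf {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("receiver-launch-test")     // placeholder name
      .set("spark.executor.cores", "2")       // one core for the receiver, one for batch processing
      .set("spark.executor.instances", "500") // one receiver per executor
    conf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }
  }
}
{code}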

*Tests and data*

At first, we set the number of executors to 200, which means starting 200 receivers, 
and everything went well. It took about 50s to launch all receivers. (pic 1)

Then we set the number of executors to 500, which means starting 500 receivers. The 
launch time grew to around 5 minutes. (pic 2)

*Digging into the source code*

Then I started to look for the reason in the source code. I used a thread dump to 
check which methods take a relatively long time (pic 3), and then added logs around 
these methods. In the end I found that the loop in TaskSchedulerImpl.resourceOffers 
executes more than 60. (pic 4)

*Solution*

The loop in TaskSchedulerImpl.resourceOffers iterates over all non-zombie 
TaskSetManagers in the Pool's queue. Normally the queue stays small, because a 
TaskSetManager is removed once all of its tasks are done. But for Spark Streaming 
jobs, receivers are wrapped as non-stop jobs, which means their TaskSetManagers stay 
in the queue until the application finishes. For example, when the 10th receiver is 
being launched the queue size is 10, so the loop iterates 10 times; when the 500th 
receiver is being launched, it iterates 500 times. However, 499 of those iterations 
are unnecessary, because their tasks are already running.

When I dug deeper into the code, I found that whether a TaskSetManager still has 
pending tasks is only decided inside TaskSetManager.dequeueTaskFromList (pic 5), 
which is far away from the loop in TaskSchedulerImpl.resourceOffers. So I moved the 
pending-task check ahead, into the loop in TaskSchedulerImpl.resourceOffers (pic 6), 
and I also took speculation mode into account. The sketch below illustrates the idea.
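
Below is a minimal, self-contained sketch of that idea, not the actual Spark code: the names SimpleTaskSetManager, pendingTasks and speculatableTasks are illustrative stand-ins for the real TaskSetManager internals.

{code:scala}
// Toy model only: skip task sets with nothing left to launch before the offer loop.
object ResourceOfferSketch {
  // Stand-in for a TaskSetManager; pendingTasks counts tasks that have not been launched yet.
  case class SimpleTaskSetManager(name: String, pendingTasks: Int, speculatableTasks: Int = 0) {
    // With speculation enabled, a task set may still need offers even when no regular task is pending.
    def needsOffers(speculationEnabled: Boolean): Boolean =
      pendingTasks > 0 || (speculationEnabled && speculatableTasks > 0)
  }

  def main(args: Array[String]): Unit = {
    val speculationEnabled = false
    // 499 long-running receiver task sets with nothing left to launch, plus one new receiver.
    val queue = (1 to 499).map(i => SimpleTaskSetManager(s"receiver-$i", pendingTasks = 0)) :+
      SimpleTaskSetManager("receiver-500", pendingTasks = 1)

    // Check pending tasks up front, so the per-offer loop only visits task sets that can use an offer.
    val needOffers = queue.filter(_.needsOffers(speculationEnabled))

    println(s"task sets in queue: ${queue.size}, task sets that need offers: ${needOffers.size}")
    needOffers.foreach(tsm => println(s"offering resources to ${tsm.name}"))
  }
}
{code}

In this toy model only the one task set that still has pending work is visited, instead of all 500 task sets in the queue.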

*Conclusion*

I think the Spark contributors had not considered a scenario where a lot of jobs run 
at the same time, which I know is unusual, but it is still a good case to cover. We 
managed to reduce the launch time of all receivers to around 50s (500 receivers).

  was:
It begins with the needs to  start a lot of spark streaming receivers .  *The 
launch time gets super long when it comes to more than 300 receivers.* I will 
show tests data I did and how I improved this.

*Tests preparation*

There are two cores exists in every executors.(one for receiver and the other 
one to process every batch of datas). I observed launch time of all receivers 
through  spark web UI (duration between the first receiver started to the last 
one started).

*Tests and data*

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. It takes about 50s to launch all receivers.

Then we set the number of executors to 500 which means to start 500 receivers. 
The launch time became around 5 mins.

 *Dig into souce code*

Then I start to look for the reason in the source code.  I use Thread dump to 
check which methods takes relatively long time. Then I type logs between these 
methods. At last I find that The loop in TaskSchedulerImpl.resourceOffers will 
executes more than 


> TaskSchedulerImpl: Check pending tasks in advance when resource offers
> --
>
> Key: SPARK-33418
> URL: https://issues.apache.org/jira/browse/SPARK-33418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dingbei
>Priority: Major
>
> It begins with the needs to  start a lot of spark streaming receivers .  *The 
> launch time gets super long when it comes to more than 300 receivers.* I will 
> show tests data I did and how I improved this.
> *Tests preparation*
> There are two cores exists in every executors.(one for receiver and the other 
> one to process every batch of datas). I observed launch time of all receivers 
> through  spark web UI (duration between the first receiver started to the 
> last one started).
> *Tests and data*
> At first, we set the number of executors to 200 which means to start 200 
> receivers and everything goes well. It takes about 50s to launch all 
> receivers.({color:#FF}pic 1{color})
> Then we set the number of e

[jira] [Updated] (SPARK-33418) TaskSchedulerImpl: Check pending tasks in advance when resource offers

2020-11-10 Thread dingbei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dingbei updated SPARK-33418:

Description: 
It begins with the needs to  start a lot of spark streaming receivers .  *The 
launch time gets super long when it comes to more than 300 receivers.* I will 
show tests data I did and how I improved this.

*Tests preparation*

There are two cores exists in every executors.(one for receiver and the other 
one to process every batch of datas). I observed launch time of all receivers 
through  spark web UI (duration between the first receiver started to the last 
one started).

*Tests and data*

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. It takes about 50s to launch all receivers.

Then we set the number of executors to 500 which means to start 500 receivers. 
The launch time became around 5 mins.

 *Dig into souce code*

Then I start to look for the reason in the source code.  I use Thread dump to 
check which methods takes relatively long time. Then I type logs between these 
methods. At last I find that The loop in TaskSchedulerImpl.resourceOffers will 
executes more than 

  was:
It begins with the needs to  start a lot of spark streaming receivers .  *The 
launch time gets super long when it comes to more than 300 receivers.* I will 
show tests data I did and how I improved this.

*Tests preparation*

There are two cores exists in every executors.(one for receiver and the other 
one to process every batch of datas). I observed launch time of all receivers 
through  spark web UI (duration between the first receiver started to the last 
one started).

*Tests and data*

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. It takes about 50s to launch all receivers.

Then we set the number of executors to 500 which means to start 500 receivers. 
The launch time became around 5 mins.

 *Dig into souce code*

Then I start to look for the reason in the source code.  I use Thread dump to 
check which methods takes relatively long time. Then I type logs between these 
methods. At last I find that The loop in 


> TaskSchedulerImpl: Check pending tasks in advance when resource offers
> --
>
> Key: SPARK-33418
> URL: https://issues.apache.org/jira/browse/SPARK-33418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dingbei
>Priority: Major
>
> It begins with the needs to  start a lot of spark streaming receivers .  *The 
> launch time gets super long when it comes to more than 300 receivers.* I will 
> show tests data I did and how I improved this.
> *Tests preparation*
> There are two cores exists in every executors.(one for receiver and the other 
> one to process every batch of datas). I observed launch time of all receivers 
> through  spark web UI (duration between the first receiver started to the 
> last one started).
> *Tests and data*
> At first, we set the number of executors to 200 which means to start 200 
> receivers and everything goes well. It takes about 50s to launch all 
> receivers.
> Then we set the number of executors to 500 which means to start 500 
> receivers. The launch time became around 5 mins.
>  *Dig into souce code*
> Then I start to look for the reason in the source code.  I use Thread dump to 
> check which methods takes relatively long time. Then I type logs between 
> these methods. At last I find that The loop in 
> TaskSchedulerImpl.resourceOffers will executes more than 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33418) TaskSchedulerImpl: Check pending tasks in advance when resource offers

2020-11-10 Thread dingbei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dingbei updated SPARK-33418:

Description: 
It begins with the needs to  start a lot of spark streaming receivers .  *The 
launch time gets super long when it comes to more than 300 receivers.* I will 
show tests data I did and how I improved this.

*Tests preparation*

There are two cores exists in every executors.(one for receiver and the other 
one to process every batch of datas). I observed launch time of all receivers 
through  spark web UI (duration between the first receiver started to the last 
one started).

*Tests and data*

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. It takes about 50s to launch all receivers.

Then we set the number of executors to 500 which means to start 500 receivers. 
The launch time became around 5 mins.

 *Dig into souce code*

Then I start to look for the reason in the source code.  I use Thread dump to 
check which methods takes relatively long time. Then I type logs between these 
methods. At last I find that The loop in 

  was:
It begins with the needs to  start a lot of spark streaming receivers .  *The 
launch time gets super long when it comes to more than 300 receivers.* I will 
show tests data I did and how I improved this.

*Tests preparation*

There are two cores exists in every executors.(one for receiver and the other 
one to process every batch of datas). I observed launch time of all receivers 
through  spark web UI (duration between the first receiver started to the last 
one started).

*Tests and data*

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. launch time is around 50s.processing time 
for every batch is around 10s.(picture 1)

Then we set the number of executors to 500 which means to start 500 receivers. 
The launch time became around 5 mins. processing time for every batch is around 
2mins.(picture 2)

 

Then I start to look for the reason in the source code.  I use Thread dump to 
check which methods takes relatively long time.


> TaskSchedulerImpl: Check pending tasks in advance when resource offers
> --
>
> Key: SPARK-33418
> URL: https://issues.apache.org/jira/browse/SPARK-33418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dingbei
>Priority: Major
>
> It begins with the needs to  start a lot of spark streaming receivers .  *The 
> launch time gets super long when it comes to more than 300 receivers.* I will 
> show tests data I did and how I improved this.
> *Tests preparation*
> There are two cores exists in every executors.(one for receiver and the other 
> one to process every batch of datas). I observed launch time of all receivers 
> through  spark web UI (duration between the first receiver started to the 
> last one started).
> *Tests and data*
> At first, we set the number of executors to 200 which means to start 200 
> receivers and everything goes well. It takes about 50s to launch all 
> receivers.
> Then we set the number of executors to 500 which means to start 500 
> receivers. The launch time became around 5 mins.
>  *Dig into souce code*
> Then I start to look for the reason in the source code.  I use Thread dump to 
> check which methods takes relatively long time. Then I type logs between 
> these methods. At last I find that The loop in 






[jira] [Updated] (SPARK-33418) TaskSchedulerImpl: Check pending tasks in advance when resource offers

2020-11-10 Thread dingbei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dingbei updated SPARK-33418:

Description: 
It begins with the needs to  start a lot of spark streaming receivers .  *The 
launch time gets super long when it comes to more than 300 receivers.* I will 
show tests data I did and how I improved this.

*Tests preparation*

There are two cores exists in every executors.(one for receiver and the other 
one to process every batch of datas). I observed launch time of all receivers 
through  spark web UI (duration between the first receiver started to the last 
one started).

*Tests and data*

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. launch time is around 50s.processing time 
for every batch is around 10s.(picture 1)

Then we set the number of executors to 500 which means to start 500 receivers. 
The launch time became around 5 mins. processing time for every batch is around 
2mins.(picture 2)

 

Then I start to look for the reason in the source code.  I use Thread dump to 
check which methods takes relatively long time.

  was:
It begins with the needs to  start a lot of spark streaming receivers .  The 
launch time gets super long when it comes to more than 300 receivers. I will 
show Tests data I did and how did I improve this. There are two cores in every 
executors.(one for receiver and the other one to process bacth jobs)

There is two main metrics i will mention below. 

receiver launch time :From the first receiver started to the last one.(observed 
through spark web UI)

 

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. launch time is around 50s.processing time 
for every batch is around 10s.(picture 1)

Then we set the number of executors to 500 which means to start 500 receivers. 
The launch time became around 5 mins. processing time for every batch is around 
2mins.(picture 2)

 

Then I start to look for the reason in the source code.  I use Thread dump to 
check which methods takes relatively long time.


> TaskSchedulerImpl: Check pending tasks in advance when resource offers
> --
>
> Key: SPARK-33418
> URL: https://issues.apache.org/jira/browse/SPARK-33418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dingbei
>Priority: Major
>
> It begins with the needs to  start a lot of spark streaming receivers .  *The 
> launch time gets super long when it comes to more than 300 receivers.* I will 
> show tests data I did and how I improved this.
> *Tests preparation*
> There are two cores exists in every executors.(one for receiver and the other 
> one to process every batch of datas). I observed launch time of all receivers 
> through  spark web UI (duration between the first receiver started to the 
> last one started).
> *Tests and data*
> At first, we set the number of executors to 200 which means to start 200 
> receivers and everything goes well. launch time is around 50s.processing time 
> for every batch is around 10s.(picture 1)
> Then we set the number of executors to 500 which means to start 500 
> receivers. The launch time became around 5 mins. processing time for every 
> batch is around 2mins.(picture 2)
>  
> Then I start to look for the reason in the source code.  I use Thread dump to 
> check which methods takes relatively long time.






[jira] [Updated] (SPARK-33418) TaskSchedulerImpl: Check pending tasks in advance when resource offers

2020-11-10 Thread dingbei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dingbei updated SPARK-33418:

Description: 
It begins with the needs to  start a lot of spark streaming receivers .  The 
launch time gets super long when it comes to more than 300 receivers. I will 
show Tests data I did and how did I improve this. There are two cores in every 
executors.(one for receiver and the other one to process bacth jobs)

There is two main metrics i will mention below. 

receiver launch time :From the first receiver started to the last one.(observed 
through spark web UI)

 

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. launch time is around 50s.processing time 
for every batch is around 10s.(picture 1)

Then we set the number of executors to 500 which means to start 500 receivers. 
The launch time became around 5 mins. processing time for every batch is around 
2mins.(picture 2)

 

Then I start to look for the reason in the source code.  I use Thread dump to 
check which methods takes relatively long time.

  was:
It begins with the needs to  start a lot of spark streaming receivers .  The 
launch time gets super long when it comes to more than 300 receivers. I will 
show Tests data I did and how did I improve this. There are two cores in every 
executors.(one for receiver and the other one to process bacth jobs),and there 
will be a batch of data every 10s.

 

There is two main metrics i will mention below. 

receiver launch time :From the first receiver started to the last one.(observed 
through spark web UI)

batch data processing time:  the time it takes to process a batch of 
data.(observed through spark web UI streaming)

 

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. launch time is around 50s.processing time 
for every batch is around 10s.

Then we set the number of executors to 500 which means to start 500 receivers


> TaskSchedulerImpl: Check pending tasks in advance when resource offers
> --
>
> Key: SPARK-33418
> URL: https://issues.apache.org/jira/browse/SPARK-33418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dingbei
>Priority: Major
>
> It begins with the needs to  start a lot of spark streaming receivers .  The 
> launch time gets super long when it comes to more than 300 receivers. I will 
> show Tests data I did and how did I improve this. There are two cores in 
> every executors.(one for receiver and the other one to process bacth jobs)
> There is two main metrics i will mention below. 
> receiver launch time :From the first receiver started to the last 
> one.(observed through spark web UI)
>  
> At first, we set the number of executors to 200 which means to start 200 
> receivers and everything goes well. launch time is around 50s.processing time 
> for every batch is around 10s.(picture 1)
> Then we set the number of executors to 500 which means to start 500 
> receivers. The launch time became around 5 mins. processing time for every 
> batch is around 2mins.(picture 2)
>  
> Then I start to look for the reason in the source code.  I use Thread dump to 
> check which methods takes relatively long time.






[jira] [Updated] (SPARK-33418) TaskSchedulerImpl: Check pending tasks in advance when resource offers

2020-11-10 Thread dingbei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dingbei updated SPARK-33418:

Description: 
It begins with the needs to  start a lot of spark streaming receivers .  The 
launch time gets super long when it comes to more than 300 receivers. I will 
show Tests data I did and how did I improve this. There are two cores in every 
executors.(one for receiver and the other one to process bacth jobs),and there 
will be a batch of data every 10s.

 

There is two main metrics i will mention below. 

receiver launch time :From the first receiver started to the last one.(observed 
through spark web UI)

batch data processing time:  the time it takes to process a batch of 
data.(observed through spark web UI streaming)

 

At first, we set the number of executors to 200 which means to start 200 
receivers and everything goes well. launch time is around 50s.processing time 
for every batch is around 10s.

Then we set the number of executors to 500 which means to start 500 receivers

> TaskSchedulerImpl: Check pending tasks in advance when resource offers
> --
>
> Key: SPARK-33418
> URL: https://issues.apache.org/jira/browse/SPARK-33418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dingbei
>Priority: Major
>
> It begins with the needs to  start a lot of spark streaming receivers .  The 
> launch time gets super long when it comes to more than 300 receivers. I will 
> show Tests data I did and how did I improve this. There are two cores in 
> every executors.(one for receiver and the other one to process bacth 
> jobs),and there will be a batch of data every 10s.
>  
> There is two main metrics i will mention below. 
> receiver launch time :From the first receiver started to the last 
> one.(observed through spark web UI)
> batch data processing time:  the time it takes to process a batch of 
> data.(observed through spark web UI streaming)
>  
> At first, we set the number of executors to 200 which means to start 200 
> receivers and everything goes well. launch time is around 50s.processing time 
> for every batch is around 10s.
> Then we set the number of executors to 500 which means to start 500 receivers






[jira] [Created] (SPARK-33418) TaskSchedulerImpl: Check pending tasks in advance when resource offers

2020-11-10 Thread dingbei (Jira)
dingbei created SPARK-33418:
---

 Summary: TaskSchedulerImpl: Check pending tasks in advance when 
resource offers
 Key: SPARK-33418
 URL: https://issues.apache.org/jira/browse/SPARK-33418
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: dingbei









[jira] [Resolved] (SPARK-33417) Correct the behaviour of query filters in TPCDSQueryBenchmark

2020-11-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33417.
--
Fix Version/s: 3.1.0
   3.0.2
   2.4.8
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30324

> Correct the behaviour of query filters in TPCDSQueryBenchmark 
> --
>
> Key: SPARK-33417
> URL: https://issues.apache.org/jira/browse/SPARK-33417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> This PR intends to fix the behaviour of query filters in TPCDSQueryBenchmark. 
> We can use an option --query-filter for selecting TPCDS queries to run, e.g., 
> --query-filter q6,q8,q13. But, the current master has a weird behaviour about 
> the option. For example, if we pass --query-filter q6 so as to run the TPCDS 
> q6 only, TPCDSQueryBenchmark runs q6 and q6-v2.7 because the filterQueries 
> method does not respect the name suffix. So, there is no way now to run the 
> TPCDS q6 only.






[jira] [Resolved] (SPARK-33414) Migrate SHOW CREATE TABLE to new resolution framework

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33414.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30321
[https://github.com/apache/spark/pull/30321]

> Migrate SHOW CREATE TABLE to new resolution framework
> -
>
> Key: SPARK-33414
> URL: https://issues.apache.org/jira/browse/SPARK-33414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.1.0
>
>
> Migrate SHOW CREATE TABLE to new resolution framework.






[jira] [Assigned] (SPARK-33414) Migrate SHOW CREATE TABLE to new resolution framework

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33414:
---

Assignee: Terry Kim

> Migrate SHOW CREATE TABLE to new resolution framework
> -
>
> Key: SPARK-33414
> URL: https://issues.apache.org/jira/browse/SPARK-33414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>
> Migrate SHOW CREATE TABLE to new resolution framework.






[jira] [Assigned] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33382:
---

Assignee: Maxim Gekk

> Unify v1 and v2 SHOW TABLES tests
> -
>
> Key: SPARK-33382
> URL: https://issues.apache.org/jira/browse/SPARK-33382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Gather common tests for DSv1 and DSv2 SHOW TABLES command to a common test. 
> Mix this trait to datasource specific test suites.






[jira] [Resolved] (SPARK-33382) Unify v1 and v2 SHOW TABLES tests

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33382.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30287
[https://github.com/apache/spark/pull/30287]

> Unify v1 and v2 SHOW TABLES tests
> -
>
> Key: SPARK-33382
> URL: https://issues.apache.org/jira/browse/SPARK-33382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Gather common tests for DSv1 and DSv2 SHOW TABLES command to a common test. 
> Mix this trait to datasource specific test suites.






[jira] [Commented] (SPARK-26825) Spark Structure Streaming job failing when submitted in cluster mode

2020-11-10 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229719#comment-17229719
 ] 

Vinod KC commented on SPARK-26825:
--

As a workaround, we can set a {{checkpointLocation}} on the stream, e.g. 
{{option("checkpointLocation", "path/to/checkpoint/dir")}}.

That way the same HDFS path is used in both client and cluster mode. A short sketch 
is shown below.
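
A minimal sketch of the workaround, assuming a rate source and a console sink purely for illustration (the source, the sink and the HDFS checkpoint path below are placeholders, not taken from the original report):

{code:scala}
import org.apache.spark.sql.SparkSession

object CheckpointWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-workaround").getOrCreate()

    // Placeholder streaming source; the point is the explicit checkpointLocation option below.
    val stream = spark.readStream.format("rate").load()

    val query = stream.writeStream
      .format("console")
      // Explicit checkpoint location, so client and cluster mode resolve the same HDFS path.
      .option("checkpointLocation", "hdfs:///path/to/checkpoint/dir")
      .start()

    query.awaitTermination()
  }
}
{code}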

> Spark Structure Streaming job failing when submitted in cluster mode
> 
>
> Key: SPARK-26825
> URL: https://issues.apache.org/jira/browse/SPARK-26825
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Andre Araujo
>Priority: Major
>
> I have a structured streaming job that runs successfully when launched in 
> "client" mode. However, when launched in "cluster" mode it fails with the 
> following weird messages on the error log. Note that the path in the error 
> message is actually a local filesystem path that has been mistakenly prefixed 
> with a {{hdfs://}} scheme.
> {code}
> 19/02/01 12:53:14 ERROR streaming.StreamMetadata: Error writing stream 
> metadata StreamMetadata(68f9fb30-5853-49b4-b192-f1e0483e0d95) to 
> hdfs://ns1/data/yarn/nm/usercache/root/appcache/application_1548823131831_0160/container_1548823131831_0160_02_01/tmp/temporary-3789423a-6ded-4084-aab3-3b6301c34e07/metadataorg.apache.hadoop.security.AccessControlException:
>  Permission denied: user=root, access=WRITE, 
> inode="/":hdfs:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:400)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1853)
> {code}
> I dug a little bit into this and here's what I think it's going on:
> # When a new streaming query is created, the {{StreamingQueryManager}} 
> determines the checkpoint location 
> [here|https://github.com/apache/spark/blob/d811369ce23186cbb3208ad665e15408e13fea87/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L216].
>  If neither the user nor the Spark conf specify a checkpoint location, the 
> location is returned by a call to {{Utils.createTempDir(namePrefix = 
> s"temporary").getCanonicalPath}}. 
>Here, I see two issues:
> #* The canonical path returned by {{Utils.createTempDir}} does *not* have a 
> scheme ({{hdfs://}} or {{file://}}), so, it's ambiguous as to what type of 
> file system the path belongs to.
> #* Also note that the path returned by the {{Utils.createTempDir}} call is a 
> local path, not a HDFS path, as the paths returned by the other two 
> conditions. I executed {{Utils.createTempDir}} in a test job, both in cluster 
> and client modes, and the results are these:
> {code}
> *Client mode:*
> java.io.tmpdir=/tmp
> createTempDir(namePrefix = s"temporary") => 
> /tmp/temporary-c51f1466-fd50-40c7-b136-1f2f06672e25
> *Cluster mode:*
> java.io.tmpdir=/yarn/nm/usercache/root/appcache/application_154906473_0029/container_154906473_0029_01_01/tmp/
> createTempDir(namePrefix = s"temporary") => 
> /yarn/nm/usercache/root/appcache/application_154906473_0029/container_154906473_0029_01_01/tmp/temporary-47c13b28-14bd-4d1b-8acc-3e445948415e
> {code}
> # This temporary checkpoint location is then [passed to the 
> constructor|https://github.com/apache/spark/blob/d811369ce23186cbb3208ad665e15408e13fea87/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L276]
>  of the {{MicroBatchExecution}} instance
> # This is the point where [{{resolvedCheckpointRoot}} is 
> calculated|https://github.com/apache/spark/blob/755f9c20761e3db900c6c2b202cd3d2c5bbfb7c0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L89].
>  Here, it's where things start to break: since the path returned by 
> {{Utils.createTempDir}} doesn't have a scheme, and since HDFS is the default 
> filesystem, the code resolves the path as being a HDFS path, rather than a 
> local one, as shown below:
> {code}
> scala> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.fs.Path
> scala> // value returned by the Utils.createTempDir method
> scala> val checkpointRoot = 
> "/yarn/nm/usercache/root/appcache/application_154906473_0029/container_154906473_0029_01_01/tmp/temporary-47c13b28-14bd-4d1b-8acc-3e445948415e"
> checkpointRoot: String = 
> /yarn/nm/usercache/root/appcache/application_154906473_0029/container_154906473_0029_01_01/tmp/temporary-47c13b

[jira] [Assigned] (SPARK-33417) Correct the behaviour of query filters in TPCDSQueryBenchmark

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33417:


Assignee: (was: Apache Spark)

> Correct the behaviour of query filters in TPCDSQueryBenchmark 
> --
>
> Key: SPARK-33417
> URL: https://issues.apache.org/jira/browse/SPARK-33417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This PR intends to fix the behaviour of query filters in TPCDSQueryBenchmark. 
> We can use an option --query-filter for selecting TPCDS queries to run, e.g., 
> --query-filter q6,q8,q13. But, the current master has a weird behaviour about 
> the option. For example, if we pass --query-filter q6 so as to run the TPCDS 
> q6 only, TPCDSQueryBenchmark runs q6 and q6-v2.7 because the filterQueries 
> method does not respect the name suffix. So, there is no way now to run the 
> TPCDS q6 only.






[jira] [Assigned] (SPARK-33417) Correct the behaviour of query filters in TPCDSQueryBenchmark

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33417:


Assignee: Apache Spark

> Correct the behaviour of query filters in TPCDSQueryBenchmark 
> --
>
> Key: SPARK-33417
> URL: https://issues.apache.org/jira/browse/SPARK-33417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This PR intends to fix the behaviour of query filters in TPCDSQueryBenchmark. 
> We can use an option --query-filter for selecting TPCDS queries to run, e.g., 
> --query-filter q6,q8,q13. But, the current master has a weird behaviour about 
> the option. For example, if we pass --query-filter q6 so as to run the TPCDS 
> q6 only, TPCDSQueryBenchmark runs q6 and q6-v2.7 because the filterQueries 
> method does not respect the name suffix. So, there is no way now to run the 
> TPCDS q6 only.






[jira] [Commented] (SPARK-33417) Correct the behaviour of query filters in TPCDSQueryBenchmark

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229685#comment-17229685
 ] 

Apache Spark commented on SPARK-33417:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30324

> Correct the behaviour of query filters in TPCDSQueryBenchmark 
> --
>
> Key: SPARK-33417
> URL: https://issues.apache.org/jira/browse/SPARK-33417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This PR intends to fix the behaviour of query filters in TPCDSQueryBenchmark. 
> We can use an option --query-filter for selecting TPCDS queries to run, e.g., 
> --query-filter q6,q8,q13. But, the current master has a weird behaviour about 
> the option. For example, if we pass --query-filter q6 so as to run the TPCDS 
> q6 only, TPCDSQueryBenchmark runs q6 and q6-v2.7 because the filterQueries 
> method does not respect the name suffix. So, there is no way now to run the 
> TPCDS q6 only.






[jira] [Updated] (SPARK-33417) Correct the behaviour of query filters in TPCDSQueryBenchmark

2020-11-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33417:
-
Description: This PR intends to fix the behaviour of query filters in 
TPCDSQueryBenchmark. We can use an option --query-filter for selecting TPCDS 
queries to run, e.g., --query-filter q6,q8,q13. But, the current master has a 
weird behaviour about the option. For example, if we pass --query-filter q6 so 
as to run the TPCDS q6 only, TPCDSQueryBenchmark runs q6 and q6-v2.7 because 
the filterQueries method does not respect the name suffix. So, there is no way 
now to run the TPCDS q6 only.  (was: This ticket targets at fixing the 
behaviour of query filters in {{TPCDSQueryBenchmark}}. We can use an option 
{{--query-filter}} for selecting TPCDS queries to run, e.g., {{--query-filter 
q6,q8,q13}}. But, the current master has a weird behaviour about the option. 
For example, if we pass {{--query-filter q6}} so as to run the TPCDS q6 only, 
{{TPCDSQueryBenchmark}} runs {{q6}} and {{q6-v2.7}} because the 
{{filterQueries}} method does not respect the name suffix. So, there is no way 
now to run the TPCDS q6 only.)

> Correct the behaviour of query filters in TPCDSQueryBenchmark 
> --
>
> Key: SPARK-33417
> URL: https://issues.apache.org/jira/browse/SPARK-33417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This PR intends to fix the behaviour of query filters in TPCDSQueryBenchmark. 
> We can use an option --query-filter for selecting TPCDS queries to run, e.g., 
> --query-filter q6,q8,q13. But, the current master has a weird behaviour about 
> the option. For example, if we pass --query-filter q6 so as to run the TPCDS 
> q6 only, TPCDSQueryBenchmark runs q6 and q6-v2.7 because the filterQueries 
> method does not respect the name suffix. So, there is no way now to run the 
> TPCDS q6 only.






[jira] [Created] (SPARK-33417) Correct the behaviour of query filters in TPCDSQueryBenchmark

2020-11-10 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-33417:


 Summary: Correct the behaviour of query filters in 
TPCDSQueryBenchmark 
 Key: SPARK-33417
 URL: https://issues.apache.org/jira/browse/SPARK-33417
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.8, 3.0.2, 3.1.0
Reporter: Takeshi Yamamuro


This ticket targets at fixing the behaviour of query filters in 
{{TPCDSQueryBenchmark}}. We can use an option {{--query-filter}} for selecting 
TPCDS queries to run, e.g., {{--query-filter q6,q8,q13}}. But, the current 
master has a weird behaviour about the option. For example, if we pass 
{{--query-filter q6}} so as to run the TPCDS q6 only, {{TPCDSQueryBenchmark}} 
runs {{q6}} and {{q6-v2.7}} because the {{filterQueries}} method does not 
respect the name suffix. So, there is no way now to run the TPCDS q6 only.
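
As a standalone illustration of the intended behaviour (not the benchmark's actual {{filterQueries}} code; the query names and the helper are hypothetical), filtering on the exact, fully suffixed query name keeps {{q6}} and {{q6-v2.7}} apart:

{code:scala}
// Illustration only: exact-name filtering so that "q6" does not also select "q6-v2.7".
object QueryFilterSketch {
  def filterQueries(allQueries: Seq[String], queryFilter: Set[String]): Seq[String] =
    if (queryFilter.isEmpty) allQueries
    else allQueries.filter(queryFilter.contains)

  def main(args: Array[String]): Unit = {
    val allQueries = Seq("q6", "q8", "q13", "q6-v2.7", "q8-v2.7")
    println(filterQueries(allQueries, Set("q6")))      // List(q6)
    println(filterQueries(allQueries, Set("q6-v2.7"))) // List(q6-v2.7)
  }
}
{code}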






[jira] [Resolved] (SPARK-33410) Resolve SQL query reference a column by an alias

2020-11-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-33410.
-
Resolution: Won't Fix

This is not a SQL standard.

> Resolve SQL query reference a column by an alias
> 
>
> Key: SPARK-33410
> URL: https://issues.apache.org/jira/browse/SPARK-33410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This pr add support resolve SQL query reference a column by an alias, for 
> example:
> ```sql
> select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
> ```
> Teradata support this feature: 
> https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ






[jira] [Assigned] (SPARK-33390) Make Literal support char array

2020-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33390:


Assignee: ulysses you

> Make Literal support char array
> ---
>
> Key: SPARK-33390
> URL: https://issues.apache.org/jira/browse/SPARK-33390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> Make Literal support char array.






[jira] [Resolved] (SPARK-33390) Make Literal support char array

2020-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33390.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30295
[https://github.com/apache/spark/pull/30295]

> Make Literal support char array
> ---
>
> Key: SPARK-33390
> URL: https://issues.apache.org/jira/browse/SPARK-33390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>
> Make Literal support char array.






[jira] [Resolved] (SPARK-33404) "date_trunc" expression returns incorrect results

2020-11-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33404.
--
Fix Version/s: 3.1.0
 Assignee: Utkarsh Agarwal
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30303

> "date_trunc" expression returns incorrect results
> -
>
> Key: SPARK-33404
> URL: https://issues.apache.org/jira/browse/SPARK-33404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0
>
>
> `date_trunc` SQL expression returns incorrect results for {{minute}} 
> formatting string.
> Context: The {{minute}} formatting string should truncate the timestamps such 
> that the seconds is set to ZERO.
> Repro (run the following commands in spark-shell):
> {quote}
> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
> {quote}
> Spark currently incorrectly returns 
> {quote}
> 1769-10-17 17:10:02
> {quote}
> against the expected return value of 
> {quote}
> 1769-10-17 17:10:00
> {quote}
> This happens as {{truncTimestamp}} in package 
> {{org.apache.spark.sql.catalyst.util.DateTimeUtils}} incorrectly assumes that 
> time zone offsets can never have the granularity of a second and thus does 
> not account for time zone adjustment when truncating the timestamp to 
> {{minute}}. 
> This assumption is currently used when truncating the timestamps to 
> {{microsecond, millisecond, second, or minute}}. 






[jira] [Updated] (SPARK-33404) "date_trunc" expression returns incorrect results

2020-11-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33404:
-
Affects Version/s: 3.1.0

> "date_trunc" expression returns incorrect results
> -
>
> Key: SPARK-33404
> URL: https://issues.apache.org/jira/browse/SPARK-33404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Utkarsh Agarwal
>Priority: Major
>  Labels: correctness
>
> `date_trunc` SQL expression returns incorrect results for {{minute}} 
> formatting string.
> Context: The {{minute}} formatting string should truncate the timestamps such 
> that the seconds is set to ZERO.
> Repro (run the following commands in spark-shell):
> {quote}
> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
> {quote}
> Spark currently incorrectly returns 
> {quote}
> 1769-10-17 17:10:02
> {quote}
> against the expected return value of 
> {quote}
> 1769-10-17 17:10:00
> {quote}
> This happens as {{truncTimestamp}} in package 
> {{org.apache.spark.sql.catalyst.util.DateTimeUtils}} incorrectly assumes that 
> time zone offsets can never have the granularity of a second and thus does 
> not account for time zone adjustment when truncating the timestamp to 
> {{minute}}. 
> This assumption is currently used when truncating the timestamps to 
> {{microsecond, millisecond, second, or minute}}. 






[jira] [Resolved] (SPARK-33337) Support subexpression elimination in branches of conditional expressions

2020-11-10 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-33337.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30245
[https://github.com/apache/spark/pull/30245]

> Support subexpression elimination in branches of conditional expressions
> 
>
> Key: SPARK-33337
> URL: https://issues.apache.org/jira/browse/SPARK-33337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently we skip subexpression elimination in branches of conditional 
> expressions including {{If}}, {{CaseWhen}}, and {{Coalesce}}. Actually we can 
> do subexpression elimination for such branches if the subexpression is common 
> across all branches.






[jira] [Commented] (SPARK-33415) Column.__repr__ shouldn't encode JVM response

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229567#comment-17229567
 ] 

Apache Spark commented on SPARK-33415:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30322

> Column.__repr__ shouldn't encode JVM response
> -
>
> Key: SPARK-33415
> URL: https://issues.apache.org/jira/browse/SPARK-33415
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the moment PySpark {{Column}} {{encodes}} JVM response in {{__repr__}} 
> method.
> As a result, column names using only ASCII characters get {{b}} prefix
> {code:python}
> >>> from pyspark.sql.functions import col 
> >>>   
> >>>  
> >>> col("abc")
> >>>   
> >>>  
> Column
> {code}
> and the others ugly byte string
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column
> {code}
> This behaviour is inconsistent with other parts of the API, for example:
> {code:python}
> >>> spark.createDataFrame([], "`wąż` long")   
> >>>   
> >>>  
> DataFrame[wąż: bigint]
> {code}
> and Scala
> {code:scala}
> scala> col("wąż")
> res0: org.apache.spark.sql.Column = wąż
> {code}
> and R
> {code:r}
> > column("wąż")
> Column wąż 
> {code}
> Encoding has been originally introduced with SPARK-5859, but it doesn't seem 
> like it is really required.
> Desired behaviour
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column<'wąż'>
> {code}
> or
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column
> {code}






[jira] [Assigned] (SPARK-33415) Column.__repr__ shouldn't encode JVM response

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33415:


Assignee: (was: Apache Spark)

> Column.__repr__ shouldn't encode JVM response
> -
>
> Key: SPARK-33415
> URL: https://issues.apache.org/jira/browse/SPARK-33415
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the moment PySpark {{Column}} {{encodes}} JVM response in {{__repr__}} 
> method.
> As a result, column names using only ASCII characters get {{b}} prefix
> {code:python}
> >>> from pyspark.sql.functions import col 
> >>>   
> >>>  
> >>> col("abc")
> >>>   
> >>>  
> Column
> {code}
> and the others ugly byte string
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column
> {code}
> This behaviour is inconsistent with other parts of the API, for example:
> {code:python}
> >>> spark.createDataFrame([], "`wąż` long")   
> >>>   
> >>>  
> DataFrame[wąż: bigint]
> {code}
> and Scala
> {code:scala}
> scala> col("wąż")
> res0: org.apache.spark.sql.Column = wąż
> {code}
> and R
> {code:r}
> > column("wąż")
> Column wąż 
> {code}
> Encoding has been originally introduced with SPARK-5859, but it doesn't seem 
> like it is really required.
> Desired behaviour
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column<'wąż'>
> {code}
> or
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column
> {code}






[jira] [Assigned] (SPARK-33415) Column.__repr__ shouldn't encode JVM response

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33415:


Assignee: Apache Spark

> Column.__repr__ shouldn't encode JVM response
> -
>
> Key: SPARK-33415
> URL: https://issues.apache.org/jira/browse/SPARK-33415
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> At the moment PySpark {{Column}} {{encodes}} JVM response in {{__repr__}} 
> method.
> As a result, column names using only ASCII characters get {{b}} prefix
> {code:python}
> >>> from pyspark.sql.functions import col 
> >>>   
> >>>  
> >>> col("abc")
> >>>   
> >>>  
> Column
> {code}
> and the others ugly byte string
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column
> {code}
> This behaviour is inconsistent with other parts of the API, for example:
> {code:python}
> >>> spark.createDataFrame([], "`wąż` long")   
> >>>   
> >>>  
> DataFrame[wąż: bigint]
> {code}
> and Scala
> {code:scala}
> scala> col("wąż")
> res0: org.apache.spark.sql.Column = wąż
> {code}
> and R
> {code:r}
> > column("wąż")
> Column wąż 
> {code}
> Encoding has been originally introduced with SPARK-5859, but it doesn't seem 
> like it is really required.
> Desired behaviour
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column<'wąż'>
> {code}
> or
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column
> {code}






[jira] [Commented] (SPARK-33415) Column.__repr__ shouldn't encode JVM response

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229566#comment-17229566
 ] 

Apache Spark commented on SPARK-33415:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30322

> Column.__repr__ shouldn't encode JVM response
> -
>
> Key: SPARK-33415
> URL: https://issues.apache.org/jira/browse/SPARK-33415
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the moment PySpark {{Column}} {{encodes}} JVM response in {{__repr__}} 
> method.
> As a result, column names using only ASCII characters get {{b}} prefix
> {code:python}
> >>> from pyspark.sql.functions import col 
> >>>   
> >>>  
> >>> col("abc")
> >>>   
> >>>  
> Column
> {code}
> and the others ugly byte string
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column
> {code}
> This behaviour is inconsistent with other parts of the API, for example:
> {code:python}
> >>> spark.createDataFrame([], "`wąż` long")   
> >>>   
> >>>  
> DataFrame[wąż: bigint]
> {code}
> and Scala
> {code:scala}
> scala> col("wąż")
> res0: org.apache.spark.sql.Column = wąż
> {code}
> and R
> {code:r}
> > column("wąż")
> Column wąż 
> {code}
> Encoding has been originally introduced with SPARK-5859, but it doesn't seem 
> like it is really required.
> Desired behaviour
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column<'wąż'>
> {code}
> or
> {code:python}
> >>> col("wąż")
> >>>   
> >>>  
> Column
> {code}






[jira] [Created] (SPARK-33415) Column.__repr__ shouldn't encode JVM response

2020-11-10 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-33415:
--

 Summary: Column.__repr__ shouldn't encode JVM response
 Key: SPARK-33415
 URL: https://issues.apache.org/jira/browse/SPARK-33415
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.1.0
Reporter: Maciej Szymkiewicz


At the moment PySpark {{Column}} {{encodes}} JVM response in {{__repr__}} 
method.

As a result, column names using only ASCII characters get {{b}} prefix

{code:python}
>>> from pyspark.sql.functions import col
>>> col("abc")
Column<b'abc'>
{code}

and the others ugly byte string

{code:python}
>>> col("wąż")  
>>> 
>>>  
Column
{code}

This behaviour is inconsistent with other parts of the API, for example:

{code:python}
>>> spark.createDataFrame([], "`wąż` long")
DataFrame[wąż: bigint]
{code}

and Scala

{code:scala}
scala> col("wąż")
res0: org.apache.spark.sql.Column = wąż
{code}

and R

{code:r}
> column("wąż")
Column wąż 
{code}

Encoding was originally introduced with SPARK-5859, but it doesn't seem to be 
really required.

Desired behaviour

{code:python}
>>> col("wąż")  
>>> 
>>>  
Column<'wąż'>
{code}

or

{code:python}
>>> col("wąż")  
>>> 
>>>  
Column
{code}
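
To make the desired change concrete, here is a minimal sketch of a {{__repr__}} without the encoding step. The wrapper class, the {{_jc}} attribute and the fake py4j handle below are simplifying assumptions for illustration only, not the actual pyspark.sql.column module.

{code:python}
class FakeJavaColumn:
    """Stand-in for the py4j handle; only toString() is needed for this demo."""
    def __init__(self, name):
        self._name = name

    def toString(self):
        return self._name


class Column:
    def __init__(self, jc):
        self._jc = jc  # in PySpark this would be a py4j JavaObject

    def __repr__(self):
        # Old behaviour (what the ticket complains about): encoding to bytes
        # yields reprs such as Column<b'abc'> or Column<b'w\xc4\x85\xc5\xbc'>:
        #   return "Column<%s>" % self._jc.toString().encode("utf8")
        # Desired behaviour: keep the column name as a plain string.
        return "Column<'%s'>" % self._jc.toString()


print(Column(FakeJavaColumn("wąż")))  # Column<'wąż'>
{code}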



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33413) SparkUI / Executor Page doesn't work where under external reverse proxy

2020-11-10 Thread Pierre Leresteux (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Leresteux updated SPARK-33413:
-
Description: 
When using the Spark UI under an external reverse proxy (NGinx, Træfik, ...), you 
can configure the Spark UI to be correctly displayed (HTML page and all static 
resources). But on the executors page, when calling the /api/ endpoints, the URL 
doesn't use the Spark proxyBase configuration.

 

Here is an example:

I use Træfik to access the Spark UI via the path /sparkui (a proxy redirect is 
made from [http://acme.com/sparkui] to the Spark UI) and I use this Spark 
config: 
{code:java}
spark.ui.proxyBase /sparkui
spark.ui.proxyRedirectUri /{code}
Using this config, when calling [http://acme.com/sparkui] I'll be redirected to 
[http://acme.com/sparkui/jobs/] and everything is correctly served (the HTML page 
and all static files).

 

But when I go to [http://acme.com/sparkui/executors/], an XHR call is made using 
the URL [http://acme.com/api/v1/applications], so it doesn't go through the 
reverse proxy rule ... the correct URL should be: 
[http://acme.com/sparkui/api/v1/applications] 

 

I discovered in the code (see for example 
[https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193])
 that the Spark UI uses location.origin (in my case 
[http://acme.com]). I think it should use the complete URL 
(minus the last part like /executors/) instead, or receive the proxyBase 
configuration and append it after location.origin ... 

 

 

  was:
When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
can configure SparkUI to be correctly displayed (html page and all static 
resources). But in the executors page, when calling the /api/, the URL don't 
use the Spark proxyBase configuration.

 

Here is an example :

I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
use this Spark config : 


{code:java}
spark.ui.proxyBase /sparkui
spark.ui.proxyRedirectUri /{code}
using this config, when calling [http://acme.com/sparkui] I'll be redirected to 
[http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all static 
files)

 

But I go on [http://acme.com/sparkui/executors/] a XHR call is made using this 
URL [http://acme.com/api/v1/applications] so it doesn't use the reverse proxy 
rule ... the correct URL should be : 
[http://acme.com/sparkui/api/v1/applications] 

 

I discover on the code (here for example 
[https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
 that SparkUI use the location.origin (in my case 
[http://acme.com|http://acme.com%29./]). I think it should use the complete URL 
(under the last part like /executors/) instead or receive the proxyBase 
configuration to add it after the location.origin ... 

 

 


> SparkUI / Executor Page doesn't work where under external reverse proxy
> ---
>
> Key: SPARK-33413
> URL: https://issues.apache.org/jira/browse/SPARK-33413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Pierre Leresteux
>Priority: Minor
>
> When using SparkUI under an external reverse proxy (NGinx, Træfik, ... ), you 
> can configure SparkUI to be correctly displayed (html page and all static 
> resources). But in the executors page, when calling the /api/, the URL don't 
> use the Spark proxyBase configuration.
>  
> Here is an example :
> I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
> redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
> use this Spark config : 
> {code:java}
> spark.ui.proxyBase /sparkui
> spark.ui.proxyRedirectUri /{code}
> using this config, when calling [http://acme.com/sparkui] I'll be redirected 
> to [http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all 
> static files)
>  
> But I go on [http://acme.com/sparkui/executors/] a XHR call is made using 
> this URL [http://acme.com/api/v1/applications] so it doesn't use the reverse 
> proxy rule ... the correct URL should be : 
> [http://acme.com/sparkui/api/v1/applications] 
>  
> I discover on the code (here for example 
> [https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
>  that SparkUI use the location.origin (in my case 
> [http://acme.com|http://acme.com%29./]). I think it should use the complete 
> URL (under the last part like /executors/) instead or receive the proxyBase 
> configuration to add it after the location.origin ... 
>  
>  
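
To illustrate the suggested direction, below is a small sketch of building the REST endpoint from a configured proxy base rather than from location.origin alone. It is written in Python purely for illustration (the real change would live in the UI's JavaScript, e.g. utils.js), and the function and parameter names are assumptions.

{code:python}
def api_base_url(origin, proxy_base=""):
    """Build the REST API root, honouring an optional reverse-proxy prefix."""
    prefix = proxy_base.rstrip("/")
    return "{}{}/api/v1".format(origin.rstrip("/"), prefix)


# Without a proxy base the current behaviour is reproduced:
print(api_base_url("http://acme.com"))               # http://acme.com/api/v1
# With spark.ui.proxyBase=/sparkui the call goes through the proxy rule:
print(api_base_url("http://acme.com", "/sparkui"))   # http://acme.com/sparkui/api/v1
{code}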



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Assigned] (SPARK-33414) Migrate SHOW CREATE TABLE to new resolution framework

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33414:


Assignee: Apache Spark

> Migrate SHOW CREATE TABLE to new resolution framework
> -
>
> Key: SPARK-33414
> URL: https://issues.apache.org/jira/browse/SPARK-33414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> Migrate SHOW CREATE TABLE to the new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33414) Migrate SHOW CREATE TABLE to new resolution framework

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33414:


Assignee: (was: Apache Spark)

> Migrate SHOW CREATE TABLE to new resolution framework
> -
>
> Key: SPARK-33414
> URL: https://issues.apache.org/jira/browse/SPARK-33414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate SHOW CREATE TABLE to the new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33414) Migrate SHOW CREATE TABLE to new resolution framework

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229479#comment-17229479
 ] 

Apache Spark commented on SPARK-33414:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30321

> Migrate SHOW CREATE TABLE to new resolution framework
> -
>
> Key: SPARK-33414
> URL: https://issues.apache.org/jira/browse/SPARK-33414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate SHOW CREATE TABLE to the new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33414) Migrate SHOW CREATE TABLE to new resolution framework

2020-11-10 Thread Terry Kim (Jira)
Terry Kim created SPARK-33414:
-

 Summary: Migrate SHOW CREATE TABLE to new resolution framework
 Key: SPARK-33414
 URL: https://issues.apache.org/jira/browse/SPARK-33414
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Terry Kim


Migrate SHOW CREATE TABLE to the new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33413) SparkUI / Executor Page doesn't work were under external reverse proxy

2020-11-10 Thread Pierre Leresteux (Jira)
Pierre Leresteux created SPARK-33413:


 Summary: SparkUI / Executor Page doesn't work were under external 
reverse proxy
 Key: SPARK-33413
 URL: https://issues.apache.org/jira/browse/SPARK-33413
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.1, 2.4.7
Reporter: Pierre Leresteux


When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
can configure SparkUI to be correctly displayed (html page and all static 
resources). But in the executors page, when calling the /api/, the URL don't 
use the Spark proxyBase configuration.

 

Here is an example :

I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
use this Spark config : 
spark.ui.proxyBase /sparkui
spark.ui.proxyRedirectUri /
using this config, when calling [http://acme.com/sparkui] I'll be redirected to 
[http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all static 
files)

 

But I go on [http://acme.com/sparkui/executors/] a XHR call is made using this 
URL [http://acme.com/api/v1/applications] so it doesn't use the reverse proxy 
rule ... the correct URL should be : 
[http://acme.com/sparkui/api/v1/applications] 

 

I discover on the code (here for example 
[https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
 that SparkUI use the location.origin (in my case 
[http://acme.com|http://acme.com%29./]). I think it should use the complete URL 
(under the last part like /executors/) instead or receive the proxyBase 
configuration to add it after the location.origin ... 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33413) SparkUI / Executor Page doesn't work where under external reverse proxy

2020-11-10 Thread Pierre Leresteux (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Leresteux updated SPARK-33413:
-
Summary: SparkUI / Executor Page doesn't work where under external reverse 
proxy  (was: SparkUI / Executor Page doesn't work were under external reverse 
proxy)

> SparkUI / Executor Page doesn't work where under external reverse proxy
> ---
>
> Key: SPARK-33413
> URL: https://issues.apache.org/jira/browse/SPARK-33413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Pierre Leresteux
>Priority: Minor
>
> When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
> can configure SparkUI to be correctly displayed (html page and all static 
> resources). But in the executors page, when calling the /api/, the URL don't 
> use the Spark proxyBase configuration.
>  
> Here is an example :
> I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
> redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
> use this Spark config : 
> {code:java}
> spark.ui.proxyBase /sparkui
> spark.ui.proxyRedirectUri /{code}
> using this config, when calling [http://acme.com/sparkui] I'll be redirected 
> to [http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all 
> static files)
>  
> But I go on [http://acme.com/sparkui/executors/] a XHR call is made using 
> this URL [http://acme.com/api/v1/applications] so it doesn't use the reverse 
> proxy rule ... the correct URL should be : 
> [http://acme.com/sparkui/api/v1/applications] 
>  
> I discover on the code (here for example 
> [https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
>  that SparkUI use the location.origin (in my case 
> [http://acme.com|http://acme.com%29./]). I think it should use the complete 
> URL (under the last part like /executors/) instead or receive the proxyBase 
> configuration to add it after the location.origin ... 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33413) SparkUI / Executor Page doesn't work were under external reverse proxy

2020-11-10 Thread Pierre Leresteux (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Leresteux updated SPARK-33413:
-
Description: 
When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
can configure SparkUI to be correctly displayed (html page and all static 
resources). But in the executors page, when calling the /api/, the URL don't 
use the Spark proxyBase configuration.

 

Here is an example :

I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
use this Spark config : 


{code:java}
spark.ui.proxyBase /sparkui
spark.ui.proxyRedirectUri /{code}
using this config, when calling [http://acme.com/sparkui] I'll be redirected to 
[http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all static 
files)

 

But I go on [http://acme.com/sparkui/executors/] a XHR call is made using this 
URL [http://acme.com/api/v1/applications] so it doesn't use the reverse proxy 
rule ... the correct URL should be : 
[http://acme.com/sparkui/api/v1/applications] 

 

I discover on the code (here for example 
[https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
 that SparkUI use the location.origin (in my case 
[http://acme.com|http://acme.com%29./]). I think it should use the complete URL 
(under the last part like /executors/) instead or receive the proxyBase 
configuration to add it after the location.origin ... 

 

 

  was:
When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
can configure SparkUI to be correctly displayed (html page and all static 
resources). But in the executors page, when calling the /api/, the URL don't 
use the Spark proxyBase configuration.

 

Here is an example :

I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
use this Spark config : 
{{}}
{code:java}
spark.ui.proxyBase /sparkui
spark.ui.proxyRedirectUri /{code}
using this config, when calling [http://acme.com/sparkui] I'll be redirected to 
[http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all static 
files)

 

But I go on [http://acme.com/sparkui/executors/] a XHR call is made using this 
URL [http://acme.com/api/v1/applications] so it doesn't use the reverse proxy 
rule ... the correct URL should be : 
[http://acme.com/sparkui/api/v1/applications] 

 

I discover on the code (here for example 
[https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
 that SparkUI use the location.origin (in my case 
[http://acme.com|http://acme.com%29./]). I think it should use the complete URL 
(under the last part like /executors/) instead or receive the proxyBase 
configuration to add it after the location.origin ... 

 

 


> SparkUI / Executor Page doesn't work were under external reverse proxy
> --
>
> Key: SPARK-33413
> URL: https://issues.apache.org/jira/browse/SPARK-33413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Pierre Leresteux
>Priority: Minor
>
> When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
> can configure SparkUI to be correctly displayed (html page and all static 
> resources). But in the executors page, when calling the /api/, the URL don't 
> use the Spark proxyBase configuration.
>  
> Here is an example :
> I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
> redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
> use this Spark config : 
> {code:java}
> spark.ui.proxyBase /sparkui
> spark.ui.proxyRedirectUri /{code}
> using this config, when calling [http://acme.com/sparkui] I'll be redirected 
> to [http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all 
> static files)
>  
> But I go on [http://acme.com/sparkui/executors/] a XHR call is made using 
> this URL [http://acme.com/api/v1/applications] so it doesn't use the reverse 
> proxy rule ... the correct URL should be : 
> [http://acme.com/sparkui/api/v1/applications] 
>  
> I discover on the code (here for example 
> [https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
>  that SparkUI use the location.origin (in my case 
> [http://acme.com|http://acme.com%29./]). I think it should use the complete 
> URL (under the last part like /executors/) instead or receive the proxyBase 
> configuration to add it after the location.origin ... 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

--

[jira] [Updated] (SPARK-33413) SparkUI / Executor Page doesn't work were under external reverse proxy

2020-11-10 Thread Pierre Leresteux (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Leresteux updated SPARK-33413:
-
Description: 
When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
can configure SparkUI to be correctly displayed (html page and all static 
resources). But in the executors page, when calling the /api/, the URL don't 
use the Spark proxyBase configuration.

 

Here is an example :

I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
use this Spark config : 
{{}}
{code:java}
spark.ui.proxyBase /sparkui
spark.ui.proxyRedirectUri /{code}
using this config, when calling [http://acme.com/sparkui] I'll be redirected to 
[http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all static 
files)

 

But I go on [http://acme.com/sparkui/executors/] a XHR call is made using this 
URL [http://acme.com/api/v1/applications] so it doesn't use the reverse proxy 
rule ... the correct URL should be : 
[http://acme.com/sparkui/api/v1/applications] 

 

I discover on the code (here for example 
[https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
 that SparkUI use the location.origin (in my case 
[http://acme.com|http://acme.com%29./]). I think it should use the complete URL 
(under the last part like /executors/) instead or receive the proxyBase 
configuration to add it after the location.origin ... 

 

 

  was:
When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
can configure SparkUI to be correctly displayed (html page and all static 
resources). But in the executors page, when calling the /api/, the URL don't 
use the Spark proxyBase configuration.

 

Here is an example :

I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
use this Spark config : 
spark.ui.proxyBase /sparkui
spark.ui.proxyRedirectUri /
using this config, when calling [http://acme.com/sparkui] I'll be redirected to 
[http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all static 
files)

 

But I go on [http://acme.com/sparkui/executors/] a XHR call is made using this 
URL [http://acme.com/api/v1/applications] so it doesn't use the reverse proxy 
rule ... the correct URL should be : 
[http://acme.com/sparkui/api/v1/applications] 

 

I discover on the code (here for example 
[https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
 that SparkUI use the location.origin (in my case 
[http://acme.com|http://acme.com%29./]). I think it should use the complete URL 
(under the last part like /executors/) instead or receive the proxyBase 
configuration to add it after the location.origin ... 

 

 


> SparkUI / Executor Page doesn't work were under external reverse proxy
> --
>
> Key: SPARK-33413
> URL: https://issues.apache.org/jira/browse/SPARK-33413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Pierre Leresteux
>Priority: Minor
>
> When using SparkUI under an externel reverse proxy (NGinx, Træfik, ... ), you 
> can configure SparkUI to be correctly displayed (html page and all static 
> resources). But in the executors page, when calling the /api/, the URL don't 
> use the Spark proxyBase configuration.
>  
> Here is an example :
> I use Træfik to access to the SparkUI using the path : /sparkui (a proxy 
> redirect is made when I call [http://acme.com/sparkui] to the SparkUI) and I 
> use this Spark config : 
> {{}}
> {code:java}
> spark.ui.proxyBase /sparkui
> spark.ui.proxyRedirectUri /{code}
> using this config, when calling [http://acme.com/sparkui] I'll be redirected 
> to [http://acme.com/sparkui/jobs/] and all is currently serve (HTML and all 
> static files)
>  
> But I go on [http://acme.com/sparkui/executors/] a XHR call is made using 
> this URL [http://acme.com/api/v1/applications] so it doesn't use the reverse 
> proxy rule ... the correct URL should be : 
> [http://acme.com/sparkui/api/v1/applications] 
>  
> I discover on the code (here for example 
> [https://github.com/apache/spark/blob/branch-3.0/core/src/main/resources/org/apache/spark/ui/static/utils.js#L193)]
>  that SparkUI use the location.origin (in my case 
> [http://acme.com|http://acme.com%29./]). I think it should use the complete 
> URL (under the last part like /executors/) instead or receive the proxyBase 
> configuration to add it after the location.origin ... 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---

[jira] [Updated] (SPARK-33412) OverwriteByExpression should resolve its delete condition based on the table relation not the input query

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-33412:

Summary: OverwriteByExpression should resolve its delete condition based on 
the table relation not the input query  (was: update 
OverwriteByExpression.deleteExpr after resolving v2 write commands)

> OverwriteByExpression should resolve its delete condition based on the table 
> relation not the input query
> -
>
> Key: SPARK-33412
> URL: https://issues.apache.org/jira/browse/SPARK-33412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-33374) [CORE] Remove unnecessary python path from spark home

2020-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-33374.
-

> [CORE] Remove unnecessary python path from spark home
> -
>
> Key: SPARK-33374
> URL: https://issues.apache.org/jira/browse/SPARK-33374
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> Currently, spark-submit will upload pyspark.zip and py4j-0.10.9-src.zip into 
> the staging folder, and both files will be added to PYTHONPATH. So it's 
> unnecessary to also add the duplicate files from the current Spark home folder 
> on the local machine. 
> Output of `sys.path` as below:
> 'D:\\data\\yarnnm\\local\\usercache\\z\\appcache\\application_1603546638930_150736\\container_e1148_1603546638930_150736_01_02\\pyspark.zip',
> 'D:\\data\\yarnnm\\local\\usercache\\z\\appcache\\application_1603546638930_150736\\container_e1148_1603546638930_150736_01_02\\py4j-0.10.7-src.zip',
> 'D:\\data\\spark.latest\\python\\lib\\pyspark.zip',
> 'D:\\data\\spark.latest\\python\\lib\\py4j-0.10.7-src.zip',



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33374) [CORE] Remove unnecessary python path from spark home

2020-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33374.
---
Resolution: Invalid

> [CORE] Remove unnecessary python path from spark home
> -
>
> Key: SPARK-33374
> URL: https://issues.apache.org/jira/browse/SPARK-33374
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> Currently, spark-submit will upload pyspark.zip and py4j-0.10.9-src.zip into 
> the staging folder, and both files will be added to PYTHONPATH. So it's 
> unnecessary to also add the duplicate files from the current Spark home folder 
> on the local machine. 
> Output of `sys.path` as below:
> 'D:\\data\\yarnnm\\local\\usercache\\z\\appcache\\application_1603546638930_150736\\container_e1148_1603546638930_150736_01_02\\pyspark.zip',
> 'D:\\data\\yarnnm\\local\\usercache\\z\\appcache\\application_1603546638930_150736\\container_e1148_1603546638930_150736_01_02\\py4j-0.10.7-src.zip',
> 'D:\\data\\spark.latest\\python\\lib\\pyspark.zip',
> 'D:\\data\\spark.latest\\python\\lib\\py4j-0.10.7-src.zip',



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33402) SparkHadoopWriter to pass job UUID down in spark.sql.sources.writeJobUUID

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229344#comment-17229344
 ] 

Apache Spark commented on SPARK-33402:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/30319

> SparkHadoopWriter to pass job UUID down in spark.sql.sources.writeJobUUID
> -
>
> Key: SPARK-33402
> URL: https://issues.apache.org/jira/browse/SPARK-33402
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.1.0
>Reporter: Steve Loughran
>Priority: Major
>
> SPARK-33230 restored setting a unique job ID in a Spark SQL job writing 
> through the Hadoop output formatters, but saving files from an RDD doesn't get 
> one, because SparkHadoopWriter doesn't insert the UUID.
> Proposed: set the same property
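
For context, the property in question is shown below. This PySpark snippet is only an illustration of the key a committer would read (it uses the internal {{_jsc}} handle and sets the value by hand); the actual proposal is for SparkHadoopWriter to set it per job, as the SQL write path already does.

{code:python}
import uuid

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Publish a per-job UUID under the key mentioned in the summary so that a
# committer could look it up; purely illustrative of the user-visible effect.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()  # internal py4j handle
hadoop_conf.set("spark.sql.sources.writeJobUUID", str(uuid.uuid4()))
print(hadoop_conf.get("spark.sql.sources.writeJobUUID"))
{code}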



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33402) SparkHadoopWriter to pass job UUID down in spark.sql.sources.writeJobUUID

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229343#comment-17229343
 ] 

Apache Spark commented on SPARK-33402:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/30319

> SparkHadoopWriter to pass job UUID down in spark.sql.sources.writeJobUUID
> -
>
> Key: SPARK-33402
> URL: https://issues.apache.org/jira/browse/SPARK-33402
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.1.0
>Reporter: Steve Loughran
>Priority: Major
>
> SPARK-33230 restored setting a unique job ID in a Spark SQL job writing 
> through the Hadoop output formatters, but saving files from an RDD doesn't get 
> one, because SparkHadoopWriter doesn't insert the UUID.
> Proposed: set the same property



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33402) SparkHadoopWriter to pass job UUID down in spark.sql.sources.writeJobUUID

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33402:


Assignee: Apache Spark

> SparkHadoopWriter to pass job UUID down in spark.sql.sources.writeJobUUID
> -
>
> Key: SPARK-33402
> URL: https://issues.apache.org/jira/browse/SPARK-33402
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.1.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-33230 restored setting a unique job ID in a Spark SQL job writing 
> through the Hadoop output formatters, but saving files from an RDD doesn't get 
> one, because SparkHadoopWriter doesn't insert the UUID.
> Proposed: set the same property



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33402) SparkHadoopWriter to pass job UUID down in spark.sql.sources.writeJobUUID

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33402:


Assignee: (was: Apache Spark)

> SparkHadoopWriter to pass job UUID down in spark.sql.sources.writeJobUUID
> -
>
> Key: SPARK-33402
> URL: https://issues.apache.org/jira/browse/SPARK-33402
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.1.0
>Reporter: Steve Loughran
>Priority: Major
>
> SPARK-33230 restored setting a unique job ID in a Spark SQL job writing 
> through the Hadoop output formatters, but saving files from an RDD doesn't get 
> one, because SparkHadoopWriter doesn't insert the UUID.
> Proposed: set the same property



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33254:


Assignee: (was: Apache Spark)

> Migration to NumPy documentation style in Core (pyspark.*, 
> pyspark.resource.*, etc.)
> 
>
> Key: SPARK-33254
> URL: https://issues.apache.org/jira/browse/SPARK-33254
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in Core 
> (pyspark.\*, pyspark.resource.\*, etc.). Please also see the parent JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33254:


Assignee: Apache Spark

> Migration to NumPy documentation style in Core (pyspark.*, 
> pyspark.resource.*, etc.)
> 
>
> Key: SPARK-33254
> URL: https://issues.apache.org/jira/browse/SPARK-33254
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in Core 
> (pyspark.\*, pyspark.resource.\*, etc.). Please also see the parent JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229342#comment-17229342
 ] 

Apache Spark commented on SPARK-33254:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30320

> Migration to NumPy documentation style in Core (pyspark.*, 
> pyspark.resource.*, etc.)
> 
>
> Key: SPARK-33254
> URL: https://issues.apache.org/jira/browse/SPARK-33254
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in Core 
> (pyspark.\*, pyspark.resource.\*, etc.). Please also see the parent JIRA.
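
As a quick reference for the target format, here is what a NumPy-style docstring looks like on a generic helper function (a made-up example, not taken from the PySpark codebase):

{code:python}
def clamp(value, lower, upper):
    """Restrict ``value`` to the closed interval [lower, upper].

    Parameters
    ----------
    value : float
        Input value to clamp.
    lower : float
        Minimum allowed value.
    upper : float
        Maximum allowed value.

    Returns
    -------
    float
        ``value`` limited to the given bounds.

    Examples
    --------
    >>> clamp(5.0, 0.0, 1.0)
    1.0
    """
    return max(lower, min(value, upper))
{code}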



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33401) Vector type column is not possible to create using spark SQL

2020-11-10 Thread Pavlo Borshchenko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229328#comment-17229328
 ] 

Pavlo Borshchenko commented on SPARK-33401:
---

[~maropu] it is a bug from the point of view of not being able to create a 
vector type column from SQL.

> Vector type column is not possible to create using spark SQL
> 
>
> Key: SPARK-33401
> URL: https://issues.apache.org/jira/browse/SPARK-33401
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pavlo Borshchenko
>Priority: Major
>
>  
> Created table with vector type column:
> {code:java}
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.mllib.linalg.VectorUDT
> import org.apache.spark.mllib.linalg.Vectors
> case class Test(features: Vector)
> Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
>  .write
>  .mode("overwrite")
>  .saveAsTable("pborshchenko.test_vector_spark_0911_1")
> {code}
>  
> Show the create table statement for this created table:
> {code:java}
> spark.sql("SHOW CREATE TABLE pborshchenko.test_vector_spark_0911_1"){code}
> Got:
> {code:java}
> CREATE TABLE `pborshchenko`.`test_vector_spark_0911_1` (
>  `features` STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, 
> `values`: ARRAY<DOUBLE>>)
> USING parquet{code}
> Create the same table with index 2 at the end:
> {code:java}
> spark.sql("CREATE TABLE `pborshchenko`.`test_vector_spark_0911_2` 
> (\n`features` STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY, 
> `values`: ARRAY>)\nUSING parquet"){code}
> Try to insert new values to the table created from SQL:
>  
> {code:java}
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.mllib.linalg.VectorUDT
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.sql.SaveMode
> case class Test(features: Vector)
> Seq(Test(Vectors.dense(Array(1d, 2d, 3d)))).toDF()
>  .write
>  .mode(SaveMode.Append)
>  .insertInto("pborshchenko.test_vector_spark_0911_2")
> {code}
>  
> Got:
>  
> {code:java}
>  AnalysisException: Cannot write incompatible data to table 
> '`pborshchenko`.`test_vector_spark_0911_2`': - Cannot write 'features': 
> struct,values:array> is 
> incompatible with 
> struct,values:array>;  - 
> Cannot write 'features': 
> struct,values:array> is 
> incompatible with 
> struct,values:array>; at 
> org.apache.spark.sql.catalyst.analysis.TableOutputResolver$.resolveOutputColumns(TableOutputResolver.scala:72)
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:467)
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:494)
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:486)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:112)
> {code}
>  
> The reason is that the table created from Spark SQL has the type STRUCT, not 
> vector, even though this struct is the right representation for the vector type.
> AC: it should be possible to create a table using Spark SQL with a vector type 
> column and after that write to it without any errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33404) "date_trunc" expression returns incorrect results

2020-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-33404:
-
Fix Version/s: (was: 3.1.0)

> "date_trunc" expression returns incorrect results
> -
>
> Key: SPARK-33404
> URL: https://issues.apache.org/jira/browse/SPARK-33404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Utkarsh Agarwal
>Priority: Major
>  Labels: correctness
>
> The `date_trunc` SQL expression returns incorrect results for the {{minute}} 
> format string.
> Context: the {{minute}} format string should truncate the timestamps such 
> that the seconds are set to ZERO.
> Repro (run the following commands in spark-shell):
> {quote}
> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
> {quote}
> Spark currently incorrectly returns 
> {quote}
> 1769-10-17 17:10:02
> {quote}
> against the expected return value of 
> {quote}
> 1769-10-17 17:10:00
> {quote}
> This happens as {{truncTimestamp}} in package 
> {{org.apache.spark.sql.catalyst.util.DateTimeUtils}} incorrectly assumes that 
> time zone offsets can never have the granularity of a second and thus does 
> not account for time zone adjustment when truncating the timestamp to 
> {{minute}}. 
> This assumption is currently used when truncating the timestamps to 
> {{microsecond, millisecond, second, or minute}}. 
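
The second-granularity time zone offset mentioned above is easy to observe outside Spark; the sketch below (plain Python with the standard {{zoneinfo}} module, Python 3.9+, independent of Spark's own code path) shows both the unusual offset for this 1769 date and the expected result of truncating to the minute:

{code:python}
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

la = ZoneInfo("America/Los_Angeles")
ts = datetime(1769, 10, 17, 17, 10, 2, tzinfo=la)

# Pre-standardisation dates use local mean time, so the UTC offset itself has
# second-level granularity (roughly -07:52:58 for Los Angeles).
print(ts.utcoffset())

# Truncating to the minute should simply zero out the seconds in local time:
print(ts.replace(second=0, microsecond=0))  # 1769-10-17 17:10:00-07:52:58
{code}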



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33376) Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33376.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30284
[https://github.com/apache/spark/pull/30284]

> Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader
> ---
>
> Key: SPARK-33376
> URL: https://issues.apache.org/jira/browse/SPARK-33376
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, when initializing {{IsolatedClientLoader}}, people can specify 
> whether to share Hadoop classes from Spark or not. In the latter case it is 
> supposed to only load the Hadoop classes from the Hive jars themselves.
> However, this feature is currently used in two cases: 1) unit tests, 2) when 
> the Hadoop version defined in Maven can not be found when 
> {{spark.sql.hive.metastore.jars == "maven"}}. Also, when 
> {{sharesHadoopClasses}} is false, it isn't really only using Hadoop classes 
> from the Hive jars: Spark also downloads the {{hadoop-client}} jar and puts it 
> together with the Hive jars, and the Hadoop version used by {{hadoop-client}} 
> is the same version used by Spark itself. This could potentially cause issues 
> because we are mixing two versions of Hadoop jars in the classpath.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33302) Failed to push down filters through Expand

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33302:
---

Assignee: angerszhu

> Failed to push down filters through Expand
> --
>
> Key: SPARK-33302
> URL: https://issues.apache.org/jira/browse/SPARK-33302
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) 
> using parquet;
> create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet;
> SELECT
>years,
>appversion,   
>SUM(uusers) AS users  
> FROM   (SELECT
>Date_trunc('year', dt)  AS years,
>CASE  
>  WHEN h.pid = 3 THEN 'iOS'   
>  WHEN h.pid = 4 THEN 'Android'   
>  ELSE 'Other'
>END AS viewport,  
>h.vsAS appversion,
>Count(DISTINCT u.uid)   AS uusers
>,Count(DISTINCT u.suid) AS srcusers
> FROM   SPARK_33302_1 u   
>join SPARK_33302_2 h  
>  ON h.uid = u.uid
> GROUP  BY 1, 
>   2, 
>   3) AS a
> WHERE  viewport = 'iOS'  
> GROUP  BY 1, 
>   2
> {code}
> {noformat}
> == Physical Plan ==
> *(5) HashAggregate(keys=[years#30, appversion#32], 
> functions=[sum(uusers#33L)])
> +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
>+- *(4) HashAggregate(keys=[years#30, appversion#32], 
> functions=[partial_sum(uusers#33L)])
>   +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) 
> u.`uid`#47 else null)])
>  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
> +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 
> 1)) u.`uid`#47 else null)])
>+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], 
> functions=[])
>   +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` 
> AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), 
> true, [id=#241]
>  +- *(2) HashAggregate(keys=[date_trunc('year', 
> CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN 
> (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, 
> u.`suid`#48, gid#44], functions=[])
> +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' 
> WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
>+- *(2) Expand [ArrayBuffer(date_trunc(year, 
> cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS 
> WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), 
> ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE 
> WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, 
> vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, 
> CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 
> 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
>   +- *(2) Project [uid#7, dt#9, suid#10, pid#11, 
> vs#12]
>  +- *(2) BroadcastHashJoin [uid#7], [uid#13], 
> Inner, BuildRight
> :- *(2) Project [uid#7, dt#9, suid#10]
> :  +- *(2) Filter isnotnull(uid#7)
> 

[jira] [Resolved] (SPARK-33302) Failed to push down filters through Expand

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33302.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30278
[https://github.com/apache/spark/pull/30278]

> Failed to push down filters through Expand
> --
>
> Key: SPARK-33302
> URL: https://issues.apache.org/jira/browse/SPARK-33302
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.4, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> How to reproduce this issue:
> {code:sql}
> create table SPARK_33302_1(pid int, uid int, sid int, dt date, suid int) 
> using parquet;
> create table SPARK_33302_2(pid int, vs int, uid int, csid int) using parquet;
> SELECT
>years,
>appversion,   
>SUM(uusers) AS users  
> FROM   (SELECT
>Date_trunc('year', dt)  AS years,
>CASE  
>  WHEN h.pid = 3 THEN 'iOS'   
>  WHEN h.pid = 4 THEN 'Android'   
>  ELSE 'Other'
>END AS viewport,  
>h.vsAS appversion,
>Count(DISTINCT u.uid)   AS uusers
>,Count(DISTINCT u.suid) AS srcusers
> FROM   SPARK_33302_1 u   
>join SPARK_33302_2 h  
>  ON h.uid = u.uid
> GROUP  BY 1, 
>   2, 
>   3) AS a
> WHERE  viewport = 'iOS'  
> GROUP  BY 1, 
>   2
> {code}
> {noformat}
> == Physical Plan ==
> *(5) HashAggregate(keys=[years#30, appversion#32], 
> functions=[sum(uusers#33L)])
> +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
>+- *(4) HashAggregate(keys=[years#30, appversion#32], 
> functions=[partial_sum(uusers#33L)])
>   +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) 
> u.`uid`#47 else null)])
>  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
> +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 
> 1)) u.`uid`#47 else null)])
>+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS 
> TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], 
> functions=[])
>   +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` 
> AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 
> 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), 
> true, [id=#241]
>  +- *(2) HashAggregate(keys=[date_trunc('year', 
> CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN 
> (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, 
> u.`suid`#48, gid#44], functions=[])
> +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' 
> WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
>+- *(2) Expand [ArrayBuffer(date_trunc(year, 
> cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS 
> WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), 
> ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE 
> WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, 
> vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, 
> CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 
> 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
>   +- *(2) Project [uid#7, dt#9, suid#10, pid#11, 
> vs#12]
>  +- *(2) BroadcastHashJoin [uid#7], [uid#13], 
> Inner, BuildRight
>

[jira] [Assigned] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33305:
---

Assignee: Chao Sun

> DSv2: DROP TABLE command should also invalidate cache
> -
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the 
> table but doesn't invalidate all caches referencing the table. We should make 
> the behavior consistent between v1 and v2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33305) DSv2: DROP TABLE command should also invalidate cache

2020-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33305.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30211
[https://github.com/apache/spark/pull/30211]

> DSv2: DROP TABLE command should also invalidate cache
> -
>
> Key: SPARK-33305
> URL: https://issues.apache.org/jira/browse/SPARK-33305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> Different from DSv1, {{DROP TABLE}} command in DSv2 currently only drops the 
> table but doesn't invalidate all caches referencing the table. We should make 
> the behavior consistent between v1 and v2.
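
A small sketch of the user-visible expectation after the fix (the catalog and table names are placeholders and assume a configured v2 catalog; this illustrates the intended behaviour, not the patch itself):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CACHE TABLE testcat.ns.t")  # assumes `testcat` is a configured v2 catalog
spark.sql("DROP TABLE testcat.ns.t")   # with the fix, cached plans referencing the
                                       # table should be invalidated too, as in DSv1
{code}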



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33412) update OverwriteByExpression.deleteExpr after resolving v2 write commands

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229192#comment-17229192
 ] 

Apache Spark commented on SPARK-33412:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/30318

> update OverwriteByExpression.deleteExpr after resolving v2 write commands
> -
>
> Key: SPARK-33412
> URL: https://issues.apache.org/jira/browse/SPARK-33412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33412) update OverwriteByExpression.deleteExpr after resolving v2 write commands

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33412:


Assignee: Wenchen Fan  (was: Apache Spark)

> update OverwriteByExpression.deleteExpr after resolving v2 write commands
> -
>
> Key: SPARK-33412
> URL: https://issues.apache.org/jira/browse/SPARK-33412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33412) update OverwriteByExpression.deleteExpr after resolving v2 write commands

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229185#comment-17229185
 ] 

Apache Spark commented on SPARK-33412:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/30318

> update OverwriteByExpression.deleteExpr after resolving v2 write commands
> -
>
> Key: SPARK-33412
> URL: https://issues.apache.org/jira/browse/SPARK-33412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33412) update OverwriteByExpression.deleteExpr after resolving v2 write commands

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33412:


Assignee: Apache Spark  (was: Wenchen Fan)

> update OverwriteByExpression.deleteExpr after resolving v2 write commands
> -
>
> Key: SPARK-33412
> URL: https://issues.apache.org/jira/browse/SPARK-33412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33412) update OverwriteByExpression.deleteExpr after resolving v2 write commands

2020-11-10 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-33412:
---

 Summary: update OverwriteByExpression.deleteExpr after resolving 
v2 write commands
 Key: SPARK-33412
 URL: https://issues.apache.org/jira/browse/SPARK-33412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33372) Fix InSet bucket pruning

2020-11-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33372:

Fix Version/s: 2.4.8

> Fix InSet bucket pruning
> 
>
> Key: SPARK-33372
> URL: https://issues.apache.org/jira/browse/SPARK-33372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> This PR fixes InSet bucket pruning, because its values are not guaranteed to be Literal:
> https://github.com/apache/spark/blob/cbd3fdea62dab73fc4a96702de8fd1f07722da66/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L255



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33409) Spark job cannot be killed in BroadcastNestedLoopJoin

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33409:


Assignee: (was: Apache Spark)

> Spark job cannot be killed in BroadcastNestedLoopJoin
> --
>
> Key: SPARK-33409
> URL: https://issues.apache.org/jira/browse/SPARK-33409
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: zhou xiang
>Priority: Major
>
>  
> If we kill a Spark job in the Spark web UI, the task context will be marked 
> interrupted, as the code below shows:
> {code:java}
> /**
>  * Kills a task by setting the interrupted flag to true. This relies on the 
> upper level Spark
>  * code and user code to properly handle the flag. This function should be 
> idempotent so it can
>  * be called multiple times.
>  * If interruptThread is true, we will also call Thread.interrupt() on the 
> Task's executor thread.
>  */
> def kill(interruptThread: Boolean, reason: String): Unit = {
>   require(reason != null)
>   _reasonIfKilled = reason
>   if (context != null) {
> context.markInterrupted(reason)
>   }
>   if (interruptThread && taskThread != null) {
> taskThread.interrupt()
>   }
> }{code}
>  
> And Spark will check the interrupt flag during the loop to stop it, like this:
> {code:java}
>  /**
>  * :: DeveloperApi ::
>  * An iterator that wraps around an existing iterator to provide task killing 
> functionality.
>  * It works by checking the interrupted flag in [[TaskContext]].
>  */
> @DeveloperApi
> class InterruptibleIterator[+T](val context: TaskContext, val delegate: 
> Iterator[T])
>   extends Iterator[T] {
>   def hasNext: Boolean = {
> // TODO(aarondav/rxin): Check Thread.interrupted instead of 
> context.interrupted if interrupt
> // is allowed. The assumption is that Thread.interrupted does not have a 
> memory fence in read
> // (just a volatile field in C), while context.interrupted is a volatile 
> in the JVM, which
> // introduces an expensive read fence.
> context.killTaskIfInterrupted()
> delegate.hasNext
>   }
>   def next(): T = delegate.next()
> }{code}
> In my case, there is a "not in" in my Spark SQL, which leads to a 
> "BroadcastNestedLoopJoin".
> The related code is below:
> {code:java}
> private def leftExistenceJoin(
> relation: Broadcast[Array[InternalRow]],
> exists: Boolean): RDD[InternalRow] = {
>   assert(buildSide == BuildRight)
>   streamed.execute().mapPartitionsInternal { streamedIter =>
> val buildRows = relation.value
> val joinedRow = new JoinedRow
> if (condition.isDefined) {
>   streamedIter.filter(l =>
> buildRows.exists(r => boundCondition(joinedRow(l, r))) == exists
>   )
> } else if (buildRows.nonEmpty == exists) {
>   streamedIter
> } else {
>   Iterator.empty
> }
>   }
> }{code}
> The "streamedIter" and "buildRows" both have millions of records,  the 
> executor get stuck in the join loop, I found something wrong in my sql and 
> try to kill the job, but the executor thread is not interrupted. I have to 
> restart the executor to stop it.
> I think we should also do this check: " context.killTaskIfInterrupted() "  in 
> BoradcastNestedLoopJoin to support real cancel.
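
To make the proposal concrete, here is a minimal, self-contained sketch of the technique being asked for (plain Scala, no Spark classes; every name below is hypothetical): the long probing loop checks an interrupted flag on every outer row, so a kill request can actually stop the computation.

{code:java}
// Minimal, self-contained sketch (plain Scala, no Spark classes; every name is
// hypothetical): a join-style loop that checks an "interrupted" flag per outer
// row, so a kill request can actually stop it instead of spinning forever.
import java.util.concurrent.atomic.AtomicBoolean

object KillableLoopSketch {
  final class TaskKilledException(reason: String) extends RuntimeException(reason)

  // Shaped like leftExistenceJoin above: filter the streamed side by probing buildRows.
  def leftExistenceJoinLike(
      streamed: Iterator[Int],
      buildRows: Array[Int],
      killed: AtomicBoolean): Iterator[Int] =
    streamed.filter { l =>
      // The kind of check being proposed for BroadcastNestedLoopJoin:
      if (killed.get()) throw new TaskKilledException("task killed by user")
      buildRows.contains(l)  // stands in for boundCondition(joinedRow(l, r))
    }

  def main(args: Array[String]): Unit = {
    val killed = new AtomicBoolean(false)
    val it = leftExistenceJoinLike(Iterator.range(0, 1000000), Array.range(0, 100), killed)
    println(it.next())   // 0: proceeds normally while the flag is false
    killed.set(true)     // simulate the kill request from the web UI
    try { it.next(); () }
    catch { case e: TaskKilledException => println(s"stopped: ${e.getMessage}") }
  }
}
{code}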



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33409) Spark job cannot be killed in BroadcastNestedLoopJoin

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229168#comment-17229168
 ] 

Apache Spark commented on SPARK-33409:
--

User 'constzhou' has created a pull request for this issue:
https://github.com/apache/spark/pull/30317

> Spark job cannot be killed in BroadcastNestedLoopJoin
> --
>
> Key: SPARK-33409
> URL: https://issues.apache.org/jira/browse/SPARK-33409
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: zhou xiang
>Priority: Major
>
>  
> If we kill a Spark job in the Spark web UI, the task context will be marked 
> interrupted, as the code below shows:
> {code:java}
> /**
>  * Kills a task by setting the interrupted flag to true. This relies on the 
> upper level Spark
>  * code and user code to properly handle the flag. This function should be 
> idempotent so it can
>  * be called multiple times.
>  * If interruptThread is true, we will also call Thread.interrupt() on the 
> Task's executor thread.
>  */
> def kill(interruptThread: Boolean, reason: String): Unit = {
>   require(reason != null)
>   _reasonIfKilled = reason
>   if (context != null) {
> context.markInterrupted(reason)
>   }
>   if (interruptThread && taskThread != null) {
> taskThread.interrupt()
>   }
> }{code}
>  
> And Spark will check the interrupt flag during the loop to stop it, like this:
> {code:java}
>  /**
>  * :: DeveloperApi ::
>  * An iterator that wraps around an existing iterator to provide task killing 
> functionality.
>  * It works by checking the interrupted flag in [[TaskContext]].
>  */
> @DeveloperApi
> class InterruptibleIterator[+T](val context: TaskContext, val delegate: 
> Iterator[T])
>   extends Iterator[T] {
>   def hasNext: Boolean = {
> // TODO(aarondav/rxin): Check Thread.interrupted instead of 
> context.interrupted if interrupt
> // is allowed. The assumption is that Thread.interrupted does not have a 
> memory fence in read
> // (just a volatile field in C), while context.interrupted is a volatile 
> in the JVM, which
> // introduces an expensive read fence.
> context.killTaskIfInterrupted()
> delegate.hasNext
>   }
>   def next(): T = delegate.next()
> }{code}
> In my case, there is a "not in" in my Spark SQL, which leads to a 
> "BroadcastNestedLoopJoin".
> The related code is below:
> {code:java}
> private def leftExistenceJoin(
> relation: Broadcast[Array[InternalRow]],
> exists: Boolean): RDD[InternalRow] = {
>   assert(buildSide == BuildRight)
>   streamed.execute().mapPartitionsInternal { streamedIter =>
> val buildRows = relation.value
> val joinedRow = new JoinedRow
> if (condition.isDefined) {
>   streamedIter.filter(l =>
> buildRows.exists(r => boundCondition(joinedRow(l, r))) == exists
>   )
> } else if (buildRows.nonEmpty == exists) {
>   streamedIter
> } else {
>   Iterator.empty
> }
>   }
> }{code}
> The "streamedIter" and "buildRows" both have millions of records,  the 
> executor get stuck in the join loop, I found something wrong in my sql and 
> try to kill the job, but the executor thread is not interrupted. I have to 
> restart the executor to stop it.
> I think we should also do this check: " context.killTaskIfInterrupted() "  in 
> BoradcastNestedLoopJoin to support real cancel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33409) Spark job cannot be killed in BroadcastNestedLoopJoin

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33409:


Assignee: Apache Spark

> Spark job cannot be killed in BroadcastNestedLoopJoin
> --
>
> Key: SPARK-33409
> URL: https://issues.apache.org/jira/browse/SPARK-33409
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: zhou xiang
>Assignee: Apache Spark
>Priority: Major
>
>  
> If we kill a Spark job in the Spark web UI, the task context will be marked 
> interrupted, as the code below shows:
> {code:java}
> /**
>  * Kills a task by setting the interrupted flag to true. This relies on the 
> upper level Spark
>  * code and user code to properly handle the flag. This function should be 
> idempotent so it can
>  * be called multiple times.
>  * If interruptThread is true, we will also call Thread.interrupt() on the 
> Task's executor thread.
>  */
> def kill(interruptThread: Boolean, reason: String): Unit = {
>   require(reason != null)
>   _reasonIfKilled = reason
>   if (context != null) {
> context.markInterrupted(reason)
>   }
>   if (interruptThread && taskThread != null) {
> taskThread.interrupt()
>   }
> }{code}
>  
> And Spark will check the interrupt flag during the loop to stop it, like this:
> {code:java}
>  /**
>  * :: DeveloperApi ::
>  * An iterator that wraps around an existing iterator to provide task killing 
> functionality.
>  * It works by checking the interrupted flag in [[TaskContext]].
>  */
> @DeveloperApi
> class InterruptibleIterator[+T](val context: TaskContext, val delegate: 
> Iterator[T])
>   extends Iterator[T] {
>   def hasNext: Boolean = {
> // TODO(aarondav/rxin): Check Thread.interrupted instead of 
> context.interrupted if interrupt
> // is allowed. The assumption is that Thread.interrupted does not have a 
> memory fence in read
> // (just a volatile field in C), while context.interrupted is a volatile 
> in the JVM, which
> // introduces an expensive read fence.
> context.killTaskIfInterrupted()
> delegate.hasNext
>   }
>   def next(): T = delegate.next()
> }{code}
> In my case, there is a "not in" in my Spark SQL, which leads to a 
> "BroadcastNestedLoopJoin".
> The related code is below:
> {code:java}
> private def leftExistenceJoin(
> relation: Broadcast[Array[InternalRow]],
> exists: Boolean): RDD[InternalRow] = {
>   assert(buildSide == BuildRight)
>   streamed.execute().mapPartitionsInternal { streamedIter =>
> val buildRows = relation.value
> val joinedRow = new JoinedRow
> if (condition.isDefined) {
>   streamedIter.filter(l =>
> buildRows.exists(r => boundCondition(joinedRow(l, r))) == exists
>   )
> } else if (buildRows.nonEmpty == exists) {
>   streamedIter
> } else {
>   Iterator.empty
> }
>   }
> }{code}
> The "streamedIter" and "buildRows" both have millions of records,  the 
> executor get stuck in the join loop, I found something wrong in my sql and 
> try to kill the job, but the executor thread is not interrupted. I have to 
> restart the executor to stop it.
> I think we should also do this check: " context.killTaskIfInterrupted() "  in 
> BoradcastNestedLoopJoin to support real cancel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33411) Cardinality estimation of union, range and sort logical operators

2020-11-10 Thread Ayushi Agarwal (Jira)
Ayushi Agarwal created SPARK-33411:
--

 Summary: Cardinality estimation of union, range and sort logical 
operators
 Key: SPARK-33411
 URL: https://issues.apache.org/jira/browse/SPARK-33411
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: Ayushi Agarwal


Support cardinality estimation for union, sort and range operators to enhance 
https://issues.apache.org/jira/browse/SPARK-16026
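
As a rough illustration of what such estimates could look like (a sketch with assumed formulas and made-up names, not Spark's Statistics code): union can sum its children's row counts, sort preserves its child's count, and range is exact from (start, end, step).

{code:java}
// Hedged sketch with assumed formulas (object and method names are made up;
// this is not Spark's Statistics code): the obvious row-count estimates.
object CardinalityEstimateSketch {
  // Union: an upper bound is simply the sum of the children's row counts.
  def unionRowCount(children: Seq[Long]): Long = children.sum

  // Sort: reordering rows does not change how many there are.
  def sortRowCount(child: Long): Long = child

  // Range(start, end, step): the count is exact, ceil((end - start) / step),
  // clamped to zero when the step points away from the end.
  def rangeRowCount(start: Long, end: Long, step: Long): Long = {
    require(step != 0, "step must not be zero")
    val diff = end - start
    if (diff == 0 || (diff > 0) != (step > 0)) 0L
    else (diff + step + (if (step > 0) -1 else 1)) / step
  }

  def main(args: Array[String]): Unit = {
    println(unionRowCount(Seq(100L, 250L)))  // 350
    println(sortRowCount(42L))               // 42
    println(rangeRowCount(0L, 10L, 3L))      // 4 -> 0, 3, 6, 9
    println(rangeRowCount(9L, 0L, -3L))      // 3 -> 9, 6, 3
  }
}
{code}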



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33330) Catalyst StringType converter unable to convert enum type

2020-11-10 Thread Miguel Duarte (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miguel Duarte updated SPARK-33330:
--
Description: 
Given that:
 # JavaTypeInference maps [Enums to 
StringType|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L135]
 
 # The converter for [StringType is 
StringConverter|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L64]
 # StringConverter is unable to convert [Enums to 
UTF8String|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L294],
 failing with an InvalidArgumentException

 

It can be argued that CatalystTypeConverters should align with the 
expectations set by JavaTypeInference and convert enums to their string 
representation. 

 

Edit:

Added PRs for 3.0.X and 2.4.X branches

  was:
Given that:
 # JavaTypeInference maps [Enums to 
StringType|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L135]
 
 # The converter for [StringType is 
StringConverter|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L64]
 # StringConverter is unable to convert [Enums to 
UTF8String|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L294],
 failing with an InvalidArgumentException

 

It can be argued that CatalystTypeConverters should align with the 
expectations set by JavaTypeInference and convert enums to their string 
representation. 


> Catalyst StringType converter unable to convert enum type
> -
>
> Key: SPARK-33330
> URL: https://issues.apache.org/jira/browse/SPARK-33330
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miguel Duarte
>Priority: Major
>
> Given that:
>  # JavaTypeInference maps [Enums to 
> StringType|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L135]
>  
>  # The converter for [StringType is 
> StringConverter|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L64]
>  # StringConverter is unable to convert [Enums to 
> UTF8String|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L294],
>  failing with an InvalidArgumentException
>  
> It can be argued that CatalystTypeConverters should align with the 
> expectations set by JavaTypeInference and convert enums to their string 
> representation. 
>  
> Edit:
> Added PRs for 3.0.X and 2.4.X branches
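
To make the mismatch tangible, here is a self-contained sketch (plain Scala; the converter functions are hypothetical stand-ins, not CatalystTypeConverters itself): a strict string converter rejects enum values, while the alignment argued for above falls back to the enum's name.

{code:java}
// Self-contained sketch; strictToCatalystString / lenientToCatalystString are
// hypothetical stand-ins for the behaviour described in the ticket.
object EnumConverterSketch {
  // Mimics a strict converter: only String input is accepted.
  def strictToCatalystString(value: Any): String = value match {
    case s: String => s
    case other => throw new IllegalArgumentException(
      s"cannot convert ${other.getClass.getName} to a string column")
  }

  // The alignment the ticket argues for: fall back to the enum's string form.
  def lenientToCatalystString(value: Any): String = value match {
    case s: String => s
    case e: Enum[_] => e.name()  // Java enums, which the schema maps to a string column
    case other => other.toString
  }

  def main(args: Array[String]): Unit = {
    val enumValue = java.util.concurrent.TimeUnit.SECONDS
    println(lenientToCatalystString(enumValue))  // SECONDS
    try strictToCatalystString(enumValue)
    catch { case e: IllegalArgumentException => println(s"strict converter: ${e.getMessage}") }
  }
}
{code}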



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33330) Catalyst StringType converter unable to convert enum type

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229148#comment-17229148
 ] 

Apache Spark commented on SPARK-33330:
--

User 'malduarte' has created a pull request for this issue:
https://github.com/apache/spark/pull/30316

> Catalyst StringType converter unable to convert enum type
> -
>
> Key: SPARK-33330
> URL: https://issues.apache.org/jira/browse/SPARK-33330
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miguel Duarte
>Priority: Major
>
> Given that:
>  # JavaTypeInference maps [Enums to 
> StringType|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L135]
>  
>  # The converter for [StringType is 
> StringConverter|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L64]
>  # StringConverter is unable to convert [Enums to 
> UTF8String|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L294],
>  failing with an InvalidArgumentException
>  
> It can be argued that CatalystTypeConverters should align with the 
> expectations set by JavaTypeInference and convert enums to their string 
> representation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33330) Catalyst StringType converter unable to convert enum type

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229145#comment-17229145
 ] 

Apache Spark commented on SPARK-33330:
--

User 'malduarte' has created a pull request for this issue:
https://github.com/apache/spark/pull/30316

> Catalyst StringType converter unable to convert enum type
> -
>
> Key: SPARK-33330
> URL: https://issues.apache.org/jira/browse/SPARK-33330
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Miguel Duarte
>Priority: Major
>
> Given that:
>  # JavaTypeInference maps [Enums to 
> StringType|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L135]
>  
>  # The converter for [StringType is 
> StringConverter|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L64]
>  # StringConverter is unable to convert [Enums to 
> UTF8String|https://github.com/apache/spark/blob/55105a0784459331d5506eee9f37c2e655a2a6a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L294],
>  failing with an InvalidArgumentException
>  
> It can be argued that CatalystTypeConverters should align with the 
> expectations set by JavaTypeInference and convert enums to their string 
> representation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33388) Merge In and InSet predicate

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229146#comment-17229146
 ] 

Apache Spark commented on SPARK-33388:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30315

> Merge In and InSet predicate
> 
>
> Key: SPARK-33388
> URL: https://issues.apache.org/jira/browse/SPARK-33388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Maybe we should create a base class for {{In}} and {{InSet}}, so that the two 
> classes differ only in the expression tree, while eval and codegen stay the 
> same.
> [https://github.com/apache/spark/pull/28269#issuecomment-655365714]
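
A minimal sketch of the idea (hypothetical names, plain Scala rather than Catalyst expressions): a shared base owns the membership test that eval and codegen would share, so the two concrete forms only differ in how they carry their values.

{code:java}
// Hedged sketch (hypothetical names, plain Scala, not Catalyst expressions):
// a common base owns the membership test that eval/codegen would share, so the
// two concrete forms only differ in how they carry their values.
object InBaseSketch {
  sealed trait BaseInLike {
    def valueSet: Set[Any]                          // the shared evaluation input
    final def contains(v: Any): Boolean = valueSet.contains(v)
  }
  // Mirrors In(expr, Seq[Expression]): values arrive as a list.
  final case class InLike(list: Seq[Any]) extends BaseInLike {
    lazy val valueSet: Set[Any] = list.toSet
  }
  // Mirrors InSet(expr, Set[Any]): values arrive pre-collected in a set.
  final case class InSetLike(hset: Set[Any]) extends BaseInLike {
    def valueSet: Set[Any] = hset
  }

  def main(args: Array[String]): Unit = {
    val preds: Seq[BaseInLike] = Seq(InLike(Seq(1, 2, 3)), InSetLike(Set(1, 2, 3)))
    println(preds.map(_.contains(2)))               // List(true, true): one shared code path
  }
}
{code}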



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33388) Merge In and InSet predicate

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33388:


Assignee: Apache Spark

> Merge In and InSet predicate
> 
>
> Key: SPARK-33388
> URL: https://issues.apache.org/jira/browse/SPARK-33388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Maybe we should create a base class for {{In}} and {{InSet}}, so that the two 
> classes differ only in the expression tree, while eval and codegen stay the 
> same.
> [https://github.com/apache/spark/pull/28269#issuecomment-655365714]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33388) Merge In and InSet predicate

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33388:


Assignee: (was: Apache Spark)

> Merge In and InSet predicate
> 
>
> Key: SPARK-33388
> URL: https://issues.apache.org/jira/browse/SPARK-33388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Maybe we should create a base class for {{In}} and {{InSet}}, so that the two 
> classes differ only in the expression tree, while eval and codegen stay the 
> same.
> [https://github.com/apache/spark/pull/28269#issuecomment-655365714]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33388) Merge In and InSet predicate

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229144#comment-17229144
 ] 

Apache Spark commented on SPARK-33388:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30315

> Merge In and InSet predicate
> 
>
> Key: SPARK-33388
> URL: https://issues.apache.org/jira/browse/SPARK-33388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Maybe we should create a base class for {{In}} and {{InSet}}, so that the two 
> classes differ only in the expression tree, while eval and codegen stay the 
> same.
> [https://github.com/apache/spark/pull/28269#issuecomment-655365714]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33410) Resolve SQL query reference a column by an alias

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229127#comment-17229127
 ] 

Apache Spark commented on SPARK-33410:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30314

> Resolve SQL query reference a column by an alias
> 
>
> Key: SPARK-33410
> URL: https://issues.apache.org/jira/browse/SPARK-33410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This PR adds support for resolving a SQL query that references a column by an 
> alias, for example:
> ```sql
> select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
> ```
> Teradata supports this feature: 
> https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ
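
As a rough illustration only (a hypothetical helper doing naive textual substitution, not Catalyst analysis): resolving a reference to an earlier alias amounts to substituting that alias's expression before normal column resolution, so new_id + 1 effectively becomes (id + 1) + 1.

{code:java}
// Rough illustration only (hypothetical helper, no Catalyst classes): resolving
// a reference to an earlier alias amounts to substituting that alias's
// expression text before normal column resolution.
object LateralAliasSketch {
  // projections: ordered (alias, expression-text) pairs from one SELECT list.
  def substitute(projections: Seq[(String, String)]): Seq[(String, String)] =
    projections.foldLeft(Vector.empty[(String, String)]) { case (done, (alias, expr)) =>
      // Replace references to any alias defined earlier in the same list (naive text match).
      val expanded = done.foldLeft(expr) { case (e, (a, ex)) =>
        e.replaceAll(s"\\b$a\\b", s"($ex)")
      }
      done :+ (alias -> expanded)
    }

  def main(args: Array[String]): Unit = {
    println(substitute(Seq("new_id" -> "id + 1", "new_new_id" -> "new_id + 1")))
    // Vector((new_id,id + 1), (new_new_id,(id + 1) + 1))
  }
}
{code}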



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33410) Resolve SQL query reference a column by an alias

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229123#comment-17229123
 ] 

Apache Spark commented on SPARK-33410:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30314

> Resolve SQL query reference a column by an alias
> 
>
> Key: SPARK-33410
> URL: https://issues.apache.org/jira/browse/SPARK-33410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This PR adds support for resolving a SQL query that references a column by an 
> alias, for example:
> ```sql
> select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
> ```
> Teradata supports this feature: 
> https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33410) Resolve SQL query reference a column by an alias

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33410:


Assignee: Apache Spark

> Resolve SQL query reference a column by an alias
> 
>
> Key: SPARK-33410
> URL: https://issues.apache.org/jira/browse/SPARK-33410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> This PR adds support for resolving a SQL query that references a column by an 
> alias, for example:
> ```sql
> select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
> ```
> Teradata supports this feature: 
> https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33410) Resolve SQL query reference a column by an alias

2020-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33410:


Assignee: (was: Apache Spark)

> Resolve SQL query reference a column by an alias
> 
>
> Key: SPARK-33410
> URL: https://issues.apache.org/jira/browse/SPARK-33410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This PR adds support for resolving a SQL query that references a column by an 
> alias, for example:
> ```sql
> select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
> ```
> Teradata supports this feature: 
> https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33410) Resolve SQL query reference a column by an alias

2020-11-10 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33410:

Description: 
This PR adds support for resolving a SQL query that references a column by an 
alias, for example:
```sql
select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
```

Teradata supports this feature: 
https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ

> Resolve SQL query reference a column by an alias
> 
>
> Key: SPARK-33410
> URL: https://issues.apache.org/jira/browse/SPARK-33410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This PR adds support for resolving a SQL query that references a column by an 
> alias, for example:
> ```sql
> select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
> ```
> Teradata supports this feature: 
> https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33339) PySpark application will hang due to a non-Exception error

2020-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33339:


Assignee: lrz

> PySpark application will hang due to a non-Exception error
> --
>
> Key: SPARK-33339
> URL: https://issues.apache.org/jira/browse/SPARK-33339
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0, 3.0.1
>Reporter: lrz
>Assignee: lrz
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> When a SystemExit exception occurs during processing, the Python worker exits 
> abnormally, but the executor task keeps waiting to read from the worker's 
> socket, causing it to hang.
> The SystemExit may be caused by a bug in the user's code, but Spark should at 
> least throw an error to inform the user instead of getting stuck.
> We can run a simple test to reproduce this case:
> ```
> from pyspark.sql import SparkSession
> def err(line):
>   raise SystemExit
> spark = SparkSession.builder.appName("test").getOrCreate()
> spark.sparkContext.parallelize(range(1,2), 2).map(err).collect()
> spark.stop()
> ``` 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33339) PySpark application will hang due to a non-Exception error

2020-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33339.
--
Fix Version/s: 3.1.0
   3.0.2
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/30248

> PySpark application will hang due to a non-Exception error
> --
>
> Key: SPARK-33339
> URL: https://issues.apache.org/jira/browse/SPARK-33339
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0, 3.0.1
>Reporter: lrz
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> When a SystemExit exception occurs during processing, the Python worker exits 
> abnormally, but the executor task keeps waiting to read from the worker's 
> socket, causing it to hang.
> The SystemExit may be caused by a bug in the user's code, but Spark should at 
> least throw an error to inform the user instead of getting stuck.
> We can run a simple test to reproduce this case:
> ```
> from pyspark.sql import SparkSession
> def err(line):
>   raise SystemExit
> spark = SparkSession.builder.appName("test").getOrCreate()
> spark.sparkContext.parallelize(range(1,2), 2).map(err).collect()
> spark.stop()
> ``` 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33410) Resolve SQL query reference a column by an alias

2020-11-10 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33410:
---

 Summary: Resolve SQL query reference a column by an alias
 Key: SPARK-33410
 URL: https://issues.apache.org/jira/browse/SPARK-33410
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33251) Migration to NumPy documentation style in ML (pyspark.ml.*)

2020-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229116#comment-17229116
 ] 

Apache Spark commented on SPARK-33251:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30313

> Migration to NumPy documentation style in ML (pyspark.ml.*)
> ---
>
> Key: SPARK-33251
> URL: https://issues.apache.org/jira/browse/SPARK-33251
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
>  This JIRA targets migrating to NumPy documentation style in ML 
> (pyspark.ml.*). Please also see the parent JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org