[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size

2021-03-24 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34843:

Description: 
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride = (upperBound / numPartitions.toFloat - lowerBound / 
numPartitions.toFloat).toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing strides of 33, you'll end up at 
2017-07-08. This is over 3 years of extra data that will go into the last 
partition, and depending on the shape of the data could cause a very long 
running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / 1000F) - (-15611 / 1000F)).toLong = 34

This would put the upper bound at 2020-04-02, which is much closer to the 
original supplied upper bound. This is the best you can do to get as close as 
possible to the upper bound (without adjusting the number of partitions). For 
example, a stride size of 35 would go well past the supplied upper bound (over 
2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 
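For reference, here is the arithmetic above as a small self-contained sketch (the values are hard-coded from this example and the names are illustrative; this is not the actual JDBCRelation code):

{code:java}
object StrideExample extends App {
  // Bounds from the example above, expressed as days since the epoch.
  val lowerBound = -15611L // 1927-04-05
  val upperBound = 18563L  // 2020-10-27
  val numPartitions = 1000L

  // Current formula: both divisions truncate before the subtraction.
  val currentStride = upperBound / numPartitions - lowerBound / numPartitions

  // Proposed formula: divide in floating point, truncate once at the end.
  val proposedStride =
    (upperBound / numPartitions.toFloat - lowerBound / numPartitions.toFloat).toLong

  // Value reached from the lower bound after numPartitions - 1 strides of the given size.
  def afterStrides(stride: Long): Long = lowerBound + (numPartitions - 1) * stride

  println(s"current stride  = $currentStride, ends at day ${afterStrides(currentStride)}")   // 33, day 17356 (~mid 2017)
  println(s"proposed stride = $proposedStride, ends at day ${afterStrides(proposedStride)}") // 34, day 18355 (~April 2020)
}
{code}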

  was:
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride = (upperBound / numPartitions.toFloat - lowerBound / 
numPartitions.toFloat).toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (currently, it will actually do 2017-07-08 due to adding the first 
stride into the lower partition). This is over 3 years of extra data that will 
go into the last partition, and depending on the shape of the data could cause 
a very long running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / 1000F) - (-15611 / 1000F)).toLong = 34

This would put the upper bound at 2020-02-28 (currently, it will actually do 
2020-04-02 due to adding the first stride into the lower partition), which is 
much closer to the original supplied upper bound. This is the best you can do 
to get as close as possible to the upper bound (without adjusting the number of 
partitions). For example, a stride size of 35 would go well past the supplied 
upper bound (over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 


> JDBCRelation columnPartition function improperly determines stride size
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I 

[jira] [Comment Edited] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size

2021-03-24 Thread Jason Yarbrough (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308343#comment-17308343
 ] 

Jason Yarbrough edited comment on SPARK-34843 at 3/25/21, 4:21 AM:
---

I've modified the formula slightly (reflected in the description). I've also 
added logic to determine the extra amount per stride that is lost, take that, 
cut it in half, and then add that many strides to the starting value of the 
first partition. That will more evenly distribute the first and last 
partitions, and bring the middle of the partitions closer to the midpoint of 
the lower and upper bounds.
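A rough sketch of that centering adjustment (the variable names and rounding choices here are my own illustration, not necessarily what the patch does):

{code:java}
object CenteredStrideExample extends App {
  val lowerBound = -15611L
  val upperBound = 18563L
  val numPartitions = 1000

  // Stride from the formula in the description (single truncation at the end).
  val preciseStride =
    upperBound / numPartitions.toDouble - lowerBound / numPartitions.toDouble
  val stride = preciseStride.toLong

  // Total amount lost to truncation across all partitions, expressed in whole strides.
  val lostStrides = (((preciseStride - stride) * numPartitions) / stride).toLong

  // Shift the start of the first partition by half of the lost strides, which
  // centers the partitions between the supplied bounds.
  val adjustedStart = lowerBound + (lostStrides / 2) * stride

  println(s"stride = $stride, lost strides = $lostStrides, adjusted start = $adjustedStart")
}
{code}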

I'll get a PR in soon. Need to check the other unit tests and make sure nothing 
regressed.


was (Author: hanover-fiste):
I've modified the formula slightly (reflected in the description). I've also 
added logic to determine the extra amount per stride that is lost, take that, 
cut it in half (round down), and then add that many strides to the starting 
value of the first partition. That will more evenly distribute the first and 
last partitions, and bring the middle of the partitions closer to the midpoint 
of the lower and upper bounds.

I'll get a PR in soon. Need to check the unit tests and make sure nothing 
regressed.

> JDBCRelation columnPartition function improperly determines stride size
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing 998 strides of 33, you end up at 
> 2017-06-05 (currently, it will actually do 2017-07-08 due to adding the first 
> stride into the lower partition). This is over 3 years of extra data that 
> will go into the last partition, and depending on the shape of the data could 
> cause a very long running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-02-28 (currently, it will actually do 
> 2020-04-02 due to adding the first stride into the lower partition), which is 
> much closer to the original supplied upper bound. This is the best you can do 
> to get as close as possible to the upper bound (without adjusting the number 
> of partitions). For example, a stride size of 35 would go well past the 
> supplied upper bound (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 






[jira] [Commented] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size

2021-03-24 Thread Jason Yarbrough (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308343#comment-17308343
 ] 

Jason Yarbrough commented on SPARK-34843:
-

I've modified the formula slightly (reflected in the description). I've also 
added logic to determine the extra amount per stride that is lost, take that, 
cut it in half (round down), and then add that many strides to the starting 
value of the first partition. That will more evenly distribute the first and 
last partitions, and bring the middle of the partitions closer to the midpoint 
of the lower and upper bounds.

I'll get a PR in soon. Need to check the unit tests and make sure nothing 
regressed.

> JDBCRelation columnPartition function improperly determines stride size
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing 998 strides of 33, you end up at 
> 2017-06-05 (currently, it will actually do 2017-07-08 due to adding the first 
> stride into the lower partition). This is over 3 years of extra data that 
> will go into the last partition, and depending on the shape of the data could 
> cause a very long running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-02-28 (currently, it will actually do 
> 2020-04-02 due to adding the first stride into the lower partition), which is 
> much closer to the original supplied upper bound. This is the best you can do 
> to get as close as possible to the upper bound (without adjusting the number 
> of partitions). For example, a stride size of 35 would go well past the 
> supplied upper bound (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 






[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size

2021-03-24 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34843:

Description: 
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride = (upperBound / numPartitions.toFloat - lowerBound / 
numPartitions.toFloat).toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (currently, it will actually do 2017-07-08 due to adding the first 
stride into the lower partition). This is over 3 years of extra data that will 
go into the last partition, and depending on the shape of the data could cause 
a very long running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / 1000F) - (-15611 / 1000F)).toLong = 34

This would put the upper bound at 2020-02-28 (currently, it will actually do 
2020-04-02 due to adding the first stride into the lower partition), which is 
much closer to the original supplied upper bound. This is the best you can do 
to get as close as possible to the upper bound (without adjusting the number of 
partitions). For example, a stride size of 35 would go well past the supplied 
upper bound (over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 

  was:
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride: Long = ((upperBound / (numPartitions - 2.0)) - (lowerBound / 
(numPartitions - 2.0))).floor.toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (currently, it will actually do 2017-07-08 due to adding the first 
stride into the lower partition). This is over 3 years of extra data that will 
go into the last partition, and depending on the shape of the data could cause 
a very long running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / (1000L - 2.0)) - (-15611 / (1000L - 2.0))).floor.toLong = 34

This would put the upper bound at 2020-02-28 (currently, it will actually do 
2020-04-02 due to adding the first stride into the lower partition), which is 
much closer to the original supplied upper bound. This is the best you can do 
to get as close as possible to the upper bound (without adjusting the number of 
partitions). For example, a stride size of 35 would go well past the supplied 
upper bound (over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 


> JDBCRelation columnPartition function improperly determines stride size
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / 

[jira] [Updated] (SPARK-34822) Update plan stability golden files even if only explain differs

2021-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34822:
-
Fix Version/s: (was: 3.2.0)

> Update plan stability golden files even if only explain differs
> ---
>
> Key: SPARK-34822
> URL: https://issues.apache.org/jira/browse/SPARK-34822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
>
> PlanStabilitySuite updates the golden files only if simplified.txt has 
> changed. In some situations only explain.txt will change and the golden files 
> are not updated.






[jira] [Commented] (SPARK-34822) Update plan stability golden files even if only explain differs

2021-03-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308324#comment-17308324
 ] 

Hyukjin Kwon commented on SPARK-34822:
--

Reverted at 
https://github.com/apache/spark/commit/7838f55ca795ca222541de7bc3cb065205718957

> Update plan stability golden files even if only explain differs
> ---
>
> Key: SPARK-34822
> URL: https://issues.apache.org/jira/browse/SPARK-34822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
>
> PlanStabilitySuite updates the golden files only if simplified.txt has 
> changed. In some situations only explain.txt will change and the golden files 
> are not updated.






[jira] [Updated] (SPARK-34855) SparkContext - avoid using local lazy val

2021-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34855:
-
Issue Type: Bug  (was: Improvement)

> SparkContext - avoid using local lazy val
> -
>
> Key: SPARK-34855
> URL: https://issues.apache.org/jira/browse/SPARK-34855
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Li Xian
>Priority: Minor
> Attachments: Screen Shot 2021-03-24 at 5.41.22 PM.png
>
>
> `org.apache.spark.SparkContext#getCallSite` uses a local lazy val for 
> `callsite`. But in Scala 2.11, a local lazy val needs synchronization on the 
> containing object `this` (see 
> [https://docs.scala-lang.org/sips/improved-lazy-val-initialization.html#version-6---no-synchronization-on-this-and-concurrent-initialization-of-fields]
>  and [https://github.com/scala/scala-dev/issues/133] ).
> `getCallSite` is called at job submission, and thus will be a bottleneck 
> if we are submitting a large number of jobs on a single Spark session. We 
> observed threads blocked on this in our load test.
> !image-2021-03-24-17-42-50-412.png!
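To illustrate the issue described above, a minimal standalone sketch (illustrative only, not the actual SparkContext code):

{code:java}
class LocalLazyValSketch {
  // Under Scala 2.11, initializing a *local* lazy val synchronizes on the
  // containing object (`this`), so concurrent callers of this method can block
  // each other even though the lazy val is method-local.
  def getCallSiteWithLocalLazy(): String = {
    lazy val callSite = Thread.currentThread().getStackTrace.mkString("\n")
    callSite
  }

  // Computing the value directly (or otherwise avoiding the local lazy val)
  // removes that synchronization point.
  def getCallSiteEager(): String =
    Thread.currentThread().getStackTrace.mkString("\n")
}
{code}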






[jira] [Created] (SPARK-34860) Multinomial Logistic Regression with intercept support centering

2021-03-24 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-34860:


 Summary: Multinomial Logistic Regression with intercept support 
centering
 Key: SPARK-34860
 URL: https://issues.apache.org/jira/browse/SPARK-34860
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng









[jira] [Assigned] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34857:


Assignee: (was: Apache Spark)

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Tim Armstrong
>Priority: Minor
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?






[jira] [Assigned] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34857:


Assignee: Apache Spark

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Tim Armstrong
>Assignee: Apache Spark
>Priority: Minor
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?






[jira] [Commented] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308313#comment-17308313
 ] 

Apache Spark commented on SPARK-34857:
--

User 'timarmstrong' has created a pull request for this issue:
https://github.com/apache/spark/pull/31956

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Tim Armstrong
>Priority: Minor
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?






[jira] [Assigned] (SPARK-34488) Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34488:


Assignee: angerszhu  (was: Ron Hu)

> Support task Metrics Distributions and executor Metrics Distributions in the 
> REST API call for a specified stage
> 
>
> Key: SPARK-34488
> URL: https://issues.apache.org/jira/browse/SPARK-34488
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ron Hu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: executorMetricsDistributions.json, 
> taskMetricsDistributions.json
>
>
> For a specific stage, it is useful to show the task metrics in percentile 
> distribution.  This information can help users know whether or not there is a 
> skew/bottleneck among tasks in a given stage.  We list an example in 
> [^taskMetricsDistributions.json]
> Similarly, it is useful to show the executor metrics in percentile 
> distribution for a specific stage. This information can show whether or not 
> there is a skewed load on some executors.  We list an example in 
> [^executorMetricsDistributions.json]
>  
> We define withSummaries query parameter in the REST API for a specific stage 
> as:
> applications///?withSummaries=[true|false]&quantiles=0.0,0.25,0.5,0.75,1.0
>  * When withSummaries=true, both task metrics in percentile distribution and 
> executor metrics in percentile distribution are included in the REST API 
> output.  The default value of withSummaries is false, i.e. no metrics 
> percentile distribution will be included in the REST API output.
>  * Query parameter quantiles define the quantiles we use to calculate metrics 
> distributions.  It takes effect only when {{withSummaries=true.}}  Its 
> default value is {{0.0,0.25,0.5,0.75,1.0.}}  
>  
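As an illustration only, a request exercising these parameters might look like the sketch below. The /api/v1 path, host, port, application id, and stage id are placeholders/assumptions on my part, not something specified in this ticket:

{code:java}
object StageSummariesExample extends App {
  // Hypothetical endpoint: Spark's monitoring REST API for a single stage,
  // with the withSummaries and quantiles query parameters appended.
  val url = "http://localhost:18080/api/v1/applications/app-1234/stages/3" +
    "?withSummaries=true&quantiles=0.0,0.25,0.5,0.75,1.0"

  // Fetch and print the JSON, which would include the task/executor metric
  // percentile distributions when withSummaries=true.
  val json = scala.io.Source.fromURL(url).mkString
  println(json)
}
{code}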






[jira] [Assigned] (SPARK-34488) Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34488:


Assignee: angerszhu  (was: angerszhu)

> Support task Metrics Distributions and executor Metrics Distributions in the 
> REST API call for a specified stage
> 
>
> Key: SPARK-34488
> URL: https://issues.apache.org/jira/browse/SPARK-34488
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ron Hu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: executorMetricsDistributions.json, 
> taskMetricsDistributions.json
>
>
> For a specific stage, it is useful to show the task metrics in percentile 
> distribution.  This information can help users know whether or not there is a 
> skew/bottleneck among tasks in a given stage.  We list an example in 
> [^taskMetricsDistributions.json]
> Similarly, it is useful to show the executor metrics in percentile 
> distribution for a specific stage. This information can show whether or not 
> there is a skewed load on some executors.  We list an example in 
> [^executorMetricsDistributions.json]
>  
> We define withSummaries query parameter in the REST API for a specific stage 
> as:
> applications///?withSummaries=[true|false]&quantiles=0.0,0.25,0.5,0.75,1.0
>  * When withSummaries=true, both task metrics in percentile distribution and 
> executor metrics in percentile distribution are included in the REST API 
> output.  The default value of withSummaries is false, i.e. no metrics 
> percentile distribution will be included in the REST API output.
>  * Query parameter quantiles define the quantiles we use to calculate metrics 
> distributions.  It takes effect only when {{withSummaries=true.}}  Its 
> default value is {{0.0,0.25,0.5,0.75,1.0.}}  
>  






[jira] [Commented] (SPARK-34488) Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage

2021-03-24 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308305#comment-17308305
 ] 

Ron Hu commented on SPARK-34488:


Hi [~srowen], Thanks for merging this PR.  [~angerszhuuu] and I worked closely 
on this JIRA.  Angers presented the PR for this ticket.  Please put 
[~angerszhuuu] into the assignee field in this ticket.  He deserves the credit. 
 Thanks. 

> Support task Metrics Distributions and executor Metrics Distributions in the 
> REST API call for a specified stage
> 
>
> Key: SPARK-34488
> URL: https://issues.apache.org/jira/browse/SPARK-34488
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ron Hu
>Assignee: Ron Hu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: executorMetricsDistributions.json, 
> taskMetricsDistributions.json
>
>
> For a specific stage, it is useful to show the task metrics in percentile 
> distribution.  This information can help users know whether or not there is a 
> skew/bottleneck among tasks in a given stage.  We list an example in 
> [^taskMetricsDistributions.json]
> Similarly, it is useful to show the executor metrics in percentile 
> distribution for a specific stage. This information can show whether or not 
> there is a skewed load on some executors.  We list an example in 
> [^executorMetricsDistributions.json]
>  
> We define withSummaries query parameter in the REST API for a specific stage 
> as:
> applications///?withSummaries=[true|false]&quantiles=0.0,0.25,0.5,0.75,1.0
>  * When withSummaries=true, both task metrics in percentile distribution and 
> executor metrics in percentile distribution are included in the REST API 
> output.  The default value of withSummaries is false, i.e. no metrics 
> percentile distribution will be included in the REST API output.
>  * Query parameter quantiles define the quantiles we use to calculate metrics 
> distributions.  It takes effect only when {{withSummaries=true.}}  Its 
> default value is {{0.0,0.25,0.5,0.75,1.0.}}  
>  






[jira] [Commented] (SPARK-34852) Close Hive session state should use withHiveState

2021-03-24 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308295#comment-17308295
 ] 

Kent Yao commented on SPARK-34852:
--

Issue fixed by https://github.com/apache/spark/pull/31949

> Close Hive session state should use withHiveState
> -
>
> Key: SPARK-34852
> URL: https://issues.apache.org/jira/browse/SPARK-34852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> Some reasons:
> 1. The shutdown hook is invoked on a different thread.
>  2. Hive may use the metastore client again during closing.
> Otherwise, we may get an exception like the following with a custom Hive metastore version:
> {code:java}
> 21/03/24 13:26:18 INFO session.SessionState: Failed to remove classloaders 
> from DataNucleus
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1654)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:80)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:130)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:101)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3367)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3406)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3386)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.unCacheDataNucleusClassLoaders(SessionState.java:1546)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.close(SessionState.java:1536)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:172)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
> {code}






[jira] [Resolved] (SPARK-34852) Close Hive session state should use withHiveState

2021-03-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-34852.
--
Fix Version/s: 3.2.0
 Assignee: ulysses you
   Resolution: Fixed

> Close Hive session state should use withHiveState
> -
>
> Key: SPARK-34852
> URL: https://issues.apache.org/jira/browse/SPARK-34852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> Some reasons:
> 1. The shutdown hook is invoked on a different thread.
>  2. Hive may use the metastore client again during closing.
> Otherwise, we may get an exception like the following with a custom Hive metastore version:
> {code:java}
> 21/03/24 13:26:18 INFO session.SessionState: Failed to remove classloaders 
> from DataNucleus
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1654)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:80)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:130)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:101)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3367)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3406)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3386)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.unCacheDataNucleusClassLoaders(SessionState.java:1546)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.close(SessionState.java:1536)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:172)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
> {code}






[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2021-03-24 Thread Li Xian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308292#comment-17308292
 ] 

Li Xian commented on SPARK-26345:
-

[~dongjoon] sure, I have created a new issue 
https://issues.apache.org/jira/browse/SPARK-34859 

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Parquet 1.11 supports column indexing. Spark can supports this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201
>  
> Benchmark result:
> [https://github.com/apache/spark/pull/31393#issuecomment-769767724]
> This feature is enabled by default, and users can disable it by setting 
> {{parquet.filter.columnindex.enabled}} to false.






[jira] [Created] (SPARK-34859) Vectorized parquet reader needs synchronization among pages for column index

2021-03-24 Thread Li Xian (Jira)
Li Xian created SPARK-34859:
---

 Summary: Vectorized parquet reader needs synchronization among 
pages for column index
 Key: SPARK-34859
 URL: https://issues.apache.org/jira/browse/SPARK-34859
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Li Xian


The current implementation has a problem: the pages returned by 
`readNextFilteredRowGroup` may not be aligned, since some columns may have more 
rows than others.

Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
with `rowIndexes` to make sure that rows are aligned.

Currently, `VectorizedParquetRecordReader` doesn't do such synchronization among 
pages from different columns, so using `readNextFilteredRowGroup` may produce 
incorrect results.






[jira] [Commented] (SPARK-34858) Binary Logistic Regression with intercept support centering

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308272#comment-17308272
 ] 

Apache Spark commented on SPARK-34858:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31693

> Binary Logistic Regression with intercept support centering
> ---
>
> Key: SPARK-34858
> URL: https://issues.apache.org/jira/browse/SPARK-34858
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Assigned] (SPARK-34858) Binary Logistic Regression with intercept support centering

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34858:


Assignee: (was: Apache Spark)

> Binary Logistic Regression with intercept support centering
> ---
>
> Key: SPARK-34858
> URL: https://issues.apache.org/jira/browse/SPARK-34858
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Assigned] (SPARK-34858) Binary Logistic Regression with intercept support centering

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34858:


Assignee: Apache Spark

> Binary Logistic Regression with intercept support centering
> ---
>
> Key: SPARK-34858
> URL: https://issues.apache.org/jira/browse/SPARK-34858
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-34858) Binary Logistic Regression with intercept support centering

2021-03-24 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-34858:


 Summary: Binary Logistic Regression with intercept support 
centering
 Key: SPARK-34858
 URL: https://issues.apache.org/jira/browse/SPARK-34858
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng









[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308262#comment-17308262
 ] 

Chao Sun commented on SPARK-34780:
--

Sorry for the late reply [~mikechen]! There's something I'm still not quite 
clear about: when the cache is retrieved, an {{InMemoryRelation}} will be used to 
replace the plan fragment that is matched. Therefore, how can the old stale 
conf still be used in places like {{DataSourceScanExec}}?

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored with it. Then, if a different dataframe 
> replaces parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  






[jira] [Updated] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34857:
-
Affects Version/s: (was: 3.1.1)
   3.1.2
   3.2.0

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Tim Armstrong
>Priority: Minor
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?






[jira] [Commented] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308257#comment-17308257
 ] 

Takeshi Yamamuro commented on SPARK-34857:
--

Please feel free to make a PR for this issue.

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Priority: Minor
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?






[jira] [Updated] (SPARK-34829) transform_values return identical values when it's used with udf that returns reference type

2021-03-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34829:
-
Labels: corre  (was: )

> transform_values return identical values when it's used with udf that returns 
> reference type
> 
>
> Key: SPARK-34829
> URL: https://issues.apache.org/jira/browse/SPARK-34829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Pavel Chernikov
>Priority: Major
>  Labels: corre
>
> If the return value of a {{udf}} that is passed to {{transform_values}} is an 
> {{AnyRef}}, then the transformation returns identical new values for each map 
> key (to be more precise, each newly obtained value overrides the values for all 
> previously processed keys).
> Consider the following examples:
> {code:java}
> case class Bar(i: Int)
> val square = udf((b: Bar) => b.i * b.i)
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------+
> |map                           |map_square              |
> +------------------------------+------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> 1, 2 -> 4, 3 -> 9}|
> +------------------------------+------------------------+
> {code}
> vs 
> {code:java}
> case class Bar(i: Int)
> case class BarSquare(i: Int)
> val square = udf((b: Bar) => BarSquare(b.i * b.i))
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------------+
> |map                           |map_square                    |
> +------------------------------+------------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> {9}, 2 -> {9}, 3 -> {9}}|
> +------------------------------+------------------------------+
> {code}
> or even just this one
> {code:java}
> case class Foo(s: String)
> val reverse = udf((f: Foo) => f.s.reverse)
> val df = Seq(Map(1 -> Foo("abc"), 2 -> Foo("klm"), 3 -> 
> Foo("xyz"))).toDF("map")
> df.withColumn("map_reverse", transform_values(col("map"), (_, v) => 
> reverse(v))).show(truncate = false)
> +------------------------------------+------------------------------+
> |map                                 |map_reverse                   |
> +------------------------------------+------------------------------+
> |{1 -> {abc}, 2 -> {klm}, 3 -> {xyz}}|{1 -> zyx, 2 -> zyx, 3 -> zyx}|
> +------------------------------------+------------------------------+
> {code}
> After playing with 
> {{org.apache.spark.sql.catalyst.expressions.TransformValues}} it looks like 
> something wrong is happening while executing this line:
> {code:java}
> resultValues.update(i, functionForEval.eval(inputRow)){code}
> To be more precise , it's all about {{functionForEval.eval(inputRow)}} , 
> because if you do something like this:
> {code:java}
> println(s"RESULTS PRIOR TO EVALUATION - $resultValues")
> val resultValue = functionForEval.eval(inputRow)
> println(s"RESULT - $resultValue")
> println(s"RESULTS PRIOR TO UPDATE - $resultValues")
> resultValues.update(i, resultValue)
> println(s"RESULTS AFTER UPDATE - $resultValues"){code}
> You'll see in the logs, something like:
> {code:java}
> RESULTS PRIOR TO EVALUATION - [null,null,null] 
> RESULT - [0,1] 
> RESULTS PRIOR TO UPDATE - [null,null,null]
> RESULTS AFTER UPDATE - [[0,1],null,null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,1],null,null] 
> RESULT - [0,4]
> RESULTS PRIOR TO UPDATE - [[0,4],null,null] 
> RESULTS  AFTER UPDATE - [[0,4],[0,4],null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,4],[0,4],null] 
> RESULT - [0,9]
> RESULTS PRIOR TO UPDATE - [[0,9],[0,9],null]
> RESULTS  AFTER UPDATE - [[0,9],[0,9],[0,9]]
> {code}
>  






[jira] [Updated] (SPARK-34829) transform_values return identical values when it's used with udf that returns reference type

2021-03-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34829:
-
Affects Version/s: 3.2.0

> transform_values return identical values when it's used with udf that returns 
> reference type
> 
>
> Key: SPARK-34829
> URL: https://issues.apache.org/jira/browse/SPARK-34829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Pavel Chernikov
>Priority: Major
>  Labels: correctness
>
> If the return value of a {{udf}} that is passed to {{transform_values}} is an 
> {{AnyRef}}, then the transformation returns identical new values for each map 
> key (to be more precise, each newly obtained value overrides the values for all 
> previously processed keys).
> Consider the following examples:
> {code:java}
> case class Bar(i: Int)
> val square = udf((b: Bar) => b.i * b.i)
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------+
> |map                           |map_square              |
> +------------------------------+------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> 1, 2 -> 4, 3 -> 9}|
> +------------------------------+------------------------+
> {code}
> vs 
> {code:java}
> case class Bar(i: Int)
> case class BarSquare(i: Int)
> val square = udf((b: Bar) => BarSquare(b.i * b.i))
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------------+
> |map                           |map_square                    |
> +------------------------------+------------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> {9}, 2 -> {9}, 3 -> {9}}|
> +------------------------------+------------------------------+
> {code}
> or even just this one
> {code:java}
> case class Foo(s: String)
> val reverse = udf((f: Foo) => f.s.reverse)
> val df = Seq(Map(1 -> Foo("abc"), 2 -> Foo("klm"), 3 -> 
> Foo("xyz"))).toDF("map")
> df.withColumn("map_reverse", transform_values(col("map"), (_, v) => 
> reverse(v))).show(truncate = false)
> +------------------------------------+------------------------------+
> |map                                 |map_reverse                   |
> +------------------------------------+------------------------------+
> |{1 -> {abc}, 2 -> {klm}, 3 -> {xyz}}|{1 -> zyx, 2 -> zyx, 3 -> zyx}|
> +------------------------------------+------------------------------+
> {code}
> After playing with 
> {{org.apache.spark.sql.catalyst.expressions.TransformValues}} it looks like 
> something wrong is happening while executing this line:
> {code:java}
> resultValues.update(i, functionForEval.eval(inputRow)){code}
> To be more precise, it's all about {{functionForEval.eval(inputRow)}}, 
> because if you do something like this:
> {code:java}
> println(s"RESULTS PRIOR TO EVALUATION - $resultValues")
> val resultValue = functionForEval.eval(inputRow)
> println(s"RESULT - $resultValue")
> println(s"RESULTS PRIOR TO UPDATE - $resultValues")
> resultValues.update(i, resultValue)
> println(s"RESULTS AFTER UPDATE - $resultValues"){code}
> You'll see in the logs, something like:
> {code:java}
> RESULTS PRIOR TO EVALUATION - [null,null,null] 
> RESULT - [0,1] 
> RESULTS PRIOR TO UPDATE - [null,null,null]
> RESULTS AFTER UPDATE - [[0,1],null,null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,1],null,null] 
> RESULT - [0,4]
> RESULTS PRIOR TO UPDATE - [[0,4],null,null] 
> RESULTS  AFTER UPDATE - [[0,4],[0,4],null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,4],[0,4],null] 
> RESULT - [0,9]
> RESULTS PRIOR TO UPDATE - [[0,9],[0,9],null]
> RESULTS  AFTER UPDATE - [[0,9],[0,9],[0,9]]
> {code}
>  






[jira] [Updated] (SPARK-34829) transform_values return identical values when it's used with udf that returns reference type

2021-03-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34829:
-
Labels:   (was: corre)

> transform_values return identical values when it's used with udf that returns 
> reference type
> 
>
> Key: SPARK-34829
> URL: https://issues.apache.org/jira/browse/SPARK-34829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Pavel Chernikov
>Priority: Major
>
> If the return value of a {{udf}} that is passed to {{transform_values}} is an 
> {{AnyRef}}, then the transformation returns identical new values for each map 
> key (to be more precise, each newly obtained value overwrites the values for 
> all previously processed keys).
> Consider the following examples:
> {code:java}
> case class Bar(i: Int)
> val square = udf((b: Bar) => b.i * b.i)
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------+
> |map                           |map_square              |
> +------------------------------+------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> 1, 2 -> 4, 3 -> 9}|
> +------------------------------+------------------------+
> {code}
> vs 
> {code:java}
> case class Bar(i: Int)
> case class BarSquare(i: Int)
> val square = udf((b: Bar) => BarSquare(b.i * b.i))
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------------+
> |map                           |map_square                    |
> +------------------------------+------------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> {9}, 2 -> {9}, 3 -> {9}}|
> +------------------------------+------------------------------+
> {code}
> or even just this one
> {code:java}
> case class Foo(s: String)
> val reverse = udf((f: Foo) => f.s.reverse)
> val df = Seq(Map(1 -> Foo("abc"), 2 -> Foo("klm"), 3 -> 
> Foo("xyz"))).toDF("map")
> df.withColumn("map_reverse", transform_values(col("map"), (_, v) => 
> reverse(v))).show(truncate = false)
> +------------------------------------+------------------------------+
> |map                                 |map_reverse                   |
> +------------------------------------+------------------------------+
> |{1 -> {abc}, 2 -> {klm}, 3 -> {xyz}}|{1 -> zyx, 2 -> zyx, 3 -> zyx}|
> +------------------------------------+------------------------------+
> {code}
> After playing with 
> {{org.apache.spark.sql.catalyst.expressions.TransformValues}}, it looks like 
> something is going wrong while executing this line:
> {code:java}
> resultValues.update(i, functionForEval.eval(inputRow)){code}
> To be more precise, it's all about {{functionForEval.eval(inputRow)}}, 
> because if you do something like this:
> {code:java}
> println(s"RESULTS PRIOR TO EVALUATION - $resultValues")
> val resultValue = functionForEval.eval(inputRow)
> println(s"RESULT - $resultValue")
> println(s"RESULTS PRIOR TO UPDATE - $resultValues")
> resultValues.update(i, resultValue)
> println(s"RESULTS AFTER UPDATE - $resultValues"){code}
> You'll see something like this in the logs:
> {code:java}
> RESULTS PRIOR TO EVALUATION - [null,null,null] 
> RESULT - [0,1] 
> RESULTS PRIOR TO UPDATE - [null,null,null]
> RESULTS AFTER UPDATE - [[0,1],null,null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,1],null,null] 
> RESULT - [0,4]
> RESULTS PRIOR TO UPDATE - [[0,4],null,null] 
> RESULTS AFTER UPDATE - [[0,4],[0,4],null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,4],[0,4],null] 
> RESULT - [0,9]
> RESULTS PRIOR TO UPDATE - [[0,9],[0,9],null]
> RESULTS AFTER UPDATE - [[0,9],[0,9],[0,9]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34829) transform_values return identical values when it's used with udf that returns reference type

2021-03-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34829:
-
Labels: correctness  (was: )

> transform_values return identical values when it's used with udf that returns 
> reference type
> 
>
> Key: SPARK-34829
> URL: https://issues.apache.org/jira/browse/SPARK-34829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Pavel Chernikov
>Priority: Major
>  Labels: correctness
>
> If the return value of a {{udf}} that is passed to {{transform_values}} is an 
> {{AnyRef}}, then the transformation returns identical new values for each map 
> key (to be more precise, each newly obtained value overwrites the values for 
> all previously processed keys).
> Consider the following examples:
> {code:java}
> case class Bar(i: Int)
> val square = udf((b: Bar) => b.i * b.i)
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------+
> |map                           |map_square              |
> +------------------------------+------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> 1, 2 -> 4, 3 -> 9}|
> +------------------------------+------------------------+
> {code}
> vs 
> {code:java}
> case class Bar(i: Int)
> case class BarSquare(i: Int)
> val square = udf((b: Bar) => BarSquare(b.i * b.i))
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------------+
> |map                           |map_square                    |
> +------------------------------+------------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> {9}, 2 -> {9}, 3 -> {9}}|
> +------------------------------+------------------------------+
> {code}
> or even just this one
> {code:java}
> case class Foo(s: String)
> val reverse = udf((f: Foo) => f.s.reverse)
> val df = Seq(Map(1 -> Foo("abc"), 2 -> Foo("klm"), 3 -> 
> Foo("xyz"))).toDF("map")
> df.withColumn("map_reverse", transform_values(col("map"), (_, v) => 
> reverse(v))).show(truncate = false)
> +------------------------------------+------------------------------+
> |map                                 |map_reverse                   |
> +------------------------------------+------------------------------+
> |{1 -> {abc}, 2 -> {klm}, 3 -> {xyz}}|{1 -> zyx, 2 -> zyx, 3 -> zyx}|
> +------------------------------------+------------------------------+
> {code}
> After playing with 
> {{org.apache.spark.sql.catalyst.expressions.TransformValues}}, it looks like 
> something is going wrong while executing this line:
> {code:java}
> resultValues.update(i, functionForEval.eval(inputRow)){code}
> To be more precise, it's all about {{functionForEval.eval(inputRow)}}, 
> because if you do something like this:
> {code:java}
> println(s"RESULTS PRIOR TO EVALUATION - $resultValues")
> val resultValue = functionForEval.eval(inputRow)
> println(s"RESULT - $resultValue")
> println(s"RESULTS PRIOR TO UPDATE - $resultValues")
> resultValues.update(i, resultValue)
> println(s"RESULTS AFTER UPDATE - $resultValues"){code}
> You'll see something like this in the logs:
> {code:java}
> RESULTS PRIOR TO EVALUATION - [null,null,null] 
> RESULT - [0,1] 
> RESULTS PRIOR TO UPDATE - [null,null,null]
> RESULTS AFTER UPDATE - [[0,1],null,null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,1],null,null] 
> RESULT - [0,4]
> RESULTS PRIOR TO UPDATE - [[0,4],null,null] 
> RESULTS AFTER UPDATE - [[0,4],[0,4],null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,4],[0,4],null] 
> RESULT - [0,9]
> RESULTS PRIOR TO UPDATE - [[0,9],[0,9],null]
> RESULTS AFTER UPDATE - [[0,9],[0,9],[0,9]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34844) JDBCRelation columnPartition function includes the first stride in the lower partition

2021-03-24 Thread Jason Yarbrough (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308249#comment-17308249
 ] 

Jason Yarbrough edited comment on SPARK-34844 at 3/25/21, 12:19 AM:


As I coded up the unit tests and checked the effect this change has on other 
unit tests, it made me take a step back and reconsider which behavior really is 
more expected. My guess is that many people are not treating the bounds like a 
breakpoint, but instead they set the lower bound to the bottom of their data 
(or close to it), and not so much to the lowest percentile. I've also looked 
through a few people's questions on stack overflow to get some form of 
confirmation for this.

I'm going to set this to "Not A Problem" for now and will re-open if it makes 
sense after more testing.


was (Author: hanover-fiste):
As I coded up the unit tests and checked the effect this change has on other 
unit tests, it made me take a step back and reconsider which behavior really is 
more expected. My guess is that many people are not treating the bounds like a 
breakpoint, but instead they set the lower bound to the bottom of their data 
(or close to it), and not so much to the lowest percentile. I've also looked 
through some people's questions on stack overflow to get a some form of 
confirmation for this.

I'm going to set this to "Not A Problem" for now and will re-open if it makes 
sense after more testing.

> JDBCRelation columnPartition function includes the first stride in the lower 
> partition
> --
>
> Key: SPARK-34844
> URL: https://issues.apache.org/jira/browse/SPARK-34844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
>
> Currently, columnPartition in JDBCRelation contains logic that adds the first 
> stride into the lower partition. Because of this, the lower bound isn't used 
> as the ceiling for the lower partition.
> For example, say we have data 0-10, 10 partitions, and the lowerBound is set 
> to 1. The lower/first partition should contain anything < 1. However, in the 
> current implementation, it would include anything < 2.
> A possible easy fix would be changing the following code on line 132:
> currentValue += stride
> To:
> if (i != 0) currentValue += stride
> Or include currentValue += stride within the if statement on line 131... 
> although this creates a pretty bad looking side-effect.
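
To make the reported behavior concrete, here is a small standalone sketch (not the
actual JDBCRelation code; the column name and helper are illustrative) of the
partition predicates produced with and without the proposed guard, for data 0-10,
10 partitions, lowerBound = 1 and upperBound = 10 (so stride = 1):
{code:java}
val lowerBound = 1L
val upperBound = 10L
val numPartitions = 10
val stride = upperBound / numPartitions - lowerBound / numPartitions   // = 1

def predicates(skipFirstStride: Boolean): Seq[String] = {
  var currentValue = lowerBound
  (0 until numPartitions).map { i =>
    val lower = if (i == 0) None else Some(s"col >= $currentValue")
    // Proposed change: do not add the first stride into the lowest partition.
    if (!skipFirstStride || i != 0) currentValue += stride
    val upper = if (i == numPartitions - 1) None else Some(s"col < $currentValue")
    (lower ++ upper).mkString(" AND ")
  }
}

println(predicates(skipFirstStride = false).head)   // col < 2  (current behavior)
println(predicates(skipFirstStride = true).head)    // col < 1  (lower bound as ceiling)
{code}
Whether "< 1" or "< 2" is the more expected behavior for the lowest partition is
exactly the judgment call discussed in the comment above.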



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34844) JDBCRelation columnPartition function includes the first stride in the lower partition

2021-03-24 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough resolved SPARK-34844.
-
Resolution: Not A Problem

> JDBCRelation columnPartition function includes the first stride in the lower 
> partition
> --
>
> Key: SPARK-34844
> URL: https://issues.apache.org/jira/browse/SPARK-34844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
>
> Currently, columnPartition in JDBCRelation contains logic that adds the first 
> stride into the lower partition. Because of this, the lower bound isn't used 
> as the ceiling for the lower partition.
> For example, say we have data 0-10, 10 partitions, and the lowerBound is set 
> to 1. The lower/first partition should contain anything < 1. However, in the 
> current implementation, it would include anything < 2.
> A possible easy fix would be changing the following code on line 132:
> currentValue += stride
> To:
> if (i != 0) currentValue += stride
> Or include currentValue += stride within the if statement on line 131... 
> although this creates a pretty bad looking side-effect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34844) JDBCRelation columnPartition function includes the first stride in the lower partition

2021-03-24 Thread Jason Yarbrough (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308249#comment-17308249
 ] 

Jason Yarbrough commented on SPARK-34844:
-

As I coded up the unit tests and checked the effect this change has on other 
unit tests, it made me take a step back and reconsider which behavior really is 
more expected. My guess is that many people are not treating the bounds like a 
breakpoint, but instead they set the lower bound to the bottom of their data 
(or close to it), and not so much to the lowest percentile. I've also looked 
through some people's questions on stack overflow to get some form of 
confirmation for this.

I'm going to set this to "Not A Problem" for now and will re-open if it makes 
sense after more testing.

> JDBCRelation columnPartition function includes the first stride in the lower 
> partition
> --
>
> Key: SPARK-34844
> URL: https://issues.apache.org/jira/browse/SPARK-34844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
>
> Currently, columnPartition in JDBCRelation contains logic that adds the first 
> stride into the lower partition. Because of this, the lower bound isn't used 
> as the ceiling for the lower partition.
> For example, say we have data 0-10, 10 partitions, and the lowerBound is set 
> to 1. The lower/first partition should contain anything < 1. However, in the 
> current implementation, it would include anything < 2.
> A possible easy fix would be changing the following code on line 132:
> currentValue += stride
> To:
> if (i != 0) currentValue += stride
> Or include currentValue += stride within the if statement on line 131... 
> although this creates a pretty bad looking side-effect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34833) Apply right-padding correctly for correlated subqueries

2021-03-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34833.
--
Fix Version/s: 3.1.2
   3.2.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31940

> Apply right-padding correctly for correlated subqueries
> ---
>
> Key: SPARK-34833
> URL: https://issues.apache.org/jira/browse/SPARK-34833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.2.0, 3.1.2
>
>
> This ticket aims at fixing a bug where right-padding is not applied for char 
> types inside correlated subqueries.
> For example, the query below returns nothing on master, but the correct result 
> is `c`.
> {code}
> scala> sql(s"CREATE TABLE t1(v VARCHAR(3), c CHAR(5)) USING parquet")
> scala> sql(s"CREATE TABLE t2(v VARCHAR(5), c CHAR(7)) USING parquet")
> scala> sql("INSERT INTO t1 VALUES ('c', 'b')")
> scala> sql("INSERT INTO t2 VALUES ('a', 'b')")
> scala> val df = sql("""
>   |SELECT v FROM t1
>   |WHERE 'a' IN (SELECT v FROM t2 WHERE t2.c = t1.c )""".stripMargin)
> scala> df.show()
> +---+
> |  v|
> +---+
> +---+
> {code}
> This is because `ApplyCharTypePadding` does not handle the case above and does not 
> apply right-padding into `'abc'`. 
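
For readers less familiar with CHAR semantics, the mismatch can be reproduced with
plain Scala string padding (a sketch only; Spark performs the comparison on its
internal string representation, not via this helper):
{code:java}
// CHAR(n) values are stored right-padded with spaces to length n.
def charPad(s: String, n: Int): String = s.padTo(n, ' ')

val t1c = charPad("b", 5)   // t1.c, CHAR(5)  -> "b    "
val t2c = charPad("b", 7)   // t2.c, CHAR(7)  -> "b      "

// The correlated predicate t2.c = t1.c compares the two representations directly,
// so without additional padding the subquery matches nothing and the outer query
// returns an empty result:
println(t1c == t2c)                 // false

// Padding the shorter side up to the wider char length restores the expected match,
// which is what the right-padding rule is meant to guarantee here:
println(charPad(t1c, 7) == t2c)     // true
{code}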



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size

2021-03-24 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34843:

Description: 
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride: Long = ((upperBound / (numPartitions - 2.0)) - (lowerBound / 
(numPartitions - 2.0))).floor.toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (currently, it will actually do 2017-07-08 due to adding the first 
stride into the lower partition). This is over 3 years of extra data that will 
go into the last partition, and depending on the shape of the data could cause 
a very long running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / (1000L - 2.0)) - (-15611 / (1000L - 2.0))).floor.toLong = 34

This would put the upper bound at 2020-02-28 (currently, it will actually do 
2020-04-02 due to adding the first stride into the lower partition), which is 
much closer to the original supplied upper bound. This is the best you can do 
to get as close as possible to the upper bound (without adjusting the number of 
partitions). For example, a stride size of 35 would go well past the supplied 
upper bound (over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 
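
For quick verification, the two stride computations from the example above can be
evaluated directly in a Scala REPL (this is only the arithmetic, not the
JDBCRelation code itself):
{code:java}
val lowerBound = -15611L   // 1927-04-05 as days since the epoch
val upperBound = 18563L    // 2020-10-27 as days since the epoch
val numPartitions = 1000

// Current formula: both divisions truncate before the subtraction.
val currentStride = upperBound / numPartitions - lowerBound / numPartitions
println(currentStride)     // 33

// Proposed formula from this description.
val proposedStride =
  ((upperBound / (numPartitions - 2.0)) - (lowerBound / (numPartitions - 2.0))).floor.toLong
println(proposedStride)    // 34
{code}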

  was:
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride: Long = ((upperBound / (numPartitions - 2.0)) - (lowerBound / 
(numPartitions - 2.0))).floor.toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (currently, it will actually do 2020-07-08 due to adding the first 
stride into the lower partition). This is over 3 years of extra data that will 
go into the last partition, and depending on the shape of the data could cause 
a very long running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / (1000L - 2.0)) - (-15611 / (1000L - 2.0))).floor.toLong = 34

This would put the upper bound at 2020-02-28 (currently, it will actually do 
2020-04-02 due to adding the first stride into the lower partition), which is 
much closer to the original supplied upper bound. This is the best you can do 
to get as close as possible to the upper bound (without adjusting the number of 
partitions). For example, a stride size of 35 would go well past the supplied 
upper bound (over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 


> JDBCRelation columnPartition function improperly determines stride size
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = 

[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size

2021-03-24 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34843:

Description: 
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride: Long = ((upperBound / (numPartitions - 2.0)) - (lowerBound / 
(numPartitions - 2.0))).floor.toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (currently, it will actually do 2020-07-08 due to adding the first 
stride into the lower partition). This is over 3 years of extra data that will 
go into the last partition, and depending on the shape of the data could cause 
a very long running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / (1000L - 2.0)) - (-15611 / (1000L - 2.0))).floor.toLong = 34

This would put the upper bound at 2020-02-28 (currently, it will actually do 
2020-04-02 due to adding the first stride into the lower partition), which is 
much closer to the original supplied upper bound. This is the best you can do 
to get as close as possible to the upper bound (without adjusting the number of 
partitions). For example, a stride size of 35 would go well past the supplied 
upper bound (over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 

  was:
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride: Long = ((upperBound / (numPartitions - 2.0)) - (lowerBound / 
(numPartitions - 2.0))).floor.toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (17323). This is over 3 years of extra data that will go into the 
last partition, and depending on the shape of the data could cause a very long 
running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / (1000L - 2.0)) - (-15611 / (1000L - 2.0))).floor.toLong = 34

This would put the upper bound at 2020-02-28 (currently, it will actually do 
2020-04-02 due to adding the first stride into the lower partition), which is 
much closer to the original supplied upper bound. This is the best you can do 
to get as close as possible to the upper bound (without adjusting the number of 
partitions). For example, a stride size of 35 would go well past the supplied 
upper bound (over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 


> JDBCRelation columnPartition function improperly determines stride size
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both 

[jira] [Updated] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size

2021-03-24 Thread Jason Yarbrough (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Yarbrough updated SPARK-34843:

Description: 
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride: Long = ((upperBound / (numPartitions - 2.0)) - (lowerBound / 
(numPartitions - 2.0))).floor.toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (17323). This is over 3 years of extra data that will go into the 
last partition, and depending on the shape of the data could cause a very long 
running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / (1000L - 2.0)) - (-15611 / (1000L - 2.0))).floor.toLong = 34

This would put the upper bound at 2020-02-28 (currently, it will actually do 
2020-04-02 due to adding the first stride into the lower partition), which is 
much closer to the original supplied upper bound. This is the best you can do 
to get as close as possible to the upper bound (without adjusting the number of 
partitions). For example, a stride size of 35 would go well past the supplied 
upper bound (over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 

  was:
Currently, in JDBCRelation (line 123), the stride size is calculated as follows:

val stride: Long = upperBound / numPartitions - lowerBound / numPartitions

 

Due to truncation happening on both divisions, the stride size can fall short 
of what it should be. This can lead to a big difference between the provided 
upper bound and the actual start of the last partition.

I propose this formula, as it is much more accurate and leads to better 
distribution:

val stride: Long = ((upperBound / (numPartitions - 2.0)) - (lowerBound / 
(numPartitions - 2.0))).floor.toLong

 

An example (using a date column):

Say you're creating 1,000 partitions. If you provide a lower bound of 
1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
(translated to 18563), Spark determines the stride size as follows:

 

(18563L / 1000L) - (-15611 / 1000L) = 33

Starting from the lower bound, doing 998 strides of 33, you end up at 
2017-06-05 (17323). This is over 3 years of extra data that will go into the 
last partition, and depending on the shape of the data could cause a very long 
running task at the end of a job.

 

Using the formula I'm proposing, you'd get:

((18563L / (1000L - 2.0)) - (-15611 / (1000L - 2.0))).floor.toLong = 34

This would put the upper bound at 2020-02-28 (18321), which is much closer to 
the original supplied upper bound. This is the best you can do to get as close 
as possible to the upper bound (without adjusting the number of partitions). 
For example, a stride size of 35 would go well past the supplied upper bound 
(over 2 years, 2022-11-22).

 

In the above example, there is only a difference of 1 between the stride size 
using the current formula and the stride size using the proposed formula, but 
with greater distance between the lower and upper bounds, or a lower number of 
partitions, the difference can be much greater. 


> JDBCRelation columnPartition function improperly determines stride size
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Yarbrough
>Priority: Minor
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I 

[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2021-03-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308144#comment-17308144
 ] 

Dongjoon Hyun commented on SPARK-26345:
---

Thank you for reporting. Could you file a new JIRA, [~lxian2] and 
[~sha...@uber.com]?

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Parquet 1.11 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201
>  
> Benchmark result:
> [https://github.com/apache/spark/pull/31393#issuecomment-769767724]
> This feature is enabled by default, and users can disable it by setting 
> {{parquet.filter.columnindex.enabled}} to false.
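
Since {{parquet.filter.columnindex.enabled}} is a Parquet/Hadoop property rather
than a Spark SQL conf, here is a sketch of how one would typically turn it off from
a Spark job (the spark.hadoop.* passthrough and the hadoopConfiguration setter are
standard mechanisms; the property name is taken from the description above):
{code:java}
// Option 1: at submit time, spark.hadoop.* properties are copied into the Hadoop conf:
//   spark-submit --conf spark.hadoop.parquet.filter.columnindex.enabled=false ...

// Option 2: programmatically, on an existing SparkSession.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("no-column-index").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("parquet.filter.columnindex.enabled", "false")

// Subsequent Parquet scans in this session should then skip column-index filtering.
val df = spark.read.parquet("/path/to/table")   // path is illustrative
{code}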



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30497) migrate DESCRIBE TABLE to the new framework

2021-03-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308067#comment-17308067
 ] 

Chao Sun commented on SPARK-30497:
--

[~cloud_fan] this is resolved right?

> migrate DESCRIBE TABLE to the new framework
> ---
>
> Key: SPARK-30497
> URL: https://issues.apache.org/jira/browse/SPARK-30497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated SPARK-34857:
--
Priority: Minor  (was: Major)

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Priority: Minor
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?
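
For reference, a standalone sketch of the shape of that override (plain Scala, not
the actual Catalyst expression class), showing the string it would render:
{code:java}
// Illustrative stand-in for the expression, just to show the rendered text.
case class AtLeastNNonNullsLike(n: Int, children: Seq[String]) {
  override def toString: String = s"AtLeastNNonNulls($n, ${children.mkString(",")})"
}

println(AtLeastNNonNullsLike(1, Seq("c1#2410L")))   // AtLeastNNonNulls(1, c1#2410L)
{code}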



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Tim Armstrong (Jira)
Tim Armstrong created SPARK-34857:
-

 Summary: AtLeastNNonNulls does not show up correctly in explain
 Key: SPARK-34857
 URL: https://issues.apache.org/jira/browse/SPARK-34857
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
 Environment: I see this in an explain plan

{noformat}
(12) Filter
Input [3]: [c1#2410L, c2#2419, c3#2422]
Condition : AtLeastNNulls(n, c1#2410L)

I expect it to be AtLeastNNonNulls and n to have the actual value.
{noformat}

Proposed fix is to change 
https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
 to
{noformat}
override def toString: String = s"AtLeastNNonNulls(${n}, 
${children.mkString(",")})"
{noformat}
Or maybe it's OK to remove and use a default implementation?
Reporter: Tim Armstrong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308055#comment-17308055
 ] 

Tim Armstrong commented on SPARK-34857:
---

[~LI,Xiao] I would like to take this one on.

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Priority: Major
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated SPARK-34857:
--
Description: 
I see this in an explain plan

{noformat}
(12) Filter
Input [3]: [c1#2410L, c2#2419, c3#2422]
Condition : AtLeastNNulls(n, c1#2410L)

I expect it to be AtLeastNNonNulls and n to have the actual value.
{noformat}

Proposed fix is to change 
https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
 to
{noformat}
override def toString: String = s"AtLeastNNonNulls(${n}, 
${children.mkString(",")})"
{noformat}
Or maybe it's OK to remove and use a default implementation?

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Priority: Major
>
> I see this in an explain plan
> {noformat}
> (12) Filter
> Input [3]: [c1#2410L, c2#2419, c3#2422]
> Condition : AtLeastNNulls(n, c1#2410L)
> I expect it to be AtLeastNNonNulls and n to have the actual value.
> {noformat}
> Proposed fix is to change 
> https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
>  to
> {noformat}
> override def toString: String = s"AtLeastNNonNulls(${n}, 
> ${children.mkString(",")})"
> {noformat}
> Or maybe it's OK to remove and use a default implementation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34857) AtLeastNNonNulls does not show up correctly in explain

2021-03-24 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated SPARK-34857:
--
Environment: (was: I see this in an explain plan

{noformat}
(12) Filter
Input [3]: [c1#2410L, c2#2419, c3#2422]
Condition : AtLeastNNulls(n, c1#2410L)

I expect it to be AtLeastNNonNulls and n to have the actual value.
{noformat}

Proposed fix is to change 
https://github.com/databricks/runtime/blob/09ad623c458c40cbaa0dbb4b9f600d96531d35f5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala#L384
 to
{noformat}
override def toString: String = s"AtLeastNNonNulls(${n}, 
${children.mkString(",")})"
{noformat}
Or maybe it's OK to remove and use a default implementation?)

> AtLeastNNonNulls does not show up correctly in explain
> --
>
> Key: SPARK-34857
> URL: https://issues.apache.org/jira/browse/SPARK-34857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error

2021-03-24 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-34596:

Fix Version/s: 2.4.8

> NewInstance.doGenCode should not throw malformed class name error
> -
>
> Key: SPARK-34596
> URL: https://issues.apache.org/jira/browse/SPARK-34596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.1.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.4.8, 3.2.0, 3.1.2, 3.0.3
>
>
> Similar to SPARK-32238 and SPARK-32999, the use of 
> {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic 
> because Scala classes may trigger {{java.lang.InternalError: Malformed class 
> name}}.
> This happens more often when using nested classes in Scala (or declaring 
> classes in the Scala REPL, which implies class nesting).
> Note that on newer JDK versions the underlying malformed class name error no 
> longer reproduces (fixed in the JDK by 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less 
> of an issue there. But on JDK 8u this problem still exists, so we still have to 
> fix it.
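
For context, the JDK 8 quirk is easy to poke at outside Spark (a sketch only; the
class names are illustrative, and on newer JDKs the call simply succeeds or
returns an empty string):
{code:java}
object Outer {
  class Inner {
    // Nested and anonymous classes defined in Scala produce JVM names with '$'
    // patterns that java.lang.Class.getSimpleName can fail to parse on JDK 8.
    val anon: Runnable = new Runnable { def run(): Unit = () }
  }
}

val cls = (new Outer.Inner).anon.getClass
println(cls.getName)          // always safe, e.g. Outer$Inner$$anon$1
// On JDK 8, getSimpleName is the call that can throw
// java.lang.InternalError: Malformed class name
// for some nested or REPL-defined Scala classes, which is why replacing it
// (or falling back to getName) is the safer option.
println(cls.getSimpleName)
{code}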



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error

2021-03-24 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-34596:
---

Assignee: Kris Mok

> NewInstance.doGenCode should not throw malformed class name error
> -
>
> Key: SPARK-34596
> URL: https://issues.apache.org/jira/browse/SPARK-34596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.1.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Similar to SPARK-32238 and SPARK-32999, the use of 
> {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic 
> because Scala classes may trigger {{java.lang.InternalError: Malformed class 
> name}}.
> This happens more often when using nested classes in Scala (or declaring 
> classes in the Scala REPL, which implies class nesting).
> Note that on newer JDK versions the underlying malformed class name error no 
> longer reproduces (fixed in the JDK by 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less 
> of an issue there. But on JDK 8u this problem still exists, so we still have to 
> fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34829) transform_values return identical values when it's used with udf that returns reference type

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34829:


Assignee: Apache Spark

> transform_values return identical values when it's used with udf that returns 
> reference type
> 
>
> Key: SPARK-34829
> URL: https://issues.apache.org/jira/browse/SPARK-34829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Pavel Chernikov
>Assignee: Apache Spark
>Priority: Major
>
> If the return value of a {{udf}} that is passed to {{transform_values}} is an 
> {{AnyRef}}, then the transformation returns identical new values for each map 
> key (to be more precise, each newly obtained value overwrites the values for 
> all previously processed keys).
> Consider the following examples:
> {code:java}
> case class Bar(i: Int)
> val square = udf((b: Bar) => b.i * b.i)
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------+
> |map                           |map_square              |
> +------------------------------+------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> 1, 2 -> 4, 3 -> 9}|
> +------------------------------+------------------------+
> {code}
> vs 
> {code:java}
> case class Bar(i: Int)
> case class BarSquare(i: Int)
> val square = udf((b: Bar) => BarSquare(b.i * b.i))
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------------+
> |map                           |map_square                    |
> +------------------------------+------------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> {9}, 2 -> {9}, 3 -> {9}}|
> +------------------------------+------------------------------+
> {code}
> or even just this one
> {code:java}
> case class Foo(s: String)
> val reverse = udf((f: Foo) => f.s.reverse)
> val df = Seq(Map(1 -> Foo("abc"), 2 -> Foo("klm"), 3 -> 
> Foo("xyz"))).toDF("map")
> df.withColumn("map_reverse", transform_values(col("map"), (_, v) => 
> reverse(v))).show(truncate = false)
> +------------------------------------+------------------------------+
> |map                                 |map_reverse                   |
> +------------------------------------+------------------------------+
> |{1 -> {abc}, 2 -> {klm}, 3 -> {xyz}}|{1 -> zyx, 2 -> zyx, 3 -> zyx}|
> +------------------------------------+------------------------------+
> {code}
> After playing with 
> {{org.apache.spark.sql.catalyst.expressions.TransformValues}}, it looks like 
> something is going wrong while executing this line:
> {code:java}
> resultValues.update(i, functionForEval.eval(inputRow)){code}
> To be more precise, it's all about {{functionForEval.eval(inputRow)}}, 
> because if you do something like this:
> {code:java}
> println(s"RESULTS PRIOR TO EVALUATION - $resultValues")
> val resultValue = functionForEval.eval(inputRow)
> println(s"RESULT - $resultValue")
> println(s"RESULTS PRIOR TO UPDATE - $resultValues")
> resultValues.update(i, resultValue)
> println(s"RESULTS AFTER UPDATE - $resultValues"){code}
> You'll see something like this in the logs:
> {code:java}
> RESULTS PRIOR TO EVALUATION - [null,null,null] 
> RESULT - [0,1] 
> RESULTS PRIOR TO UPDATE - [null,null,null]
> RESULTS AFTER UPDATE - [[0,1],null,null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,1],null,null] 
> RESULT - [0,4]
> RESULTS PRIOR TO UPDATE - [[0,4],null,null] 
> RESULTS AFTER UPDATE - [[0,4],[0,4],null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,4],[0,4],null] 
> RESULT - [0,9]
> RESULTS PRIOR TO UPDATE - [[0,9],[0,9],null]
> RESULTS AFTER UPDATE - [[0,9],[0,9],[0,9]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34829) transform_values return identical values when it's used with udf that returns reference type

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34829:


Assignee: (was: Apache Spark)

> transform_values return identical values when it's used with udf that returns 
> reference type
> 
>
> Key: SPARK-34829
> URL: https://issues.apache.org/jira/browse/SPARK-34829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Pavel Chernikov
>Priority: Major
>
> If the return value of a {{udf}} that is passed to {{transform_values}} is an 
> {{AnyRef}}, then the transformation returns identical new values for each map 
> key (to be more precise, each newly obtained value overwrites the values for 
> all previously processed keys).
> Consider the following examples:
> {code:java}
> case class Bar(i: Int)
> val square = udf((b: Bar) => b.i * b.i)
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------+
> |map                           |map_square              |
> +------------------------------+------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> 1, 2 -> 4, 3 -> 9}|
> +------------------------------+------------------------+
> {code}
> vs 
> {code:java}
> case class Bar(i: Int)
> case class BarSquare(i: Int)
> val square = udf((b: Bar) => BarSquare(b.i * b.i))
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------------+
> |map                           |map_square                    |
> +------------------------------+------------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> {9}, 2 -> {9}, 3 -> {9}}|
> +------------------------------+------------------------------+
> {code}
> or even just this one
> {code:java}
> case class Foo(s: String)
> val reverse = udf((f: Foo) => f.s.reverse)
> val df = Seq(Map(1 -> Foo("abc"), 2 -> Foo("klm"), 3 -> 
> Foo("xyz"))).toDF("map")
> df.withColumn("map_reverse", transform_values(col("map"), (_, v) => 
> reverse(v))).show(truncate = false)
> +------------------------------------+------------------------------+
> |map                                 |map_reverse                   |
> +------------------------------------+------------------------------+
> |{1 -> {abc}, 2 -> {klm}, 3 -> {xyz}}|{1 -> zyx, 2 -> zyx, 3 -> zyx}|
> +------------------------------------+------------------------------+
> {code}
> After playing with 
> {{org.apache.spark.sql.catalyst.expressions.TransformValues}}, it looks like 
> something is going wrong while executing this line:
> {code:java}
> resultValues.update(i, functionForEval.eval(inputRow)){code}
> To be more precise, it's all about {{functionForEval.eval(inputRow)}}, 
> because if you do something like this:
> {code:java}
> println(s"RESULTS PRIOR TO EVALUATION - $resultValues")
> val resultValue = functionForEval.eval(inputRow)
> println(s"RESULT - $resultValue")
> println(s"RESULTS PRIOR TO UPDATE - $resultValues")
> resultValues.update(i, resultValue)
> println(s"RESULTS AFTER UPDATE - $resultValues"){code}
> You'll see something like this in the logs:
> {code:java}
> RESULTS PRIOR TO EVALUATION - [null,null,null] 
> RESULT - [0,1] 
> RESULTS PRIOR TO UPDATE - [null,null,null]
> RESULTS AFTER UPDATE - [[0,1],null,null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,1],null,null] 
> RESULT - [0,4]
> RESULTS PRIOR TO UPDATE - [[0,4],null,null] 
> RESULTS AFTER UPDATE - [[0,4],[0,4],null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,4],[0,4],null] 
> RESULT - [0,9]
> RESULTS PRIOR TO UPDATE - [[0,9],[0,9],null]
> RESULTS AFTER UPDATE - [[0,9],[0,9],[0,9]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34829) transform_values return identical values when it's used with udf that returns reference type

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308034#comment-17308034
 ] 

Apache Spark commented on SPARK-34829:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/31955

> transform_values return identical values when it's used with udf that returns 
> reference type
> 
>
> Key: SPARK-34829
> URL: https://issues.apache.org/jira/browse/SPARK-34829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Pavel Chernikov
>Priority: Major
>
> If the return value of a {{udf}} that is passed to {{transform_values}} is an 
> {{AnyRef}}, then the transformation returns identical new values for each map 
> key (to be more precise, each newly obtained value overwrites the values for 
> all previously processed keys).
> Consider the following examples:
> {code:java}
> case class Bar(i: Int)
> val square = udf((b: Bar) => b.i * b.i)
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------+
> |map                           |map_square              |
> +------------------------------+------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> 1, 2 -> 4, 3 -> 9}|
> +------------------------------+------------------------+
> {code}
> vs 
> {code:java}
> case class Bar(i: Int)
> case class BarSquare(i: Int)
> val square = udf((b: Bar) => BarSquare(b.i * b.i))
> val df = Seq(Map(1 -> Bar(1), 2 -> Bar(2), 3 -> Bar(3))).toDF("map")
> df.withColumn("map_square", transform_values(col("map"), (_, v) => 
> square(v))).show(truncate = false)
> +------------------------------+------------------------------+
> |map                           |map_square                    |
> +------------------------------+------------------------------+
> |{1 -> {1}, 2 -> {2}, 3 -> {3}}|{1 -> {9}, 2 -> {9}, 3 -> {9}}|
> +------------------------------+------------------------------+
> {code}
> or even just this one
> {code:java}
> case class Foo(s: String)
> val reverse = udf((f: Foo) => f.s.reverse)
> val df = Seq(Map(1 -> Foo("abc"), 2 -> Foo("klm"), 3 -> 
> Foo("xyz"))).toDF("map")
> df.withColumn("map_reverse", transform_values(col("map"), (_, v) => 
> reverse(v))).show(truncate = false)
> +------------------------------------+------------------------------+
> |map                                 |map_reverse                   |
> +------------------------------------+------------------------------+
> |{1 -> {abc}, 2 -> {klm}, 3 -> {xyz}}|{1 -> zyx, 2 -> zyx, 3 -> zyx}|
> +------------------------------------+------------------------------+
> {code}
> After playing with 
> {{org.apache.spark.sql.catalyst.expressions.TransformValues}}, it looks like 
> something is going wrong while executing this line:
> {code:java}
> resultValues.update(i, functionForEval.eval(inputRow)){code}
> To be more precise, it's all about {{functionForEval.eval(inputRow)}}, 
> because if you do something like this:
> {code:java}
> println(s"RESULTS PRIOR TO EVALUATION - $resultValues")
> val resultValue = functionForEval.eval(inputRow)
> println(s"RESULT - $resultValue")
> println(s"RESULTS PRIOR TO UPDATE - $resultValues")
> resultValues.update(i, resultValue)
> println(s"RESULTS AFTER UPDATE - $resultValues"){code}
> You'll see something like this in the logs:
> {code:java}
> RESULTS PRIOR TO EVALUATION - [null,null,null] 
> RESULT - [0,1] 
> RESULTS PRIOR TO UPDATE - [null,null,null]
> RESULTS AFTER UPDATE - [[0,1],null,null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,1],null,null] 
> RESULT - [0,4]
> RESULTS PRIOR TO UPDATE - [[0,4],null,null] 
> RESULTS AFTER UPDATE - [[0,4],[0,4],null]
> --
> RESULTS PRIOR TO EVALUATION - [[0,4],[0,4],null] 
> RESULT - [0,9]
> RESULTS PRIOR TO UPDATE - [[0,9],[0,9],null]
> RESULTS AFTER UPDATE - [[0,9],[0,9],[0,9]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34797) Refactor Logistic Aggregator - support virtual centering

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34797.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31889
[https://github.com/apache/spark/pull/31889]

> Refactor Logistic Aggregator - support virtual centering
> 
>
> Key: SPARK-34797
> URL: https://issues.apache.org/jira/browse/SPARK-34797
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> 1. add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator 
> and related suites;
> 2. implement 'virtual centering' in standardization



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34797) Refactor Logistic Aggregator - support virtual centering

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34797:


Assignee: zhengruifeng  (was: Apache Spark)

> Refactor Logistic Aggregator - support virtual centering
> 
>
> Key: SPARK-34797
> URL: https://issues.apache.org/jira/browse/SPARK-34797
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.2.0
>
>
> 1. add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator 
> and related suites;
> 2. implement 'virtual centering' in standardization



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2021-03-24 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308000#comment-17308000
 ] 

Xinli Shang commented on SPARK-26345:
-

Yes, it needs some synchronization. I have a modified version of the 
implementation in Presto. You can check it 
[here|https://github.com/shangxinli/presto/commit/f6327a161eb6cfd5137f679620e095d8257816b8#diff-bb24b92e28343804ebaf540efe6c1cda0b5e2524e6811f8fe2daee5944dad386R203].
 

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Parquet 1.11 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201
>  
> Benchmark result:
> [https://github.com/apache/spark/pull/31393#issuecomment-769767724]
> This feature is enabled by default, and users can disable it by setting 
> {{parquet.filter.columnindex.enabled}} to false.
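>
> A hedged example of flipping that switch from a Spark session (the config key 
> is the one named above; the session variable and path are placeholders, as in 
> a spark-shell):
> {code:java}
> // Disable Parquet column-index filtering for reads done through this context.
> spark.sparkContext.hadoopConfiguration
>   .set("parquet.filter.columnindex.enabled", "false")
>
> // Subsequent Parquet scans fall back to row-group-level filtering only.
> val df = spark.read.parquet("/tmp/example_table")   // placeholder path
> df.filter("id = 42").show()
> {code}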



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34856) ANSI mode: Allow casting complex types as string type

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34856:


Assignee: Gengliang Wang  (was: Apache Spark)

> ANSI mode: Allow casting complex types as string type
> -
>
> Key: SPARK-34856
> URL: https://issues.apache.org/jira/browse/SPARK-34856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, complex types are not allowed to be cast as string type. This 
> breaks the Dataset.show() API. E.g.
> {code:java}
> scala> sql("select array(1, 2, 2)").show(false)
> org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`array(1, 2, 2)` 
> AS STRING)' due to data type mismatch:
>  cannot cast array to string with ANSI mode on.
> {code}
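>
> A minimal repro sketch from the spark-shell (assuming spark.sql.ansi.enabled 
> is the switch for ANSI mode; output abbreviated):
> {code:java}
> spark.conf.set("spark.sql.ansi.enabled", "true")
>
> // Today this fails with the AnalysisException quoted above, because show()
> // needs an implicit cast of the array column to string.
> spark.sql("select array(1, 2, 2)").show(false)
>
> // After the proposed change the cast is allowed and show() prints [1, 2, 2].
> {code}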



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34856) ANSI mode: Allow casting complex types as string type

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34856:


Assignee: Apache Spark  (was: Gengliang Wang)

> ANSI mode: Allow casting complex types as string type
> -
>
> Key: SPARK-34856
> URL: https://issues.apache.org/jira/browse/SPARK-34856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, complex types are not allowed to be cast as string type. This 
> breaks the Dataset.show() API. E.g.
> {code:java}
> scala> sql("select array(1, 2, 2)").show(false)
> org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`array(1, 2, 2)` 
> AS STRING)' due to data type mismatch:
>  cannot cast array to string with ANSI mode on.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34856) ANSI mode: Allow casting complex types as string type

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307910#comment-17307910
 ] 

Apache Spark commented on SPARK-34856:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31954

> ANSI mode: Allow casting complex types as string type
> -
>
> Key: SPARK-34856
> URL: https://issues.apache.org/jira/browse/SPARK-34856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, complex types are not allowed to be cast as string type. This 
> breaks the Dataset.show() API. E.g.
> {code:java}
> scala> sql("select array(1, 2, 2)").show(false)
> org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`array(1, 2, 2)` 
> AS STRING)' due to data type mismatch:
>  cannot cast array to string with ANSI mode on.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34856) ANSI mode: Allow casting complex types as string type

2021-03-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-34856:
--

 Summary: ANSI mode: Allow casting complex types as string type
 Key: SPARK-34856
 URL: https://issues.apache.org/jira/browse/SPARK-34856
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Currently, complex types are not allowed to be cast as string type. This breaks 
the Dataset.show() API. E.g.

{code:java}
scala> sql("select array(1, 2, 2)").show(false)
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`array(1, 2, 2)` 
AS STRING)' due to data type mismatch:
 cannot cast array to string with ANSI mode on.
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse

2021-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-33482:

Fix Version/s: 3.0.3

> V2 Datasources that extend FileScan preclude exchange reuse
> ---
>
> Key: SPARK-33482
> URL: https://issues.apache.org/jira/browse/SPARK-33482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Bruce Robbins
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Sample query:
> {noformat}
> spark.read.parquet("tbl").createOrReplaceTempView("tbl")
> spark.read.parquet("lookup").createOrReplaceTempView("lookup")
> sql("""
>select tbl.col1, fk1, fk2
>from tbl, lookup l1, lookup l2
>where fk1 = l1.key
>and fk2 = l2.key
> """).explain
> {noformat}
> Test files can be created as so:
> {noformat}
> import scala.util.Random
> val rand = Random
> val tbl = spark.range(1, 1).map { x =>
>   (rand.nextLong.abs % 20,
>rand.nextLong.abs % 20,
>x)
> }.toDF("fk1", "fk2", "col1")
> tbl.write.mode("overwrite").parquet("tbl")
> val lookup = spark.range(0, 20).map { x =>
>   (x + 1, x * 1, (x + 1) * 1)
> }.toDF("key", "col1", "col2")
> lookup.write.mode("overwrite").parquet("lookup")
> {noformat}
> Output with V1 Parquet reader:
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, 
> DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- FileScan parquet [key#6L] Batched: true, DataFilters: 
> [isnotnull(key#6L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
>+- ReusedExchange [key#12L], BroadcastExchange 
> HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
> {noformat}
> With V1 Parquet reader, the exchange for lookup is reused (see last line).
> Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""):
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: 
> [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct, PushedFilters: 
> [IsNotNull(fk1), IsNotNull(fk2)]
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- BatchScan[key#6L] ParquetScan DataFilters: 
> [isnotnull(key#6L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]),false), [id=#83]
>   +- *(2) Filter isnotnull(key#12L)
>  +- *(2) ColumnarToRow
> +- BatchScan[key#12L] ParquetScan DataFilters: 
> [isnotnull(key#12L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
> {noformat}
> With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 
> lines).
> You can see the same issue with the ORC reader (and I assume any other 
> datasource that extends FileScan).
> The issue appears to be this check in FileScan#equals:

[jira] [Updated] (SPARK-34756) Fix FileScan equality check

2021-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-34756:

Fix Version/s: 3.0.3

> Fix FileScan equality check
> ---
>
> Key: SPARK-34756
> URL: https://issues.apache.org/jira/browse/SPARK-34756
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> `&&` is missing from `FileScan.equals()`.
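>
> A self-contained illustration of this class of bug (hypothetical types, not 
> the actual Spark code): without the `&&`, the first comparison is a discarded 
> statement and only the last expression decides equality, so two scans over 
> different locations can compare equal and mislead exchange reuse.
> {code:java}
> class Scan(val path: String, val filters: Seq[String]) {
>   override def equals(obj: Any): Boolean = obj match {
>     case s: Scan =>
>       path == s.path       // missing && : evaluated and thrown away
>       filters == s.filters // only this comparison becomes the result
>     case _ => false
>   }
>   override def hashCode(): Int = filters.hashCode
> }
>
> object EqualsDemo extends App {
>   val a = new Scan("/data/tbl", Seq("IsNotNull(key)"))
>   val b = new Scan("/data/lookup", Seq("IsNotNull(key)"))
>   println(a == b) // true despite different paths; chaining with && fixes it
> }
> {code}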



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34769) AnsiTypeCoercion: return narrowest convertible type among TypeCollection

2021-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34769.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31859
[https://github.com/apache/spark/pull/31859]

> AnsiTypeCoercion: return narrowest convertible type among TypeCollection
> 
>
> Key: SPARK-34769
> URL: https://issues.apache.org/jira/browse/SPARK-34769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, when implicitly casting a data type to a `TypeCollection`, Spark 
> returns the first convertible data type in the `TypeCollection`.
> In ANSI mode, we can make the behavior more reasonable by returning the 
> narrowest convertible data type in the `TypeCollection`.
> In detail, we first try to find all the expected types we can implicitly cast 
> to:
> 1. if there are no convertible data types, return None;
> 2. if there is only one convertible data type, cast the input to it;
> 3. otherwise, if there are multiple convertible data types, find the narrowest 
> common data type among them. If there is no such narrowest common data type, 
> return None.
> Note that if the narrowest common type is Float and the convertible types 
> contain Double, simply return Double as the narrowest common type, to avoid 
> potential precision loss when converting an integral type to Float (a 
> schematic sketch of this selection logic follows below).
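>
> A schematic sketch of that selection logic (toy string "types" and 
> caller-supplied helper functions standing in for Spark's DataType precedence 
> rules; the real code lives in AnsiTypeCoercion):
> {code:java}
> def implicitCastTarget(
>     input: String,
>     expected: Seq[String],
>     canCast: (String, String) => Boolean,
>     narrowestCommon: Seq[String] => Option[String]): Option[String] = {
>   val convertible = expected.filter(t => canCast(input, t))
>   convertible match {
>     case Seq()       => None          // 1. nothing is convertible
>     case Seq(single) => Some(single)  // 2. exactly one candidate: use it
>     case many        =>               // 3. several: take the narrowest common
>       narrowestCommon(many).map {
>         // widen Float to Double when both are candidates, to avoid precision
>         // loss when the input is an integral type
>         case "FloatType" if many.contains("DoubleType") => "DoubleType"
>         case other                                      => other
>       }
>   }
> }
> {code}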



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34510) .foreachPartition command hangs when run inside Python package but works when run from Python file outside the package on EMR

2021-03-24 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307879#comment-17307879
 ] 

Sean R. Owen commented on SPARK-34510:
--

Firstly, does this happen on Apache Spark or just EMR?
Are you sure it's not just taking a long time to read data from S3?

> .foreachPartition command hangs when run inside Python package but works when 
> run from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I'm running PySpark 3.0.0 on EMR with the project structure below. process.py 
> is what controls the flow of the application and calls code inside the 
> _file_processor_ package. The command hangs when the .foreachPartition code 
> that is located inside _s3_repo.py_ is called by _process.py_. When the same 
> .foreachPartition code is moved from _s3_repo.py_ and placed inside 
> _process.py_, it runs just fine.
> {code:java}
> process.py
> file_processor
>   config
> spark.py
>   repository
> s3_repo.py
>   structure
> table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
> def process():
> table_creator.create_table()
> s3_repo.save_to_s3()
> if __name__ == '__main__':
> process()
> {code}
> *spark.py*
> {code:java}
> from pyspark.sql import SparkSession
> spark_session = SparkSession.builder.appName("Test").getOrCreate()
> {code}
> *s3_repo.py* 
> {code:java}
> from file_processor.config.spark import spark_session
> def save_to_s3():
> spark_session.sql('SELECT * FROM 
> rawFileData').toJSON().foreachPartition(_save_to_s3)
> def _save_to_s3(iterator):   
> for record in iterator:
> print(record)
> {code}
>  *table_creator.py*
> {code:java}
> from file_processor.config.spark import spark_session
> from pyspark.sql import Row
> def create_table():
> file_contents = [
> {'line_num': 1, 'contents': 'line 1'},
> {'line_num': 2, 'contents': 'line 2'},
> {'line_num': 3, 'contents': 'line 3'}
> ]
> spark_session.createDataFrame(Row(**row) for row in 
> file_contents).cache().createOrReplaceTempView("rawFileData")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34530) logError for interrupting block migrations is too high

2021-03-24 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307878#comment-17307878
 ] 

Sean R. Owen commented on SPARK-34530:
--

Is this just about turning the log level down? That'd be fine, I'm sure. Where is it?

> logError for interrupting block migrations is too high
> --
>
> Key: SPARK-34530
> URL: https://issues.apache.org/jira/browse/SPARK-34530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34569) java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.

2021-03-24 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307876#comment-17307876
 ] 

Sean R. Owen commented on SPARK-34569:
--

This sounds like a Spark version mismatch?

> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.
> 
>
> Key: SPARK-34569
> URL: https://issues.apache.org/jira/browse/SPARK-34569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Vishwesh Vinc
>Priority: Major
>
> Using Apache Spark 3.0 with Scala 2.12 and trying to submit a job, which 
> reads from Azure Event Hub as binary and then deserializes it into a 
> protobuf-generated class, I get this error. The stack trace is as follows:
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.(Lorg/apache/spark/sql/types/StructType;ZLscala/collection/Seq;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/reflect/ClassTag;)V
>   at 
> frameless.TypedExpressionEncoder$.apply(TypedExpressionEncoder.scala:45)
>   at 
> scalapb.spark.Implicits.typedEncoderToEncoder(TypedEncoders.scala:125)
>   at 
> scalapb.spark.Implicits.typedEncoderToEncoder$(TypedEncoders.scala:122)
>   at 
> scalapb.spark.Implicits$.typedEncoderToEncoder(TypedEncoders.scala:128)
>   at 
> Utils.MessagingQueues.EventHubSourceReader$.editorialAdStream(EventHubSourceReader.scala:57)
>   at 
> Utils.MessagingQueues.SourceReader$.readEditorialAdSource(SourceReader.scala:39)
>   at 
> Workflows.Streaming.PROD.UnifiedNRT.UnifiedNRTHelper$.fetchInputStream(UnifiedNRTHelper.scala:62)
>   at 
> Workflows.Streaming.PROD.UnifiedNRT.UnifiedNRT$.main(UnifiedNRT.scala:50)
>   at Workflows.Streaming.PROD.UnifiedNRT.UnifiedNRT.main(UnifiedNRT.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:775)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to diagnose and impossible to correct

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34591:
-
Priority: Minor  (was: Critical)

(Not nearly Critical.)
Yes, please open a PR to add the parameter.

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Minor
>  Labels: pyspark
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the behaviour above:
>  * Occurring always (even when the user may not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may be in conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us to work out why we were seeing 
> certain behaviours - we had to trace back through all of the detailed Spark 
> release notes to identify where the problem might be.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Resolved] (SPARK-34822) Update plan stability golden files even if only explain differs

2021-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34822.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31927
[https://github.com/apache/spark/pull/31927]

> Update plan stability golden files even if only explain differs
> ---
>
> Key: SPARK-34822
> URL: https://issues.apache.org/jira/browse/SPARK-34822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.2.0
>
>
> PlanStabilitySuite updates the golden files only if simplified.txt has 
> changed. In some situations only explain.txt will change and the golden files 
> are not updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34822) Update plan stability golden files even if only explain differs

2021-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34822:
---

Assignee: Tanel Kiis

> Update plan stability golden files even if only explain differs
> ---
>
> Key: SPARK-34822
> URL: https://issues.apache.org/jira/browse/SPARK-34822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
>
> PlanStabilitySuite updates the golden files only if simplified.txt has 
> changed. In some situations only explain.txt will change and the golden files 
> are not updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34630) Add type hints of pyspark.__version__ and pyspark.sql.Column.contains

2021-03-24 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-34630.

Fix Version/s: 3.1.2
   3.2.0
 Assignee: Danny Meijer
   Resolution: Fixed

Issue resolved by pull request 31823

https://github.com/apache/spark/pull/31823

> Add type hints of pyspark.__version__ and pyspark.sql.Column.contains
> -
>
> Key: SPARK-34630
> URL: https://issues.apache.org/jira/browse/SPARK-34630
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Hyukjin Kwon
>Assignee: Danny Meijer
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> pyspark.__version__ and pyspark.sql.Column.contains are missing in python 
> type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34689) Spark Thrift Server: Memory leak for SparkSession objects

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34689.
--
Resolution: Duplicate

> Spark Thrift Server: Memory leak for SparkSession objects
> -
>
> Key: SPARK-34689
> URL: https://issues.apache.org/jira/browse/SPARK-34689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Dimitris Batis
>Priority: Major
> Attachments: heap_sparksession.png, 
> heapdump_local_attempt_250_closed_connections.png, test_patch.diff
>
>
> When running the Spark Thrift Server (3.0.1, standalone cluster), we have 
> noticed that each new JDBC connection creates a new SparkSession object. This 
> object (and anything being referenced by it), however, remains in memory 
> indefinitely even though the JDBC connection is closed, and full GCs do not 
> remove it. After about 18 hours of heavy use, we get more than 46,000 such 
> objects (heap_sparksession.png).
> In a small local installation test, I replicated the behavior by simply 
> opening a JDBC connection, executing SHOW SCHEMAS and closing the connection 
> (heapdump_local_attempt.png). For each connection, a new SparkSession object 
> is created and never removed. I have noticed the same behavior in Spark 3.1.1 
> as well.
> Our settings are as follows. Please note that this was occurring even before 
> we added the ExplicitGCInvokesConcurrent option (i.e. it happened even when a 
> full GC was performed every 20 minutes). 
> spark-defaults.conf:
> {code}
> spark.masterspark://...:7077,...:7077
> spark.master.rest.enabled   true
> spark.eventLog.enabled  false
> spark.eventLog.dir  file:///...
> spark.driver.cores 1
> spark.driver.maxResultSize 4g
> spark.driver.memory5g
> spark.executor.memory  1g
> spark.executor.logs.rolling.maxRetainedFiles   2
> spark.executor.logs.rolling.strategy   size
> spark.executor.logs.rolling.maxSize1G
> spark.local.dir ...
> spark.sql.ui.retainedExecutions=10
> spark.ui.retainedDeadExecutors=10
> spark.worker.ui.retainedExecutors=10
> spark.worker.ui.retainedDrivers=10
> spark.ui.retainedJobs=30
> spark.ui.retainedStages=100
> spark.ui.retainedTasks=500
> spark.appStateStore.asyncTracking.enable=false
> spark.sql.shuffle.partitions=200
> spark.default.parallelism=200
> spark.task.reaper.enabled=true
> spark.task.reaper.threadDump=false
> spark.memory.offHeap.enabled=true
> spark.memory.offHeap.size=4g
> {code}
> spark-env.sh:
> {code}
> HADOOP_CONF_DIR="/.../hadoop/etc/hadoop"
> SPARK_WORKER_CORES=28
> SPARK_WORKER_MEMORY=54g
> SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true 
> -Dspark.worker.cleanup.appDataTtl=172800 -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=40 "
> SPARK_DAEMON_JAVA_OPTS="-Dlog4j.configuration=file:///.../log4j.properties 
> -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.dir="..." 
> -Dspark.deploy.zookeeper.url=...:2181,...:2181,...:2181 -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=40"
> {code}
> start-thriftserver.sh:
> {code}
> export SPARK_DAEMON_MEMORY=16g
> exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 \
>   --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
>   --conf "spark.ui.retainedJobs=30" \
>   --conf "spark.ui.retainedStages=100" \
>   --conf "spark.ui.retainedTasks=500" \
>   --conf "spark.sql.ui.retainedExecutions=10" \
>   --conf "spark.appStateStore.asyncTracking.enable=false" \
>   --conf "spark.cleaner.periodicGC.interval=20min" \
>   --conf "spark.sql.autoBroadcastJoinThreshold=-1" \
>   --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=200" \
>   --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
> -Xloggc:/.../thrift_driver_gc.log -XX:+UseGCLogFileRotation 
> -XX:NumberOfGCLogFiles=7 -XX:GCLogFileSize=35M -XX:+UseG1GC 
> -XX:MaxGCPauseMillis=200 -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.port=11990 -XX:+ExplicitGCInvokesConcurrent" \
>   --conf "spark.metrics.namespace=..." --name "..." --packages 
> io.delta:delta-core_2.12:0.7.0 --hiveconf spark.ui.port=4038 --hiveconf 
> spark.cores.max=22 --hiveconf spark.executor.cores=3 --hiveconf 
> spark.executor.memory=6144M --hiveconf spark.scheduler.mode=FAIR --hiveconf 
> spark.scheduler.allocation.file=.../conf/thrift-scheduler.xml \
>   --conf spark.sql.thriftServer.incrementalCollect=true "$@"
> 

[jira] [Resolved] (SPARK-34746) Spark dependencies require scala 2.12.12

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34746.
--
Resolution: Duplicate

Yep, though there are reasons we couldn't use 2.12.12

> Spark dependencies require scala 2.12.12
> 
>
> Key: SPARK-34746
> URL: https://issues.apache.org/jira/browse/SPARK-34746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Peter Kaiser
>Priority: Critical
>
> In our application we're creating a Spark session programmatically. The 
> application is built using Gradle.
> After upgrading Spark to 3.1.1 it no longer works, due to incompatible 
> classes on the driver and executor (namely: 
> scala.lang.collections.immutable.WrappedArray.ofRef).
> It turns out this was caused by different Scala versions on the driver vs. the 
> executor. While Spark still ships with Scala 2.12.10, some of its dependencies 
> in the Gradle build require Scala 2.12.12:
> {noformat}
> Cannot find a version of 'org.scala-lang:scala-library' that satisfies the 
> version constraints:
> Dependency path '...' --> '...' --> 'org.scala-lang:scala-library:{strictly 
> 2.12.10}'
> Dependency path '...' --> 'org.apache.spark:spark-core_2.12:3.1.1' --> 
> 'org.json4s:json4s-jackson_2.12:3.7.0-M5' --> 
> 'org.scala-lang:scala-library:2.12.12' {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34757) Spark submit should ignore cache for SNAPSHOT dependencies

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34757:
-
Priority: Minor  (was: Major)

> Spark submit should ignore cache for SNAPSHOT dependencies
> --
>
> Key: SPARK-34757
> URL: https://issues.apache.org/jira/browse/SPARK-34757
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 3.1.1
>Reporter: Bo Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> When spark-submit is executed with --packages, it will not download the 
> dependency jars when they are available in cache (e.g. ivy cache), even when 
> the dependencies are SNAPSHOTs. 
> This might block developers who work on external modules in Spark (e.g. 
> spark-avro), since they need to remove the cache manually every time they 
> update the code during development (which generates SNAPSHOT jars). 
> Without knowing this, they could be stuck wondering why their code changes 
> are not reflected in spark-submit executions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34759) run JavaSparkSQLExample failed with Exception.

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34759.
--
Resolution: Duplicate

> run JavaSparkSQLExample failed with Exception.
> --
>
> Key: SPARK-34759
> URL: https://issues.apache.org/jira/browse/SPARK-34759
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 3.0.1, 3.1.1
>Reporter: zengrui
>Priority: Minor
>
> Running JavaSparkSQLExample fails with an exception.
> The exception is thrown in the function runDatasetCreationExample, when 
> executing 'spark.read().json(path).as(personEncoder)'.
> The exception is 'Exception in thread "main" 
> org.apache.spark.sql.AnalysisException: Cannot up cast `age` from bigint to 
> int.'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34801) java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadPartition

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34801.
--
Resolution: Not A Problem

> java.lang.NoSuchMethodException: 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition
> --
>
> Key: SPARK-34801
> URL: https://issues.apache.org/jira/browse/SPARK-34801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2
> Environment: HDP3.1.4.0-315  spark 3.0.2
>Reporter: zhaojk
>Priority: Major
>
> Running this SQL with spark-sql: insert overwrite table zry.zjk1 
> partition(etl_dt=2) select * from zry.zry; fails with:
> java.lang.NoSuchMethodException: 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(org.apache.hadoop.fs.Path,
>  org.apache.hadoop.hive.ql.metadata.Table, java.util.Map, 
> org.apache.hadoop.hive.ql.plan.LoadTableDesc$LoadFileType, boolean, boolean, 
> boolean, boolean, boolean, java.lang.Long, int, boolean)
>  at java.lang.Class.getMethod(Class.java:1786)
>  at org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:177)
>  at 
> org.apache.spark.sql.hive.client.Shim_v3_0.loadPartitionMethod$lzycompute(HiveShim.scala:1289)
>  at 
> org.apache.spark.sql.hive.client.Shim_v3_0.loadPartitionMethod(HiveShim.scala:1274)
>  at 
> org.apache.spark.sql.hive.client.Shim_v3_0.loadPartition(HiveShim.scala:1337)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:881)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:295)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:277)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:871)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:915)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:894)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
>  at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:318)
>  at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
>  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:229)
>  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
>  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at 
> 

[jira] [Updated] (SPARK-34838) consider markdownlint

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34838:
-
Issue Type: Improvement  (was: Bug)

> consider markdownlint
> -
>
> Key: SPARK-34838
> URL: https://issues.apache.org/jira/browse/SPARK-34838
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 3.1.1
>Reporter: Josh Soref
>Priority: Minor
>
> There's apparently a tool called markdownlint which seems to be integrated 
> with Visual Studio Code. It suggests, among other things, using {{*}} instead 
> of {{-}} for top-level bullets:
> https://github.com/DavidAnson/markdownlint/blob/v0.23.1/doc/Rules.md#md004 
> https://github.com/apache/spark/pull/30679#issuecomment-804971370



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34853) Move partitioning and ordering to common limit trait

2021-03-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34853.
--
Fix Version/s: 3.2.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31950

> Move partitioning and ordering to common limit trait
> 
>
> Key: SPARK-34853
> URL: https://issues.apache.org/jira/browse/SPARK-34853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Both local limit and global limit define the output partitioning and output 
> ordering in the same way, and this is duplicated 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L159-L175]
>  ). We can move the output partitioning and ordering into their parent trait, 
> BaseLimitExec. This is a minor code refactoring; a simplified sketch follows.
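>
> A simplified sketch of the refactoring (placeholder types instead of Spark's 
> SparkPlan, Partitioning and SortOrder):
> {code:java}
> trait PlanNode {
>   def outputPartitioning: String
>   def outputOrdering: Seq[String]
> }
>
> // The two identical overrides live once in the shared parent trait ...
> trait BaseLimitExec extends PlanNode {
>   def child: PlanNode
>   def limit: Int
>   override def outputPartitioning: String = child.outputPartitioning
>   override def outputOrdering: Seq[String] = child.outputOrdering
> }
>
> // ... so the concrete limit nodes no longer need to repeat them.
> case class LocalLimitExec(limit: Int, child: PlanNode) extends BaseLimitExec
> case class GlobalLimitExec(limit: Int, child: PlanNode) extends BaseLimitExec
> {code}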



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34488) Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34488:


Assignee: Ron Hu

> Support task Metrics Distributions and executor Metrics Distributions in the 
> REST API call for a specified stage
> 
>
> Key: SPARK-34488
> URL: https://issues.apache.org/jira/browse/SPARK-34488
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ron Hu
>Assignee: Ron Hu
>Priority: Major
> Attachments: executorMetricsDistributions.json, 
> taskMetricsDistributions.json
>
>
> For a specific stage, it is useful to show the task metrics in percentile 
> distribution.  This information can help users know whether or not there is a 
> skew/bottleneck among tasks in a given stage.  We list an example in 
> [^taskMetricsDistributions.json]
> Similarly, it is useful to show the executor metrics in percentile 
> distribution for a specific stage. This information can show whether or not 
> there is a skewed load on some executors.  We list an example in 
> [^executorMetricsDistributions.json]
>  
> We define withSummaries query parameter in the REST API for a specific stage 
> as:
> applications///?withSummaries=[true|false]&quantiles=0.0,0.25,0.5,0.75,1.0
>  * When withSummaries=true, both task metrics in percentile distribution and 
> executor metrics in percentile distribution are included in the REST API 
> output.  The default value of withSummaries is false, i.e. no metrics 
> percentile distribution will be included in the REST API output.
>  * Query parameter quantiles define the quantiles we use to calculate metrics 
> distributions.  It takes effect only when {{withSummaries=true.}}  Its 
> default value is {{0.0,0.25,0.5,0.75,1.0.}}  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34488) Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage

2021-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34488.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31611
[https://github.com/apache/spark/pull/31611]

> Support task Metrics Distributions and executor Metrics Distributions in the 
> REST API call for a specified stage
> 
>
> Key: SPARK-34488
> URL: https://issues.apache.org/jira/browse/SPARK-34488
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ron Hu
>Assignee: Ron Hu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: executorMetricsDistributions.json, 
> taskMetricsDistributions.json
>
>
> For a specific stage, it is useful to show the task metrics in percentile 
> distribution.  This information can help users know whether or not there is a 
> skew/bottleneck among tasks in a given stage.  We list an example in 
> [^taskMetricsDistributions.json]
> Similarly, it is useful to show the executor metrics in percentile 
> distribution for a specific stage. This information can show whether or not 
> there is a skewed load on some executors.  We list an example in 
> [^executorMetricsDistributions.json]
>  
> We define withSummaries query parameter in the REST API for a specific stage 
> as:
> applications///?withSummaries=[true|false]&quantiles=0.0,0.25,0.5,0.75,1.0
>  * When withSummaries=true, both task metrics in percentile distribution and 
> executor metrics in percentile distribution are included in the REST API 
> output.  The default value of withSummaries is false, i.e. no metrics 
> percentile distribution will be included in the REST API output.
>  * Query parameter quantiles define the quantiles we use to calculate metrics 
> distributions.  It takes effect only when {{withSummaries=true.}}  Its 
> default value is {{0.0,0.25,0.5,0.75,1.0.}}  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34855) SparkContext - avoid using local lazy val

2021-03-24 Thread Li Xian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307717#comment-17307717
 ] 

Li Xian commented on SPARK-34855:
-

[https://github.com/apache/spark/pull/31953] I have submitted a PR to remove 
the local lazy val

> SparkContext - avoid using local lazy val
> -
>
> Key: SPARK-34855
> URL: https://issues.apache.org/jira/browse/SPARK-34855
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Li Xian
>Priority: Minor
> Attachments: Screen Shot 2021-03-24 at 5.41.22 PM.png
>
>
> `org.apache.spark.SparkContext#getCallSite` uses a local lazy val for 
> `callsite`. But in Scala 2.11, a local lazy val needs synchronization on the 
> containing object `this` (see 
> [https://docs.scala-lang.org/sips/improved-lazy-val-initialization.html#version-6---no-synchronization-on-this-and-concurrent-initialization-of-fields]
>  and [https://github.com/scala/scala-dev/issues/133]).
> `getCallSite` is called at job submission, and thus becomes a bottleneck if we 
> are submitting a large number of jobs on a single Spark session. We observed 
> threads blocked on this in our load test.
> !image-2021-03-24-17-42-50-412.png!
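>
> A hedged, self-contained sketch of the pattern being discussed (not the actual 
> SparkContext code; the stack-trace body is just a stand-in): the lazy variant 
> is initialized under a lock tied to the enclosing instance in Scala 2.11, so 
> concurrent callers contend, while a plain local val avoids the shared monitor. 
> The actual fix in the PR above may take a different route.
> {code:java}
> class CallSiteHolder {
>   // Local lazy val: per the links above, in Scala 2.11 its initialization is
>   // guarded by a lock on the enclosing instance, so callers contend on it.
>   def callSiteLazy(): String = {
>     lazy val callSite = Thread.currentThread().getStackTrace.mkString("\n")
>     callSite
>   }
>
>   // Plain local val: no shared monitor, no contention between callers.
>   def callSiteEager(): String = {
>     val callSite = Thread.currentThread().getStackTrace.mkString("\n")
>     callSite
>   }
> }
> {code}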



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34855) SparkContext - avoid using local lazy val

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34855:


Assignee: (was: Apache Spark)

> SparkContext - avoid using local lazy val
> -
>
> Key: SPARK-34855
> URL: https://issues.apache.org/jira/browse/SPARK-34855
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Li Xian
>Priority: Minor
> Attachments: Screen Shot 2021-03-24 at 5.41.22 PM.png
>
>
> `org.apache.spark.SparkContext#getCallSite` uses a local lazy val for 
> `callsite`. But in Scala 2.11, a local lazy val needs synchronization on the 
> containing object `this` (see 
> [https://docs.scala-lang.org/sips/improved-lazy-val-initialization.html#version-6---no-synchronization-on-this-and-concurrent-initialization-of-fields]
>  and [https://github.com/scala/scala-dev/issues/133]).
> `getCallSite` is called at job submission, and thus becomes a bottleneck if we 
> are submitting a large number of jobs on a single Spark session. We observed 
> threads blocked on this in our load test.
> !image-2021-03-24-17-42-50-412.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34855) SparkContext - avoid using local lazy val

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34855:


Assignee: Apache Spark

> SparkContext - avoid using local lazy val
> -
>
> Key: SPARK-34855
> URL: https://issues.apache.org/jira/browse/SPARK-34855
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Li Xian
>Assignee: Apache Spark
>Priority: Minor
> Attachments: Screen Shot 2021-03-24 at 5.41.22 PM.png
>
>
> `org.apache.spark.SparkContext#getCallSite` uses a local lazy val for 
> `callsite`. But in Scala 2.11, a local lazy val needs synchronization on the 
> containing object `this` (see 
> [https://docs.scala-lang.org/sips/improved-lazy-val-initialization.html#version-6---no-synchronization-on-this-and-concurrent-initialization-of-fields]
>  and [https://github.com/scala/scala-dev/issues/133]).
> `getCallSite` is called at job submission, and thus becomes a bottleneck if we 
> are submitting a large number of jobs on a single Spark session. We observed 
> threads blocked on this in our load test.
> !image-2021-03-24-17-42-50-412.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34855) SparkContext - avoid using local lazy val

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307718#comment-17307718
 ] 

Apache Spark commented on SPARK-34855:
--

User 'lxian' has created a pull request for this issue:
https://github.com/apache/spark/pull/31953

> SparkContext - avoid using local lazy val
> -
>
> Key: SPARK-34855
> URL: https://issues.apache.org/jira/browse/SPARK-34855
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Li Xian
>Priority: Minor
> Attachments: Screen Shot 2021-03-24 at 5.41.22 PM.png
>
>
> `org.apache.spark.SparkContext#getCallSite` uses a local lazy val for 
> `callsite`. But in Scala 2.11, a local lazy val needs synchronization on the 
> containing object `this` (see 
> [https://docs.scala-lang.org/sips/improved-lazy-val-initialization.html#version-6---no-synchronization-on-this-and-concurrent-initialization-of-fields]
>  and [https://github.com/scala/scala-dev/issues/133]).
> `getCallSite` is called at job submission, and thus becomes a bottleneck if we 
> are submitting a large number of jobs on a single Spark session. We observed 
> threads blocked on this in our load test.
> !image-2021-03-24-17-42-50-412.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34855) SparkContext - avoid using local lazy val

2021-03-24 Thread Li Xian (Jira)
Li Xian created SPARK-34855:
---

 Summary: SparkContext - avoid using local lazy val
 Key: SPARK-34855
 URL: https://issues.apache.org/jira/browse/SPARK-34855
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.7
Reporter: Li Xian
 Attachments: Screen Shot 2021-03-24 at 5.41.22 PM.png

`org.apache.spark.SparkContext#getCallSite` uses a local lazy val for `callsite`. 
But in Scala 2.11, a local lazy val needs synchronization on the containing 
object `this` (see 
[https://docs.scala-lang.org/sips/improved-lazy-val-initialization.html#version-6---no-synchronization-on-this-and-concurrent-initialization-of-fields]
 and [https://github.com/scala/scala-dev/issues/133]).

`getCallSite` is called at job submission, and thus becomes a bottleneck if we 
are submitting a large number of jobs on a single Spark session. We observed 
threads blocked on this in our load test.

!image-2021-03-24-17-42-50-412.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34855) SparkContext - avoid using local lazy val

2021-03-24 Thread Li Xian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Xian updated SPARK-34855:

Attachment: Screen Shot 2021-03-24 at 5.41.22 PM.png

> SparkContext - avoid using local lazy val
> -
>
> Key: SPARK-34855
> URL: https://issues.apache.org/jira/browse/SPARK-34855
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7
>Reporter: Li Xian
>Priority: Minor
> Attachments: Screen Shot 2021-03-24 at 5.41.22 PM.png
>
>
> `org.apache.spark.SparkContext#getCallSite` uses a local lazy val for 
> `callsite`. But in Scala 2.11, a local lazy val needs synchronization on the 
> containing object `this` (see 
> [https://docs.scala-lang.org/sips/improved-lazy-val-initialization.html#version-6---no-synchronization-on-this-and-concurrent-initialization-of-fields]
>  and [https://github.com/scala/scala-dev/issues/133]).
> `getCallSite` is called at job submission, and thus becomes a bottleneck if we 
> are submitting a large number of jobs on a single Spark session. We observed 
> threads blocked on this in our load test.
> !image-2021-03-24-17-42-50-412.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34684) Hadoop config could not be successfully serialized from driver pods to executor pods

2021-03-24 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307710#comment-17307710
 ] 

Attila Zsolt Piros commented on SPARK-34684:


I did not say there is no error, but by following my advice you can make sure 
your basic settings are right. This is how I can help you right now.

> Hadoop config could not be successfully serialized from driver pods to 
> executor pods
> ---
>
> Key: SPARK-34684
> URL: https://issues.apache.org/jira/browse/SPARK-34684
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Yue Peng
>Priority: Major
>
> I have set HADOOP_CONF_DIR correctly, and I have verified that the Hadoop 
> configs have been stored into a configmap and mounted to the driver. However, 
> the SparkPi example job keeps failing because the executors do not know how 
> to talk to HDFS. I highly suspect that there is a bug causing it, as manually 
> creating a configmap storing the Hadoop configs and mounting it to the 
> executor in the template file fixes the error. 
>  
> Spark submit command:
> /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --deploy-mode cluster --master k8s://https://10.***.18.96:6443 
> --num-executors 1 --conf spark.kubernetes.namespace=test --conf 
> spark.kubernetes.container.image= --conf 
> spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template
>  --conf 
> spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template
>   --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars 
> hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000
>  
>  
> Error log:
>  
> 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 608 ms (392 ms spent in bootstraps)
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users with view permissions: Set(root); groups 
> with view permissions: Set(); users with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 130 ms (104 ms spent in bootstraps)
> 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d
> 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 
> MiB
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
> spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor:
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered 
> with driver
> 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192
> 21/03/10 07:00:00 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956.
> 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on 
> 100.64.0.192:37956
> 21/03/10 07:00:00 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 1
> 21/03/10 07:00:01 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 21/03/10 07:00:01 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 21/03/10 07:00:01 INFO Executor: Fetching 
> spark://org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078/jars/spark-examples_2.12-3.0.125067.jar

[jira] [Commented] (SPARK-34756) Fix FileScan equality check

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307702#comment-17307702
 ] 

Apache Spark commented on SPARK-34756:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/31952

> Fix FileScan equality check
> ---
>
> Key: SPARK-34756
> URL: https://issues.apache.org/jira/browse/SPARK-34756
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> `&&` is missing from `FileScan.equals()`.
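
For illustration only (hypothetical class and fields, not the actual `FileScan` 
code), this is how a missing `&&` at the end of a line silently drops a condition 
from an equality check in Scala:

{code:scala}
final case class MyScan(path: String, schema: Seq[String], filters: Seq[String]) {
  // BUGGY: without the trailing &&, the first comparison is evaluated and
  // discarded, so equality effectively checks only `filters`.
  def buggyEquals(other: Any): Boolean = other match {
    case o: MyScan =>
      path == o.path && schema == o.schema // missing && here
      filters == o.filters
    case _ => false
  }

  // FIXED: all conditions are combined with &&.
  def fixedEquals(other: Any): Boolean = other match {
    case o: MyScan =>
      path == o.path && schema == o.schema &&
        filters == o.filters
    case _ => false
  }
}
{code}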



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34756) Fix FileScan equality check

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307700#comment-17307700
 ] 

Apache Spark commented on SPARK-34756:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/31952

> Fix FileScan equality check
> ---
>
> Key: SPARK-34756
> URL: https://issues.apache.org/jira/browse/SPARK-34756
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> `&&` is missing from `FileScan.equals()`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34756) Fix FileScan equality check

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307699#comment-17307699
 ] 

Apache Spark commented on SPARK-34756:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/31952

> Fix FileScan equality check
> ---
>
> Key: SPARK-34756
> URL: https://issues.apache.org/jira/browse/SPARK-34756
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> `&&` is missing from `FileScan.equals()`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307698#comment-17307698
 ] 

Apache Spark commented on SPARK-33482:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/31952

> V2 Datasources that extend FileScan preclude exchange reuse
> ---
>
> Key: SPARK-33482
> URL: https://issues.apache.org/jira/browse/SPARK-33482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Bruce Robbins
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> Sample query:
> {noformat}
> spark.read.parquet("tbl").createOrReplaceTempView("tbl")
> spark.read.parquet("lookup").createOrReplaceTempView("lookup")
> sql("""
>select tbl.col1, fk1, fk2
>from tbl, lookup l1, lookup l2
>where fk1 = l1.key
>and fk2 = l2.key
> """).explain
> {noformat}
> Test files can be created as follows:
> {noformat}
> import scala.util.Random
> val rand = Random
> val tbl = spark.range(1, 1).map { x =>
>   (rand.nextLong.abs % 20,
>rand.nextLong.abs % 20,
>x)
> }.toDF("fk1", "fk2", "col1")
> tbl.write.mode("overwrite").parquet("tbl")
> val lookup = spark.range(0, 20).map { x =>
>   (x + 1, x * 1, (x + 1) * 1)
> }.toDF("key", "col1", "col2")
> lookup.write.mode("overwrite").parquet("lookup")
> {noformat}
> Output with V1 Parquet reader:
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, 
> DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- FileScan parquet [key#6L] Batched: true, DataFilters: 
> [isnotnull(key#6L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
>+- ReusedExchange [key#12L], BroadcastExchange 
> HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
> {noformat}
> With V1 Parquet reader, the exchange for lookup is reused (see last line).
> Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""):
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: 
> [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct, PushedFilters: 
> [IsNotNull(fk1), IsNotNull(fk2)]
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- BatchScan[key#6L] ParquetScan DataFilters: 
> [isnotnull(key#6L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]),false), [id=#83]
>   +- *(2) Filter isnotnull(key#12L)
>  +- *(2) ColumnarToRow
> +- BatchScan[key#12L] ParquetScan DataFilters: 
> [isnotnull(key#12L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
> {noformat}
> With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 
> lines).
> You can see the same issue with the Orc reader (and 

[jira] [Updated] (SPARK-30641) Project Matrix: Linear Models revisit and refactor

2021-03-24 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30641:
-
Description: 
We have been refactoring the linear models for a long time, and there is still 
some work left for the future. After some discussion among [~huaxingao] [~srowen] 
[~weichenxu123] [~mengxr] [~podongfeng], we decided to gather the related work 
under a sub-project, Matrix. It includes:
 # *Blockification (vectorization of vectors)*
 ** Vectors are stacked into matrices, so that high-level BLAS can be used for 
better performance (about ~3x faster on sparse datasets, up to ~18x faster on 
dense datasets; see SPARK-31783 for details).
 ** Since 3.1.1, LoR/SVC/LiR/AFT support blockification, and we need to 
blockify KMeans in the future.
 # *Standardization (virtual centering)*
 ** The existing implementation of standardization in linear models does NOT center 
the vectors by removing the means, in order to keep the dataset _*sparsity*_. 
However, this causes feature values with small variance to be scaled to large 
values, and underlying solvers like LBFGS cannot handle this case efficiently; 
see SPARK-34448 for details.
 ** If the internal vectors are centered (as in other well-known implementations, 
e.g. GLMNET/Scikit-Learn), the convergence rate is better. In the case in 
SPARK-34448, the number of iterations to convergence is reduced from 93 to 
6. Moreover, the final solution is much closer to the one from GLMNET.
 ** Luckily, we found a way to _*virtually*_ center the vectors without 
densifying the dataset (a minimal sketch of the idea follows this description). 
Good results have been observed in LoR, and we will apply it to the other linear 
models.
 # _*Initialization (To be discussed)*_
 ** Initializing the model coefficients from a given model should be beneficial 
to: 1, the convergence rate (it should reduce the number of iterations); 2, model 
stability (it may yield a new solution closer to the previous one).
 # _*Early Stopping* *(To be discussed)*_
 ** We can compute the test error during training (as tree models do) and stop 
the training procedure once the test error begins to increase.

 

  If you want to add other features to these models, please comment in the 
ticket.
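
A minimal sketch of the virtual-centering idea referenced above (illustrative 
names, not the actual MLlib internals): the centering contribution collapses into 
a constant offset, so only the non-zero entries of a sparse vector ever need to 
be touched.

{code:scala}
import org.apache.spark.ml.linalg.Vector

// Evaluates w . ((x - mu) / sigma) without densifying a sparse x.
def centeredDot(w: Array[Double], mu: Array[Double], sigma: Array[Double], x: Vector): Double = {
  // The centering term -sum_j w(j) * mu(j) / sigma(j) does not depend on x,
  // so it can be precomputed once per iteration.
  val offset = w.indices.map(j => -w(j) * mu(j) / sigma(j)).sum
  var acc = offset
  // Only the non-zero entries of x are visited, preserving sparsity.
  x.foreachActive { (j, v) => acc += w(j) * v / sigma(j) }
  acc
}
{code}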

  was:
We had been refactoring linear models for a long time, and there still are some 
works in the future. After some discuss among [~huaxingao] [~srowen] 
[~weichenxu123] , we decide to gather related works under a sub-project Matrix, 
it include:
 # *Blockification (vectorization of vectors)*
 ** vectors are stacked into matrices, so that high-level BLAS can be used for 
better performance. (about ~3x faster on sparse datasets, up to ~18x faster on 
dense datasets, see SPARK-31783 for details).
 ** Since 3.1.1, LoR/SVC/LiR/AFT supports blockification, and we need to 
blockify KMeans in the future.
 # *Standardization (virutal centering)*
 ** Existing impl of standardization in linear models does NOT center the 
vectors by removing the means, for the purpose of keeping dataset _*sparsity*_. 
However, this will cause feature values with small var be scaled to large 
values, and underlying solver like LBFGS can not efficiently handle this case. 
see SPARK-34448 for details.
 ** If internal vectors are centers (like other famous impl, i.e. 
GLMNET/Scikit-Learn), the convergence ratio will be better. In the case in 
SPARK-34448, the number of iteration to convergence will be reduced from 93 to 
6. Moreover, the final solution is much more close to the one in GLMNET.
 ** Luckily, we find a new way to _*virtually*_ center the vectors without 
densifying the dataset. Good results had been observed in LoR, we will take it 
into account in other linear models.
 # _*Initialization (To be discussed)*_
 ** Initializing model coef with a given model, should be beneficial to: 1, 
convergence ratio (should reduce number of iterations); 2, model stability (may 
obtain a new solution more close to the previous one);
 # _*Early Stopping* *(To be discussed)*_
 ** we can compute the test error in the procedure (like tree models), and stop 
the training procedure if test error begin to increase;

 

  If you want to add other features in these models, please comment in the 
ticket.


> Project Matrix: Linear Models revisit and refactor
> --
>
> Key: SPARK-30641
> URL: https://issues.apache.org/jira/browse/SPARK-30641
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0, 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> We had been refactoring linear models for a long time, and there still are 
> some works in the future. After some discuss among [~huaxingao] [~srowen] 
> [~weichenxu123] [~mengxr] [~podongfeng] , we decide to gather related works 
> under a sub-project Matrix, it includes:
>  # *Blockification (vectorization of vectors)*
>  ** 

[jira] [Commented] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307697#comment-17307697
 ] 

Apache Spark commented on SPARK-33482:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/31952

> V2 Datasources that extend FileScan preclude exchange reuse
> ---
>
> Key: SPARK-33482
> URL: https://issues.apache.org/jira/browse/SPARK-33482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Bruce Robbins
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> Sample query:
> {noformat}
> spark.read.parquet("tbl").createOrReplaceTempView("tbl")
> spark.read.parquet("lookup").createOrReplaceTempView("lookup")
> sql("""
>select tbl.col1, fk1, fk2
>from tbl, lookup l1, lookup l2
>where fk1 = l1.key
>and fk2 = l2.key
> """).explain
> {noformat}
> Test files can be created as follows:
> {noformat}
> import scala.util.Random
> val rand = Random
> val tbl = spark.range(1, 1).map { x =>
>   (rand.nextLong.abs % 20,
>rand.nextLong.abs % 20,
>x)
> }.toDF("fk1", "fk2", "col1")
> tbl.write.mode("overwrite").parquet("tbl")
> val lookup = spark.range(0, 20).map { x =>
>   (x + 1, x * 1, (x + 1) * 1)
> }.toDF("key", "col1", "col2")
> lookup.write.mode("overwrite").parquet("lookup")
> {noformat}
> Output with V1 Parquet reader:
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, 
> DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- FileScan parquet [key#6L] Batched: true, DataFilters: 
> [isnotnull(key#6L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
>+- ReusedExchange [key#12L], BroadcastExchange 
> HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
> {noformat}
> With V1 Parquet reader, the exchange for lookup is reused (see last line).
> Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""):
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: 
> [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct, PushedFilters: 
> [IsNotNull(fk1), IsNotNull(fk2)]
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- BatchScan[key#6L] ParquetScan DataFilters: 
> [isnotnull(key#6L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]),false), [id=#83]
>   +- *(2) Filter isnotnull(key#12L)
>  +- *(2) ColumnarToRow
> +- BatchScan[key#12L] ParquetScan DataFilters: 
> [isnotnull(key#12L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
> {noformat}
> With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 
> lines).
> You can see the same issue with the Orc reader (and 

[jira] [Commented] (SPARK-30641) Project Matrix: Linear Models revisit and refactor

2021-03-24 Thread Weichen Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307690#comment-17307690
 ] 

Weichen Xu commented on SPARK-30641:


Good work!

> Project Matrix: Linear Models revisit and refactor
> --
>
> Key: SPARK-30641
> URL: https://issues.apache.org/jira/browse/SPARK-30641
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0, 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> We had been refactoring linear models for a long time, and there still are 
> some works in the future. After some discuss among [~huaxingao] [~srowen] 
> [~weichenxu123] , we decide to gather related works under a sub-project 
> Matrix, it include:
>  # *Blockification (vectorization of vectors)*
>  ** vectors are stacked into matrices, so that high-level BLAS can be used 
> for better performance. (about ~3x faster on sparse datasets, up to ~18x 
> faster on dense datasets, see SPARK-31783 for details).
>  ** Since 3.1.1, LoR/SVC/LiR/AFT supports blockification, and we need to 
> blockify KMeans in the future.
>  # *Standardization (virutal centering)*
>  ** Existing impl of standardization in linear models does NOT center the 
> vectors by removing the means, for the purpose of keeping dataset 
> _*sparsity*_. However, this will cause feature values with small var be 
> scaled to large values, and underlying solver like LBFGS can not efficiently 
> handle this case. see SPARK-34448 for details.
>  ** If internal vectors are centers (like other famous impl, i.e. 
> GLMNET/Scikit-Learn), the convergence ratio will be better. In the case in 
> SPARK-34448, the number of iteration to convergence will be reduced from 93 
> to 6. Moreover, the final solution is much more close to the one in GLMNET.
>  ** Luckily, we find a new way to _*virtually*_ center the vectors without 
> densifying the dataset. Good results had been observed in LoR, we will take 
> it into account in other linear models.
>  # _*Initialization (To be discussed)*_
>  ** Initializing model coef with a given model, should be beneficial to: 1, 
> convergence ratio (should reduce number of iterations); 2, model stability 
> (may obtain a new solution more close to the previous one);
>  # _*Early Stopping* *(To be discussed)*_
>  ** we can compute the test error in the procedure (like tree models), and 
> stop the training procedure if test error begin to increase;
>  
>   If you want to add other features in these models, please comment in 
> the ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34684) Hadoop config could not be successfully serilized from driver pods to executor pods

2021-03-24 Thread Yue Peng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307686#comment-17307686
 ] 

Yue Peng commented on SPARK-34684:
--

Why do we need to do this? Accessing files from the executor should be supported 
by default, right? Otherwise, how could the executor get the jar it needs to run?

> Hadoop config could not be successfully serilized from driver pods to 
> executor pods
> ---
>
> Key: SPARK-34684
> URL: https://issues.apache.org/jira/browse/SPARK-34684
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Yue Peng
>Priority: Major
>
> I have set HADOOP_CONF_DIR correctly, and I have verified that the Hadoop configs 
> have been stored in a configmap and mounted to the driver. However, the Spark Pi 
> example job keeps failing because the executor does not know how to talk to HDFS. I 
> highly suspect that a bug is causing it, as manually creating a configmap storing 
> the Hadoop configs and mounting it to the executor in the template file 
> fixes the error. 
>  
> Spark submit command:
> /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --deploy-mode cluster --master k8s://https://10.***.18.96:6443 
> --num-executors 1 --conf spark.kubernetes.namespace=test --conf 
> spark.kubernetes.container.image= --conf 
> spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template
>  --conf 
> spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template
>   --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars 
> hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000
>  
>  
> Error log:
>  
> 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 608 ms (392 ms spent in bootstraps)
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users with view permissions: Set(root); groups 
> with view permissions: Set(); users with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 130 ms (104 ms spent in bootstraps)
> 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d
> 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 
> MiB
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
> spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor:
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered 
> with driver
> 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192
> 21/03/10 07:00:00 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956.
> 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on 
> 100.64.0.192:37956
> 21/03/10 07:00:00 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 1
> 21/03/10 07:00:01 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 21/03/10 07:00:01 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 21/03/10 07:00:01 INFO Executor: Fetching 
> spark://org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078/jars/spark-examples_2.12-3.0.125067.jar
>  with 

[jira] [Assigned] (SPARK-34850) Multiply day-time interval by numeric

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34850:


Assignee: Apache Spark  (was: Max Gekk)

> Multiply day-time interval by numeric
> -
>
> Key: SPARK-34850
> URL: https://issues.apache.org/jira/browse/SPARK-34850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the multiply op over day-time interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType
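
A hedged usage illustration of the multiplication described above, assuming an 
active SparkSession `spark` and the ANSI day-time interval literal syntax:

{code:scala}
// Multiply a day-time interval by an integral and by a fractional numeric.
spark.sql("SELECT INTERVAL '1 12:00:00' DAY TO SECOND * 2").show(truncate = false)
// expected result: 3 days
spark.sql("SELECT INTERVAL '1 00:00:00' DAY TO SECOND * 1.5").show(truncate = false)
// expected result: 1 day 12 hours
{code}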



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34850) Multiply day-time interval by numeric

2021-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34850:


Assignee: Max Gekk  (was: Apache Spark)

> Multiply day-time interval by numeric
> -
>
> Key: SPARK-34850
> URL: https://issues.apache.org/jira/browse/SPARK-34850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the multiply op over day-time interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34850) Multiply day-time interval by numeric

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307655#comment-17307655
 ] 

Apache Spark commented on SPARK-34850:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31951

> Multiply day-time interval by numeric
> -
>
> Key: SPARK-34850
> URL: https://issues.apache.org/jira/browse/SPARK-34850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support the multiply op over day-time interval by numeric types including:
> # ByteType
> # ShortType
> # IntegerType
> # LongType
> # FloatType
> # DoubleType
> # DecimalType



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34345) Allow several properties files

2021-03-24 Thread Arseniy Tashoyan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-34345:
-
Component/s: Kubernetes

> Allow several properties files
> --
>
> Key: SPARK-34345
> URL: https://issues.apache.org/jira/browse/SPARK-34345
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Arseniy Tashoyan
>Priority: Major
>
> Example: we have 2 applications A and B. These applications have some common 
> Spark settings and some application-specific settings. The idea is to run 
> them like this:
> {code:bash}
> spark-submit --properties-files common.properties,a.properties A
> spark-submit --properties-files common.properties,b.properties B
> {code}
> Benefits:
>  - Common settings can be extracted to a common file _common.properties_, no 
> need to copy them over _a.properties_ and _b.properties_
>  - Applications can override common settings in their respective custom 
> properties files
> Currently the following mechanism works in SparkSubmitArguments.scala: 
> console arguments like _--conf key=value_ overwrite settings in the 
> properties file. This is not enough, because console arguments should be 
> specified in the launcher script; de-facto they belong to the binary 
> distribution rather than the configuration.
> Consider the following scenario: Spark on Kubernetes, the configuration is 
> provided as a ConfigMap. We could have the following ConfigMaps:
>  - _a.properties_ // mount to the Pod with application A
>  - _b.properties_ // mount to the Pod with application B
>  - _common.properties_ // mount to both Pods with A and B
>  Meanwhile the launcher script _app-submit.sh_ is the same for both 
> applications A and B, since it contains no configuration settings:
> {code:bash}
> spark-submit --properties-files common.properties,${app_name}.properties ...
> {code}
> *Alternate solution*
> Use Typesafe Config for Spark settings instead of properties files. Typesafe 
> Config allows including files.
>  For example, settings for the application A - _a.conf_:
> {code:yaml}
> include required("common.conf")
> spark.sql.shuffle.partitions = 240
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34345) Allow several properties files

2021-03-24 Thread Arseniy Tashoyan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307644#comment-17307644
 ] 

Arseniy Tashoyan commented on SPARK-34345:
--

Typesafe Config seems attractive, as it provides the following features:
 * _include_ directives
 * environment variable injection

This is convenient when running an application in Kubernetes and providing the 
configuration in ConfigMaps.

As an alternative, a Scala wrapper such as 
[Pureconfig|https://pureconfig.github.io] could be used.

Implementation details (a minimal loading sketch follows this list):
 * Ship reference files _reference.conf_ within the Spark libraries, containing 
the default (reference) values for all Spark settings.
 ** Different Spark libraries may have their own _reference.conf_ files. 
For example, _spark-sql/reference.conf_ contains settings specific to Spark SQL, 
and so on.
 * Use the Config API to get values.
 ** Clean up default values currently coded in Scala or Java.
 * Introduce a new command-line argument for spark-submit: _--config-file_.
 ** When both _--config-file_ and _--properties-file_ are specified, ignore the 
latter and print a warning.
 ** When only _--properties-file_ is specified, use the legacy way and print a 
deprecation warning.
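
A minimal loading sketch, assuming the _a.conf_ / _common.conf_ layout from the 
ticket description (not a proposal for the final spark-submit API):

{code:scala}
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

// a.conf starts with `include required("common.conf")` and overrides a few keys.
val conf: Config = ConfigFactory
  .parseFile(new File("a.conf")) // resolves the include relative to a.conf
  .resolve()                     // expands ${...} substitutions, including environment variables

val shufflePartitions: Int = conf.getInt("spark.sql.shuffle.partitions")
{code}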

> Allow several properties files
> --
>
> Key: SPARK-34345
> URL: https://issues.apache.org/jira/browse/SPARK-34345
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Arseniy Tashoyan
>Priority: Major
>
> Example: we have 2 applications A and B. These applications have some common 
> Spark settings and some application-specific settings. The idea is to run 
> them like this:
> {code:bash}
> spark-submit --properties-files common.properties,a.properties A
> spark-submit --properties-files common.properties,b.properties B
> {code}
> Benefits:
>  - Common settings can be extracted to a common file _common.properties_, no 
> need to copy them over _a.properties_ and _b.properties_
>  - Applications can override common settings in their respective custom 
> properties files
> Currently the following mechanism works in SparkSubmitArguments.scala: 
> console arguments like _--conf key=value_ overwrite settings in the 
> properties file. This is not enough, because console arguments should be 
> specified in the launcher script; de-facto they belong to the binary 
> distribution rather than the configuration.
> Consider the following scenario: Spark on Kubernetes, the configuration is 
> provided as a ConfigMap. We could have the following ConfigMaps:
>  - _a.properties_ // mount to the Pod with application A
>  - _b.properties_ // mount to the Pod with application B
>  - _common.properties_ // mount to both Pods with A and B
>  Meanwhile the launcher script _app-submit.sh_ is the same for both 
> applications A and B, since it contains no configuration settings:
> {code:bash}
> spark-submit --properties-files common.properties,${app_name}.properties ...
> {code}
> *Alternate solution*
> Use Typesafe Config for Spark settings instead of properties files. Typesafe 
> Config allows including files.
>  For example, settings for the application A - _a.conf_:
> {code:yaml}
> include required("common.conf")
> spark.sql.shuffle.partitions = 240
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude

2021-03-24 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34295.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31761
[https://github.com/apache/spark/pull/31761]

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
> 
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from 
> certain hosts by specifying the hosts with the configuration 
> mapreduce.job.hdfs-servers.token-renewal.exclude=,,..,.
> But Spark seems to lack a similar option, so job submission fails if YARN 
> fails to renew the delegation token for any of the remote HDFS clusters. The 
> failure in DT renewal can happen for many reasons, e.g. the remote HDFS does 
> not trust the Kerberos identity of YARN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34851) Add scope tag in configuration

2021-03-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-34851:
--
Description: 
According to the discussion at 
[https://github.com/apache/spark/pull/31598#discussion_r586155367], we need to 
add a scope tag to each configuration.

> Add scope tag in configuration
> --
>
> Key: SPARK-34851
> URL: https://issues.apache.org/jira/browse/SPARK-34851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> According to discussion at 
> [https://github.com/apache/spark/pull/31598#discussion_r586155367]
> We need to add a scope tag in each configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34345) Allow several properties files

2021-03-24 Thread Arseniy Tashoyan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-34345:
-
Affects Version/s: 3.1.1

> Allow several properties files
> --
>
> Key: SPARK-34345
> URL: https://issues.apache.org/jira/browse/SPARK-34345
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Arseniy Tashoyan
>Priority: Major
>
> Example: we have 2 applications A and B. These applications have some common 
> Spark settings and some application-specific settings. The idea is to run 
> them like this:
> {code:bash}
> spark-submit --properties-files common.properties,a.properties A
> spark-submit --properties-files common.properties,b.properties B
> {code}
> Benefits:
>  - Common settings can be extracted to a common file _common.properties_, no 
> need to copy them over _a.properties_ and _b.properties_
>  - Applications can override common settings in their respective custom 
> properties files
> Currently the following mechanism works in SparkSubmitArguments.scala: 
> console arguments like _--conf key=value_ overwrite settings in the 
> properties file. This is not enough, because console arguments should be 
> specified in the launcher script; de-facto they belong to the binary 
> distribution rather than the configuration.
> Consider the following scenario: Spark on Kubernetes, the configuration is 
> provided as a ConfigMap. We could have the following ConfigMaps:
>  - _a.properties_ // mount to the Pod with application A
>  - _b.properties_ // mount to the Pod with application B
>  - _common.properties_ // mount to both Pods with A and B
>  Meanwhile the launcher script _app-submit.sh_ is the same for both 
> applications A and B, since it contains no configuration settings:
> {code:bash}
> spark-submit --properties-files common.properties,${app_name}.properties ...
> {code}
> *Alternate solution*
> Use Typesafe Config for Spark settings instead of properties files. Typesafe 
> Config allows including files.
>  For example, settings for the application A - _a.conf_:
> {code:yaml}
> include required("common.conf")
> spark.sql.shuffle.partitions = 240
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34345) Allow several properties files

2021-03-24 Thread Arseniy Tashoyan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-34345:
-
Priority: Major  (was: Minor)

> Allow several properties files
> --
>
> Key: SPARK-34345
> URL: https://issues.apache.org/jira/browse/SPARK-34345
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.1
>Reporter: Arseniy Tashoyan
>Priority: Major
>
> Example: we have 2 applications A and B. These applications have some common 
> Spark settings and some application-specific settings. The idea is to run 
> them like this:
> {code:bash}
> spark-submit --properties-files common.properties,a.properties A
> spark-submit --properties-files common.properties,b.properties B
> {code}
> Benefits:
>  - Common settings can be extracted to a common file _common.properties_, no 
> need to copy them over _a.properties_ and _b.properties_
>  - Applications can override common settings in their respective custom 
> properties files
> Currently the following mechanism works in SparkSubmitArguments.scala: 
> console arguments like _--conf key=value_ overwrite settings in the 
> properties file. This is not enough, because console arguments should be 
> specified in the launcher script; de-facto they belong to the binary 
> distribution rather than the configuration.
> Consider the following scenario: Spark on Kubernetes, the configuration is 
> provided as a ConfigMap. We could have the following ConfigMaps:
>  - _a.properties_ // mount to the Pod with application A
>  - _b.properties_ // mount to the Pod with application B
>  - _common.properties_ // mount to both Pods with A and B
>  Meanwhile the launcher script _app-submit.sh_ is the same for both 
> applications A and B, since it contains no configuration settings:
> {code:bash}
> spark-submit --properties-files common.properties,${app_name}.properties ...
> {code}
> *Alternate solution*
> Use Typesafe Config for Spark settings instead of properties files. Typesafe 
> Config allows including files.
>  For example, settings for the application A - _a.conf_:
> {code:yaml}
> include required("common.conf")
> spark.sql.shuffle.partitions = 240
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31531) sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during spark-submit

2021-03-24 Thread Arseniy Tashoyan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-31531:
-
Attachment: image.png

> sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner() method not found during 
> spark-submit
> ---
>
> Key: SPARK-31531
> URL: https://issues.apache.org/jira/browse/SPARK-31531
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: shayoni Halder
>Priority: Major
> Attachments: error.PNG, image.png
>
>
> I am trying to run the following spark-submit from a VM using YARN cluster 
> mode.
>  ./spark-submit --master yarn --deploy-mode client test_spark_yarn.py
> The VM has Java 11 and Spark 2.4.5, while the YARN cluster has Java 8 and 
> Spark 2.4.0. I am getting the error below:
> !error.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34854) Report metrics for streaming source through progress reporter with Kafka source use-case

2021-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307631#comment-17307631
 ] 

Apache Spark commented on SPARK-34854:
--

User 'yijiacui-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/31944

> Report metrics for streaming source through progress reporter with Kafka 
> source use-case
> 
>
> Key: SPARK-34854
> URL: https://issues.apache.org/jira/browse/SPARK-34854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yijia Cui
>Priority: Major
>
> Report metrics for streaming sources to users through the progress reporter.
> Add a Kafka micro-batch streaming use case that reports how many offsets 
> the current offset falls behind the latest one.
>  
> SPARK-34366 and SPARK-34297 report metrics in the Spark UI, but this issue 
> reports metrics through the progress report, making them available to users via 
> a listener.
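
A hedged sketch of how a user could consume the reported metrics once they appear 
in the progress report, using the existing listener API (assuming an active 
SparkSession `spark`):

{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // The full progress JSON includes one entry per source with its stats;
    // source-specific metrics (e.g. how far the Kafka source is behind the
    // latest available offsets) would show up there.
    println(event.progress.json)
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
{code}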



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


