[jira] [Created] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is not activated explicitly

2021-07-09 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36068:
--

 Summary: No tests in hadoop-cloud run unless hadoop-3.2 profile is 
not activated explicitly
 Key: SPARK-36068
 URL: https://issues.apache.org/jira/browse/SPARK-36068
 Project: Spark
  Issue Type: Bug
  Components: Build, Tests
Affects Versions: 3.2.0, 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
activated explicitly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is not activated explicitly

2021-07-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36068:
---
Description: 
No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
activated explicitly.
This issue is similar to SPARK-36067.

  was:No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile 
is activated explicitly.


> No tests in hadoop-cloud run unless hadoop-3.2 profile is not activated 
> explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36068:
---
Summary: No tests in hadoop-cloud run unless hadoop-3.2 profile is 
activated explicitly  (was: No tests in hadoop-cloud run unless hadoop-3.2 
profile is not activated explicitly)

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36068:


Assignee: Apache Spark  (was: Kousuke Saruta)

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377888#comment-17377888
 ] 

Apache Spark commented on SPARK-36068:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33277

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36068:


Assignee: Kousuke Saruta  (was: Apache Spark)

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377890#comment-17377890
 ] 

Apache Spark commented on SPARK-36068:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33277

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36058) Support replicasets/job API

2021-07-09 Thread Klaus Ma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377893#comment-17377893
 ] 

Klaus Ma commented on SPARK-36058:
--

In Volcano, we have Volcano Job to run MPI/TensorFlow jobs (the pods are created 
by the controller/operator); but for a Spark job, the executor pods are created by 
the driver pod, which is different. If Spark pods (both driver and executor) could 
be created by an operator, we could use Volcano Job to keep it simple :)

 

> Support replicasets/job API
> ---
>
> Key: SPARK-36058
> URL: https://issues.apache.org/jira/browse/SPARK-36058
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> Volcano & Yunikorn both support scheduling individual pods, but they also 
> support higher-level abstractions similar to the vanilla Kube replicasets, 
> which we can use to improve scheduling performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36058) Support replicasets/job API

2021-07-09 Thread Klaus Ma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377895#comment-17377895
 ] 

Klaus Ma commented on SPARK-36058:
--

xref for volcano job example: 
[https://github.com/volcano-sh/volcano/blob/master/example/integrations/mpi/mpi-example.yaml]
 

> Support replicasets/job API
> ---
>
> Key: SPARK-36058
> URL: https://issues.apache.org/jira/browse/SPARK-36058
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> Volcano & Yunikorn both support scheduling individual pods, but they also 
> support higher-level abstractions similar to the vanilla Kube replicasets, 
> which we can use to improve scheduling performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35571) tag v3.0.0 org.apache.spark.sql.catalyst.parser.AstBuilder import error

2021-07-09 Thread geekyouth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

geekyouth updated SPARK-35571:
--
Attachment: screenshot-1.png

> tag v3.0.0 org.apache.spark.sql.catalyst.parser.AstBuilder import error
> ---
>
> Key: SPARK-35571
> URL: https://issues.apache.org/jira/browse/SPARK-35571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: geekyouth
>Priority: Major
> Attachments: screenshot-1.png
>
>
> org.apache.spark.sql.catalyst.parser.AstBuilder:
> https://github.com/apache/spark/blob/v3.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
> line 36:
> `import org.apache.spark.sql.catalyst.parser.SqlBaseParser._`
> SqlBaseParser does not exist in the package `org.apache.spark.sql.catalyst.parser`:
> https://github.com/apache/spark/tree/v3.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser
> Also, at line 54:
> SqlBaseBaseVisitor cannot be resolved either, so the file does not compile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35571) tag v3.0.0 org.apache.spark.sql.catalyst.parser.AstBuilder import error

2021-07-09 Thread geekyouth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377901#comment-17377901
 ] 

geekyouth commented on SPARK-35571:
---

 !screenshot-1.png! 

That works for me (y)

> tag v3.0.0 org.apache.spark.sql.catalyst.parser.AstBuilder import error
> ---
>
> Key: SPARK-35571
> URL: https://issues.apache.org/jira/browse/SPARK-35571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: geekyouth
>Priority: Major
> Attachments: screenshot-1.png
>
>
> org.apache.spark.sql.catalyst.parser.AstBuilder:
> https://github.com/apache/spark/blob/v3.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
> line 36:
> `import org.apache.spark.sql.catalyst.parser.SqlBaseParser._`
> SqlBaseParser does not exist in the package `org.apache.spark.sql.catalyst.parser`:
> https://github.com/apache/spark/tree/v3.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser
> Also, at line 54:
> SqlBaseBaseVisitor cannot be resolved either, so the file does not compile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36044) Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36044:


Assignee: Apache Spark

> Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp
> -
>
> Key: SPARK-36044
> URL: https://issues.apache.org/jira/browse/SPARK-36044
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> The functions unix_timestamp/to_unix_timestamp should be able to accept input 
> of TimestampNTZ type.
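
A sketch of the intended usage (the TIMESTAMP_NTZ literal syntax below is an 
assumption based on the wider TimestampNTZ work targeted for 3.2 and is not 
confirmed in this thread):

{code:java}
// Expected per this ticket: both functions accept a TimestampNTZ input
// instead of failing to resolve.
spark.sql("SELECT to_unix_timestamp(TIMESTAMP_NTZ '2021-07-09 00:00:00')").show()
spark.sql("SELECT unix_timestamp(TIMESTAMP_NTZ '2021-07-09 00:00:00')").show()
{code}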



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36044) Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36044:


Assignee: (was: Apache Spark)

> Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp
> -
>
> Key: SPARK-36044
> URL: https://issues.apache.org/jira/browse/SPARK-36044
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The functions unix_timestamp/to_unix_timestamp should be able to accept input 
> of TimestampNTZ type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36044) Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377907#comment-17377907
 ] 

Apache Spark commented on SPARK-36044:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33278

> Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp
> -
>
> Key: SPARK-36044
> URL: https://issues.apache.org/jira/browse/SPARK-36044
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The functions unix_timestamp/to_unix_timestamp should be able to accept input 
> of TimestampNTZ type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-36044) Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp

2021-07-09 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-36044:
---
Comment: was deleted

(was: I'm working on this.)

> Support TimestampNTZ in functions unix_timestamp/to_unix_timestamp
> -
>
> Key: SPARK-36044
> URL: https://issues.apache.org/jira/browse/SPARK-36044
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The functions unix_timestamp/to_unix_timestamp should be able to accept input 
> of TimestampNTZ type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36069) spark function from_json output field name, field type and field value when FAILFAST mode throw exception

2021-07-09 Thread geekyouth (Jira)
geekyouth created SPARK-36069:
-

 Summary: spark function from_json output field name, field type 
and field value when FAILFAST mode throw exception
 Key: SPARK-36069
 URL: https://issues.apache.org/jira/browse/SPARK-36069
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: geekyouth


The Spark function from_json outputs an error message when FAILFAST mode throws 
an exception.

But the message does not contain important information, for example the field 
name, field value, or field type.

This information is very important for developers to find where the bad input 
data is located.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36069) spark function from_json should output field name, field type and field value when FAILFAST mode throw exception

2021-07-09 Thread geekyouth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

geekyouth updated SPARK-36069:
--
Summary: spark function from_json should output field name, field type and 
field value when FAILFAST mode throw exception  (was: spark function from_json 
output field name, field type and field value when FAILFAST mode throw 
exception)

> spark function from_json should output field name, field type and field value 
> when FAILFAST mode throw exception
> 
>
> Key: SPARK-36069
> URL: https://issues.apache.org/jira/browse/SPARK-36069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: geekyouth
>Priority: Major
>
> The Spark function from_json outputs an error message when FAILFAST mode 
> throws an exception.
>  
> But the message does not contain important information, for example the field 
> name, field value, or field type.
>  
> This information is very important for developers to find where the bad input 
> data is located.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36035) Adjust `test_astype`, `test_neg` for old pandas versions

2021-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36035:


Assignee: Xinrong Meng

> Adjust `test_astype`, `test_neg` for old pandas versions
> 
>
> Key: SPARK-36035
> URL: https://issues.apache.org/jira/browse/SPARK-36035
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> * test_astype
> For pandas < 1.1.0, declaring or converting to StringDtype was in general 
> only possible if the data was already only str or nan-like (GH31204).
> In pandas 1.1.0 this was changed so that all dtypes can be converted to 
> StringDtype; see 
> [https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.1.0.html#all-dtypes-can-now-be-converted-to-stringdtype].
> That should be taken into account in `test_astype`; otherwise the current tests 
> will fail with pandas < 1.1.0.
>  * test_neg
> {code:java}
> import pandas as pd
> 
> dtypes = [
>   "Int8",
>   "Int16",
>   "Int32",
>   "Int64",
> ]
> psers = []
> for dtype in dtypes:
>   psers.append(pd.Series([1, 2, 3, None], dtype=dtype))
>   
> for pser in psers:
>   print((-pser).dtype){code}
>  ~ 1.0.5, object dtype
>  1.1.0~1.1.2, TypeError: bad operand type for unary -: 'IntegerArray'
>  1.1.3, correct respective dtype



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36035) Adjust `test_astype`, `test_neg` for old pandas versions

2021-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36035.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33250
[https://github.com/apache/spark/pull/33250]

> Adjust `test_astype`, `test_neg` for old pandas versions
> 
>
> Key: SPARK-36035
> URL: https://issues.apache.org/jira/browse/SPARK-36035
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> * test_astype
> For pandas < 1.1.0, declaring or converting to StringDtype was in general 
> only possible if the data was already only str or nan-like (GH31204).
> In pandas 1.1.0 this was changed so that all dtypes can be converted to 
> StringDtype; see 
> [https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.1.0.html#all-dtypes-can-now-be-converted-to-stringdtype].
> That should be taken into account in `test_astype`; otherwise the current tests 
> will fail with pandas < 1.1.0.
>  * test_neg
> {code:java}
> import pandas as pd
> 
> dtypes = [
>   "Int8",
>   "Int16",
>   "Int32",
>   "Int64",
> ]
> psers = []
> for dtype in dtypes:
>   psers.append(pd.Series([1, 2, 3, None], dtype=dtype))
>   
> for pser in psers:
>   print((-pser).dtype){code}
>  ~ 1.0.5, object dtype
>  1.1.0~1.1.2, TypeError: bad operand type for unary -: 'IntegerArray'
>  1.1.3, correct respective dtype



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36068.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33277
[https://github.com/apache/spark/pull/33277]

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36066) UTF8String trimAll() only can trim space but not ({@literal <=} ASCII 32)

2021-07-09 Thread liukai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liukai updated SPARK-36066:
---
Description: 
In this method, Character.isWhitespace() is used for the check. But 
Character.isWhitespace() does not match the method's definition, which is to 
trim all characters {@literal <=} ASCII 32.
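
A quick way to see the mismatch described above (a sketch; it simply lists the 
code points at or below ASCII 32 that Character.isWhitespace() does not treat as 
whitespace, such as NUL and ESC):

{code:java}
// Code points <= 32 for which Character.isWhitespace is false; a trim based on
// isWhitespace would leave these characters in place.
(0 to 32).filterNot(c => Character.isWhitespace(c.toChar)).foreach { c =>
  println(f"U+$c%04X isWhitespace=false")
}
{code}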

 

> UTF8String trimAll() only can trim space but not ({@literal <=} ASCII 32)
> -
>
> Key: SPARK-36066
> URL: https://issues.apache.org/jira/browse/SPARK-36066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.1.2
>Reporter: liukai
>Priority: Major
>
> In this method, Character.isWhitespace() is used for the check. But 
> Character.isWhitespace() does not match the method's definition, which is to 
> trim all characters {@literal <=} ASCII 32.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36070) Add time cost info for writing rows out and committing the task.

2021-07-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-36070:


 Summary: Add time cost info for writing rows out and committing 
the task.
 Key: SPARK-36070
 URL: https://issues.apache.org/jira/browse/SPARK-36070
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Kent Yao


We have a job with a stage that contains about 8k tasks. Most tasks take about 
1~10 min to finish, but 3 of the tasks run extremely slowly. The root cause is 
most likely a delay in the storage system. On the Spark side, we can record the 
time cost in logs for easier bug hunting and performance tuning.
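
A minimal sketch of the kind of timing log proposed here (illustrative only, not 
the actual patch; `committer.commitTask` and `dataWriter.write` are stand-ins for 
the real write/commit calls):

{code:java}
// Generic helper: run a block and log how long it took.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1000000
  println(s"$label took $elapsedMs ms")  // the real change would go through Spark's logInfo
  result
}

// Hypothetical usage around the slow steps mentioned above:
// timed("Task commit") { committer.commitTask(taskAttemptContext) }
// timed("Write rows out") { while (iter.hasNext) dataWriter.write(iter.next()) }
{code}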



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36070) Add time cost info for writing rows out and committing the task.

2021-07-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-36070:
-
Description: We have a job with a stage that contains about 8k tasks. Most tasks 
take about 1~10 min to finish, but 3 of the tasks run extremely slowly. They take 
about 1 hour each to finish, and their speculative attempts are launched as well. 
The root cause is most likely a delay in the storage system. On the Spark side, we 
can record the time cost in logs for easier bug hunting and performance tuning.  
(was: We have a job with a stage that contains about 8k tasks. Most tasks take 
about 1~10 min to finish, but 3 of the tasks run extremely slowly. The root cause 
is most likely a delay in the storage system. On the Spark side, we can record the 
time cost in logs for easier bug hunting and performance tuning.)

> Add time cost info for writing rows out and committing the task.
> 
>
> Key: SPARK-36070
> URL: https://issues.apache.org/jira/browse/SPARK-36070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> We have a job with a stage that contains about 8k tasks. Most tasks take about 
> 1~10 min to finish, but 3 of the tasks run extremely slowly. They take about 
> 1 hour each to finish, and their speculative attempts are launched as well. The 
> root cause is most likely a delay in the storage system. On the Spark side, we 
> can record the time cost in logs for easier bug hunting and performance tuning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36070) Add time cost info for writing rows out and committing the task.

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36070:


Assignee: Apache Spark

> Add time cost info for writing rows out and committing the task.
> 
>
> Key: SPARK-36070
> URL: https://issues.apache.org/jira/browse/SPARK-36070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> We have a job with a stage that contains about 8k tasks. Most tasks take about 
> 1~10 min to finish, but 3 of the tasks run extremely slowly. They take about 
> 1 hour each to finish, and their speculative attempts are launched as well. The 
> root cause is most likely a delay in the storage system. On the Spark side, we 
> can record the time cost in logs for easier bug hunting and performance tuning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36070) Add time cost info for writing rows out and committing the task.

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377937#comment-17377937
 ] 

Apache Spark commented on SPARK-36070:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/33279

> Add time cost info for writing rows out and committing the task.
> 
>
> Key: SPARK-36070
> URL: https://issues.apache.org/jira/browse/SPARK-36070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> We have a job with a stage that contains about 8k tasks. Most tasks take about 
> 1~10 min to finish, but 3 of the tasks run extremely slowly. They take about 
> 1 hour each to finish, and their speculative attempts are launched as well. The 
> root cause is most likely a delay in the storage system. On the Spark side, we 
> can record the time cost in logs for easier bug hunting and performance tuning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36070) Add time cost info for writing rows out and committing the task.

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36070:


Assignee: (was: Apache Spark)

> Add time cost info for writing rows out and committing the task.
> 
>
> Key: SPARK-36070
> URL: https://issues.apache.org/jira/browse/SPARK-36070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> We have a job with a stage that contains about 8k tasks. Most tasks take about 
> 1~10 min to finish, but 3 of the tasks run extremely slowly. They take about 
> 1 hour each to finish, and their speculative attempts are launched as well. The 
> root cause is most likely a delay in the storage system. On the Spark side, we 
> can record the time cost in logs for easier bug hunting and performance tuning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36070) Add time cost info for writing rows out and committing the task.

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377938#comment-17377938
 ] 

Apache Spark commented on SPARK-36070:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/33279

> Add time cost info for writing rows out and committing the task.
> 
>
> Key: SPARK-36070
> URL: https://issues.apache.org/jira/browse/SPARK-36070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> We have a job with a stage that contains about 8k tasks. Most tasks take about 
> 1~10 min to finish, but 3 of the tasks run extremely slowly. They take about 
> 1 hour each to finish, and their speculative attempts are launched as well. The 
> root cause is most likely a delay in the storage system. On the Spark side, we 
> can record the time cost in logs for easier bug hunting and performance tuning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36068:
-
Fix Version/s: (was: 3.2.0)

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36068:


Assignee: (was: Apache Spark)

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-36068:
--
  Assignee: (was: Kousuke Saruta)

Reverted at 
https://github.com/apache/spark/commit/951e84f1b91fc2ac09b3afbe51bdd68af62d26fb

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36068) No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36068:


Assignee: Apache Spark

> No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly
> --
>
> Key: SPARK-36068
> URL: https://issues.apache.org/jira/browse/SPARK-36068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> No tests in hadoop-cloud are compiled and run unless hadoop-3.2 profile is 
> activated explicitly.
> This issue is similar to SPARK-36067.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36069) spark function from_json should output field name, field type and field value when FAILFAST mode throw exception

2021-07-09 Thread geekyouth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377968#comment-17377968
 ] 

geekyouth commented on SPARK-36069:
---

Here is my unit test output:

 

{code:java}
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
 at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
 at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
 at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.subExpr_0$(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:341)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: java.lang.RuntimeException: Cannot parse 0.31 as double.
 at org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:478)
 at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$parser$3(jsonExpressions.scala:585)
 at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
 ... 20 more
{code}
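
For context, a reproduction sketch of the kind of call that produces a trace like 
the one above (the schema and input here are hypothetical; the trace only tells us 
that a value "0.31" could not be parsed as a double):

{code:java}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
import spark.implicits._

// A string value where the schema expects a double makes FAILFAST throw, but the
// error message never names the offending field or value, which is the point of
// this ticket.
val schema = StructType(Seq(StructField("ratio", DoubleType)))
val df = Seq("""{"ratio": "0.31"}""").toDF("json")
df.select(from_json($"json", schema, Map("mode" -> "FAILFAST")).as("parsed")).show()
{code}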

 

 

> spark function from_json should output field name, field type and field value 
> when FAILFAST mode throw exception
> 
>
> Key: SPARK-36069
> URL: https://issues.apache.org/jira/browse/SPARK-36069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: geekyouth
>Priority: Major
>
> The Spark function from_json outputs an error message when FAILFAST mode 
> throws an exception.
>  
> But the message does not contain important information, for example the field 
> name, field value, or field type.
>  
> This information is very important for developers to find where the bad input 
> data is located.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2021-07-09 Thread shashank (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377990#comment-17377990
 ] 

shashank commented on SPARK-12837:
--

Seeing the same issue on 2.4.3
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 104904 tasks (8.0 GB) is bigger than 
spark.driver.maxResultSize (8.0 GB)
 at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
 at scala.Option.foreach(Option.scala:257)
 at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
 at 
org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
 ... 54 more{code}

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36071) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2021-07-09 Thread shashank (Jira)
shashank created SPARK-36071:


 Summary: Spark driver requires large memory space for serialized 
results even there are no data collected to the driver
 Key: SPARK-36071
 URL: https://issues.apache.org/jira/browse/SPARK-36071
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: shashank


Executing with a large number of partitions causes the data transferred to the 
driver to exceed spark.driver.maxResultSize.

This happens even when the job logic does not collect any data to the driver. It 
looks like Spark is sending metadata back, which is what causes the limit to be 
exceeded.
{code:java}
spark.driver.maxResultSize=8g{code}
 
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 104904 tasks (8.0 GB) is bigger than 
spark.driver.maxResultSize (8.0 GB)Caused by: org.apache.spark.SparkException: 
Job aborted due to stage failure: Total size of serialized results of 104904 
tasks (8.0 GB) is bigger than spark.driver.maxResultSize (8.0 GB) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
 at scala.Option.foreach(Option.scala:257) at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2114) at 
org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
 ... 54 more{code}
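
Not a fix for the underlying behavior, but a sketch of the usual mitigations while 
this is investigated (the limit value, row count, and output path are illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Raise the cap, or set "0" to disable the check entirely (trades driver memory for the error).
  .config("spark.driver.maxResultSize", "16g")
  .getOrCreate()

// Fewer tasks also means fewer serialized task results shipped back to the driver;
// spark.range stands in for the real job's DataFrame.
spark.range(0, 1000000L).coalesce(200).write.mode("overwrite").parquet("/tmp/output")
{code}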



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36071) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2021-07-09 Thread shashank (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shashank updated SPARK-36071:
-
Priority: Critical  (was: Major)

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-36071
> URL: https://issues.apache.org/jira/browse/SPARK-36071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: shashank
>Priority: Critical
>
> Executing with a large number of partitions causes the data transferred to the 
> driver to exceed spark.driver.maxResultSize.
> This happens even when the job logic does not collect any data to the driver. It 
> looks like Spark is sending metadata back, which is what causes the limit to be 
> exceeded.
> {code:java}
> spark.driver.maxResultSize=8g{code}
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 104904 tasks (8.0 GB) is bigger than 
> spark.driver.maxResultSize (8.0 GB)Caused by: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
> of serialized results of 104904 tasks (8.0 GB) is bigger than 
> spark.driver.maxResultSize (8.0 GB) at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
>  at scala.Option.foreach(Option.scala:257) at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2114) at 
> org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
>  ... 54 more{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36072) TO_TIMESTAMP: return different results based on the default timestamp type

2021-07-09 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-36072:
--

 Summary: TO_TIMESTAMP: return different results based on the 
default timestamp type
 Key: SPARK-36072
 URL: https://issues.apache.org/jira/browse/SPARK-36072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


The SQL function TO_TIMESTAMP should return different results based on the 
default timestamp type
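
A sketch of what "different results based on the default timestamp type" would 
look like (the spark.sql.timestampType key and its TIMESTAMP_NTZ/TIMESTAMP_LTZ 
values are assumptions taken from the surrounding TimestampNTZ work, not stated in 
this ticket):

{code:java}
// With the default type set to TIMESTAMP_NTZ, TO_TIMESTAMP would produce a
// timestamp_ntz column; with TIMESTAMP_LTZ, a regular (local-time-zone) timestamp.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
spark.sql("SELECT to_timestamp('2021-07-09 10:00:00') AS ts").printSchema()

spark.conf.set("spark.sql.timestampType", "TIMESTAMP_LTZ")
spark.sql("SELECT to_timestamp('2021-07-09 10:00:00') AS ts").printSchema()
{code}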



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36072) TO_TIMESTAMP: return different results based on the default timestamp type

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378027#comment-17378027
 ] 

Apache Spark commented on SPARK-36072:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/33280

> TO_TIMESTAMP: return different results based on the default timestamp type
> --
>
> Key: SPARK-36072
> URL: https://issues.apache.org/jira/browse/SPARK-36072
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> The SQL function TO_TIMESTAMP should return different results based on the 
> default timestamp type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36072) TO_TIMESTAMP: return different results based on the default timestamp type

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36072:


Assignee: Apache Spark  (was: Gengliang Wang)

> TO_TIMESTAMP: return different results based on the default timestamp type
> --
>
> Key: SPARK-36072
> URL: https://issues.apache.org/jira/browse/SPARK-36072
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> The SQL function TO_TIMESTAMP should return different results based on the 
> default timestamp type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36072) TO_TIMESTAMP: return different results based on the default timestamp type

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36072:


Assignee: Gengliang Wang  (was: Apache Spark)

> TO_TIMESTAMP: return different results based on the default timestamp type
> --
>
> Key: SPARK-36072
> URL: https://issues.apache.org/jira/browse/SPARK-36072
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> The SQL function TO_TIMESTAMP should return different results based on the 
> default timestamp type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36073) SubExpr elimination should include common child exprs of conditional expressions

2021-07-09 Thread Peter Toth (Jira)
Peter Toth created SPARK-36073:
--

 Summary: SubExpr elimination should include common child exprs of 
conditional expressions
 Key: SPARK-36073
 URL: https://issues.apache.org/jira/browse/SPARK-36073
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Peter Toth


SPARK-35410 
(https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112)
 filters out all child expressions, but in some cases that is not necessary.
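
An illustrative case (the query is hypothetical and the reasoning is my reading of 
the ticket): when the same expensive child expression appears in the condition and 
in every branch of a conditional, it is evaluated regardless of which branch is 
taken, so it could still be shared:

{code:java}
// sqrt(id + 1) occurs in the predicate and in both branches, so it is evaluated
// for every row and is a candidate for subexpression elimination even though it
// sits under a CASE WHEN.
spark.range(10).selectExpr(
  "CASE WHEN sqrt(id + 1) > 2 THEN sqrt(id + 1) ELSE -sqrt(id + 1) END AS v"
).show()
{code}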



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36073) SubExpr elimination should include common child exprs of conditional expressions

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378035#comment-17378035
 ] 

Apache Spark commented on SPARK-36073:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/33281

> SubExpr elimination should include common child exprs of conditional 
> expressions
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Minor
>
> SPARK-35410 
> (https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112)
>  filters out all child expressions, but in some cases that is not necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36073) SubExpr elimination should include common child exprs of conditional expressions

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36073:


Assignee: Apache Spark

> SubExpr elimination should include common child exprs of conditional 
> expressions
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-35410 
> (https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112)
>  filters out all child expressions, but in some cases that is not necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36073) SubExpr elimination should include common child exprs of conditional expressions

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36073:


Assignee: (was: Apache Spark)

> SubExpr elimination should include common child exprs of conditional 
> expressions
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Minor
>
> SPARK-35410 
> (https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112)
>  filters out all child expressions, but in some cases that is not necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36073) SubExpr elimination should include common child exprs of conditional expressions

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378036#comment-17378036
 ] 

Apache Spark commented on SPARK-36073:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/33281

> SubExpr elimination should include common child exprs of conditional 
> expressions
> 
>
> Key: SPARK-36073
> URL: https://issues.apache.org/jira/browse/SPARK-36073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Peter Toth
>Priority: Minor
>
> SPARK-35410 
> (https://github.com/apache/spark/commit/9e1b204bcce4a8fe24c1edd8271197277b5017f4#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eR106-R112)
>  filters out all child expressions, but in some cases that is not necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32333) Drop references to Master

2021-07-09 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378093#comment-17378093
 ] 

Thomas Graves commented on SPARK-32333:
---

Getting back to this now that the Spark 3.2 branch is cut, perhaps we can target 
this for 3.3.

From the discussion thread on the spark-dev mailing list, "Leader" was mentioned 
most often, with "Scheduler" second.

One argument against "controller", "coordinator", "application manager", and 
"primary" is that they imply the process is required, whereas running apps are 
unaffected if the standalone master goes down.

Based on that feedback, I propose "Leader", since it was the most popular choice 
and it is short.

> Drop references to Master
> -
>
> Key: SPARK-32333
> URL: https://issues.apache.org/jira/browse/SPARK-32333
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> We have a lot of references to "master" in the code base. It will be 
> beneficial to remove references to problematic language that can alienate 
> potential community members. 
> SPARK-32004 removed references to slave
>  
> Here is a IETF draft to fix up some of the most egregious examples
> (master/slave, whitelist/backlist) with proposed alternatives.
> https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36074) add error class for StructType.findNestedField

2021-07-09 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-36074:
---

 Summary: add error class for StructType.findNestedField
 Key: SPARK-36074
 URL: https://issues.apache.org/jira/browse/SPARK-36074
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36074) add error class for StructType.findNestedField

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36074:


Assignee: Wenchen Fan  (was: Apache Spark)

> add error class for StructType.findNestedField
> --
>
> Key: SPARK-36074
> URL: https://issues.apache.org/jira/browse/SPARK-36074
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36074) add error class for StructType.findNestedField

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378136#comment-17378136
 ] 

Apache Spark commented on SPARK-36074:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33282

> add error class for StructType.findNestedField
> --
>
> Key: SPARK-36074
> URL: https://issues.apache.org/jira/browse/SPARK-36074
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36074) add error class for StructType.findNestedField

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36074:


Assignee: Apache Spark  (was: Wenchen Fan)

> add error class for StructType.findNestedField
> --
>
> Key: SPARK-36074
> URL: https://issues.apache.org/jira/browse/SPARK-36074
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36074) add error class for StructType.findNestedField

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378137#comment-17378137
 ] 

Apache Spark commented on SPARK-36074:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33282

> add error class for StructType.findNestedField
> --
>
> Key: SPARK-36074
> URL: https://issues.apache.org/jira/browse/SPARK-36074
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36018) Some Improvement for Spark Core

2021-07-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36018.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33216
[https://github.com/apache/spark/pull/33216]

> Some Improvement for Spark Core
> ---
>
> Key: SPARK-36018
> URL: https://issues.apache.org/jira/browse/SPARK-36018
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Trivial
> Fix For: 3.3.0
>
>
> I found some code that needs improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36018) Some Improvement for Spark Core

2021-07-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-36018:


Assignee: jiaan.geng

> Some Improvement for Spark Core
> ---
>
> Key: SPARK-36018
> URL: https://issues.apache.org/jira/browse/SPARK-36018
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Trivial
>
> I found some code that needs improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36075) Support for specifying executor/driver node selector

2021-07-09 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-36075:
---

 Summary: Support for specifying executor/driver node selector
 Key: SPARK-36075
 URL: https://issues.apache.org/jira/browse/SPARK-36075
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Yikun Jiang


Currently we can only use "spark.kubernetes.node.selector" to set a node 
selector shared by the executor and driver pods. Sometimes we need to set 
different selectors for the executor and driver pods separately.

We can add the configurations below to support specifying the executor/driver 
node selector independently (a short usage sketch follows the list):

- spark.kubernetes.driver.node.selector.

- spark.kubernetes.executor.node.selector.
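
A hypothetical usage sketch of the proposed keys follows; the node label key 
{{disktype}} and its values are invented for illustration, and the two 
{{*.node.selector.*}} configurations are the ones proposed in this ticket rather 
than existing behavior.

{code:python}
# Hypothetical sketch only: the label key "disktype" and its values are made-up
# examples, the master URL and image are assumed, and the two *.node.selector.*
# keys are the ones proposed in this ticket, not released configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc")           # assumed cluster URL
    .config("spark.kubernetes.container.image", "spark:3.3")  # assumed image
    # schedule the driver pod onto nodes labeled disktype=ssd
    .config("spark.kubernetes.driver.node.selector.disktype", "ssd")
    # schedule executor pods onto nodes labeled disktype=hdd
    .config("spark.kubernetes.executor.node.selector.disktype", "hdd")
    .getOrCreate()
)
{code}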



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36070) Add time cost info for writing rows out and committing the task.

2021-07-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-36070:


Assignee: Kent Yao

> Add time cost info for writing rows out and committing the task.
> 
>
> Key: SPARK-36070
> URL: https://issues.apache.org/jira/browse/SPARK-36070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> We have a job with a stage that contains about 8k tasks. Most tasks take about 
> 1-10 minutes to finish, but 3 of them run extremely slowly: they take about 1 
> hour each to finish, and so do their speculative attempts. The root cause is 
> most likely latency in the storage system. On the Spark side, we can record 
> the time cost in the logs for easier bug hunting and performance tuning.
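
As a generic illustration of the kind of logging this asks for (not the actual 
Spark internals; the {{writer}} object with {{write()}}/{{commit()}} methods is 
a stand-in abstraction):

{code:python}
# Generic illustration of timing the two phases and logging the cost; the writer
# object here is a stand-in with write()/commit() methods, not Spark's classes.
import logging
import time

logger = logging.getLogger("write-metrics")

def write_rows_and_commit(rows, writer):
    start = time.monotonic()
    count = 0
    for row in rows:
        writer.write(row)
        count += 1
    write_secs = time.monotonic() - start

    start = time.monotonic()
    writer.commit()
    commit_secs = time.monotonic() - start

    logger.info("Wrote %d rows in %.3f s, committed task in %.3f s",
                count, write_secs, commit_secs)
{code}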



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36070) Add time cost info for writing rows out and committing the task.

2021-07-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-36070.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33279
[https://github.com/apache/spark/pull/33279]

> Add time cost info for writing rows out and committing the task.
> 
>
> Key: SPARK-36070
> URL: https://issues.apache.org/jira/browse/SPARK-36070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.3.0
>
>
> We have a job with a stage that contains about 8k tasks. Most tasks take about 
> 1-10 minutes to finish, but 3 of them run extremely slowly: they take about 1 
> hour each to finish, and so do their speculative attempts. The root cause is 
> most likely latency in the storage system. On the Spark side, we can record 
> the time cost in the logs for easier bug hunting and performance tuning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36075) Support for specifying executor/driver node selector

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378177#comment-17378177
 ] 

Apache Spark commented on SPARK-36075:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33283

> Support for specifying executor/driver node selector
> -
>
> Key: SPARK-36075
> URL: https://issues.apache.org/jira/browse/SPARK-36075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Currently we can only use "spark.kubernetes.node.selector" to set a node 
> selector shared by the executor and driver pods. Sometimes we need to set 
> different selectors for the executor and driver pods separately.
> We can add the configurations below to support specifying the executor/driver 
> node selector independently:
> - spark.kubernetes.driver.node.selector.
> - spark.kubernetes.executor.node.selector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36075) Support for specifying executor/driver node selector

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378178#comment-17378178
 ] 

Apache Spark commented on SPARK-36075:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33283

> Support for specifying executor/driver node selector
> -
>
> Key: SPARK-36075
> URL: https://issues.apache.org/jira/browse/SPARK-36075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Currently we can only use "spark.kubernetes.node.selector" to set a node 
> selector shared by the executor and driver pods. Sometimes we need to set 
> different selectors for the executor and driver pods separately.
> We can add the configurations below to support specifying the executor/driver 
> node selector independently:
> - spark.kubernetes.driver.node.selector.
> - spark.kubernetes.executor.node.selector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36075) Support for specifying executor/driver node selector

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36075:


Assignee: (was: Apache Spark)

> Support for specifying executor/driver node selector
> -
>
> Key: SPARK-36075
> URL: https://issues.apache.org/jira/browse/SPARK-36075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Currently we can only use "spark.kubernetes.node.selector" to set a node 
> selector shared by the executor and driver pods. Sometimes we need to set 
> different selectors for the executor and driver pods separately.
> We can add the configurations below to support specifying the executor/driver 
> node selector independently:
> - spark.kubernetes.driver.node.selector.
> - spark.kubernetes.executor.node.selector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36075) Support for specifying executor/driver node selector

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36075:


Assignee: Apache Spark

> Support for specifying executor/driver node selector
> -
>
> Key: SPARK-36075
> URL: https://issues.apache.org/jira/browse/SPARK-36075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> Currently we can only use "spark.kubernetes.node.selector" to set a node 
> selector shared by the executor and driver pods. Sometimes we need to set 
> different selectors for the executor and driver pods separately.
> We can add the configurations below to support specifying the executor/driver 
> node selector independently:
> - spark.kubernetes.driver.node.selector.
> - spark.kubernetes.executor.node.selector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36076) [SQL] ArrayIndexOutOfBounds in CAST string to date

2021-07-09 Thread Andy Grove (Jira)
Andy Grove created SPARK-36076:
--

 Summary: [SQL] ArrayIndexOutOfBounds in CAST string to date
 Key: SPARK-36076
 URL: https://issues.apache.org/jira/browse/SPARK-36076
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Andy Grove


{code:java}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.scala> 
spark.conf.set("spark.rapids.sql.enabled", "false")scala> val df = 
Seq(":8:434421+ 98:38").toDF("c0")
df: org.apache.spark.sql.DataFrame = [c0: string]scala> val df2 = 
df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
:25: error: not found: value DataTypes
   val df2 = df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
^scala> import 
org.spark.sql.types.DataTypes
:23: error: object spark is not a member of package org
   import org.spark.sql.types.DataTypes
  ^scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypesscala> val df2 = df.withColumn("c1", 
col("c0").cast(DataTypes.TimestampType))
df2: org.apache.spark.sql.DataFrame = [c0: string, c1: timestamp]scala> df2.show
java.lang.ArrayIndexOutOfBoundsException: 9
  at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:328)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$2(Cast.scala:455)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$1(Cast.scala:451)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:840)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36076) [SQL] ArrayIndexOutOfBounds in CAST string to date

2021-07-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated SPARK-36076:
---
Description: 
I discovered this bug during some fuzz testing.
{code:java}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.scala> 

scala> import org.apache.spark.sql.types.DataTypes

scala> val df = Seq(":8:434421+ 98:38").toDF("c0")
df: org.apache.spark.sql.DataFrame = [c0: string]

scala> val df2 = df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
df2: org.apache.spark.sql.DataFrame = [c0: string, c1: timestamp]

scala> df2.show
java.lang.ArrayIndexOutOfBoundsException: 9
  at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:328)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$2(Cast.scala:455)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$1(Cast.scala:451)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:840)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
 {code}

  was:
{code:java}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.scala> 

scala> import org.apache.spark.sql.types.DataTypes

scala> val df = Seq(":8:434421+ 98:38").toDF("c0")
df: org.apache.spark.sql.DataFrame = [c0: string]

scala> val df2 = df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
df2: org.apache.spark.sql.DataFrame = [c0: string, c1: timestamp]

scala> df2.show
java.lang.ArrayIndexOutOfBoundsException: 9
  at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:328)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$2(Cast.scala:455)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$1(Cast.scala:451)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:840)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
 {code}


> [SQL] ArrayIndexOutOfBounds in CAST string to date
> --
>
> Key: SPARK-36076
> URL: https://issues.apache.org/jira/browse/SPARK-36076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> I discovered this bug during some fuzz testing.
> {code:java}
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.1.1
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.scala> 
> scala> import org.apache.spark.sql.types.DataTypes
> scala> val df = Seq(":8:434421+ 98:38").toDF("c0")
> df: org.apache.spark.sql.DataFrame = [c0: string]
> scala> val df2 = df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
> df2: org.apache.spark.sql.DataFrame = [c0: string, c1: timestamp]
> scala> df2.show
> java.lang.ArrayIndexOutOfBoundsException: 9
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$2(Cast.scala:455)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$1(Cast.scala:451)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:840)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36076) [SQL] ArrayIndexOutOfBounds in CAST string to date

2021-07-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated SPARK-36076:
---
Description: 
{code:java}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.scala> 

scala> import org.apache.spark.sql.types.DataTypes

scala> val df = Seq(":8:434421+ 98:38").toDF("c0")
df: org.apache.spark.sql.DataFrame = [c0: string]

scala> val df2 = df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
df2: org.apache.spark.sql.DataFrame = [c0: string, c1: timestamp]

scala> df2.show
java.lang.ArrayIndexOutOfBoundsException: 9
  at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:328)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$2(Cast.scala:455)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$1(Cast.scala:451)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:840)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
 {code}

  was:
{code:java}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.scala> 
spark.conf.set("spark.rapids.sql.enabled", "false")scala> val df = 
Seq(":8:434421+ 98:38").toDF("c0")
df: org.apache.spark.sql.DataFrame = [c0: string]scala> val df2 = 
df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
:25: error: not found: value DataTypes
   val df2 = df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
^scala> import 
org.spark.sql.types.DataTypes
:23: error: object spark is not a member of package org
   import org.spark.sql.types.DataTypes
  ^scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypesscala> val df2 = df.withColumn("c1", 
col("c0").cast(DataTypes.TimestampType))
df2: org.apache.spark.sql.DataFrame = [c0: string, c1: timestamp]scala> df2.show
java.lang.ArrayIndexOutOfBoundsException: 9
  at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:328)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$2(Cast.scala:455)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$1(Cast.scala:451)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:840)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
 {code}


> [SQL] ArrayIndexOutOfBounds in CAST string to date
> --
>
> Key: SPARK-36076
> URL: https://issues.apache.org/jira/browse/SPARK-36076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> {code:java}
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.1.1
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.scala> 
> scala> import org.apache.spark.sql.types.DataTypes
> scala> val df = Seq(":8:434421+ 98:38").toDF("c0")
> df: org.apache.spark.sql.DataFrame = [c0: string]
> scala> val df2 = df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
> df2: org.apache.spark.sql.DataFrame = [c0: string, c1: timestamp]
> scala> df2.show
> java.lang.ArrayIndexOutOfBoundsException: 9
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$2(Cast.scala:455)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$1(Cast.scala:451)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:840)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Updated] (SPARK-36076) [SQL] ArrayIndexOutOfBounds in CAST string to timestamp

2021-07-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated SPARK-36076:
---
Summary: [SQL] ArrayIndexOutOfBounds in CAST string to timestamp  (was: 
[SQL] ArrayIndexOutOfBounds in CAST string to date)

> [SQL] ArrayIndexOutOfBounds in CAST string to timestamp
> ---
>
> Key: SPARK-36076
> URL: https://issues.apache.org/jira/browse/SPARK-36076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> I discovered this bug during some fuzz testing.
> {code:java}
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.1.1
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.scala> 
> scala> import org.apache.spark.sql.types.DataTypes
> scala> val df = Seq(":8:434421+ 98:38").toDF("c0")
> df: org.apache.spark.sql.DataFrame = [c0: string]
> scala> val df2 = df.withColumn("c1", col("c0").cast(DataTypes.TimestampType))
> df2: org.apache.spark.sql.DataFrame = [c0: string, c1: timestamp]
> scala> df2.show
> java.lang.ArrayIndexOutOfBoundsException: 9
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$2(Cast.scala:455)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:295)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToTimestamp$1(Cast.scala:451)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:840)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36077) Support numpy literals as input for pandas-on-Spark APIs

2021-07-09 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36077:


 Summary: Support numpy literals as input for pandas-on-Spark APIs 
 Key: SPARK-36077
 URL: https://issues.apache.org/jira/browse/SPARK-36077
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these 
column-related APIs don't support numpy literals, so numpy literals are 
disallowed as input (e.g. the {{to_replace}} parameter of the {{replace}} API).

The {{isin}} method has already been adjusted in 
[https://github.com/apache/spark/pull/32955]. We ought to adjust the other APIs 
to support numpy literals as well.
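
A short sketch of the kind of calls this covers; on Spark versions without the 
change, the {{replace}} call may fail because the numpy scalars are rejected by 
the underlying column APIs.

{code:python}
# Sketch of calls that should accept numpy literals; the replace() call below
# may raise an error on versions where numpy literals are not yet supported.
import numpy as np
import pyspark.pandas as ps

psser = ps.Series([1, 2, 3])

# numpy scalars as the to_replace/value arguments of replace()
print(psser.replace(np.int64(1), np.int64(10)))

# isin() already handles numpy values after the pull request referenced above
print(psser.isin([np.int64(2), np.int64(3)]))
{code}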



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36063) Optimize OneRowRelation subqueries

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378227#comment-17378227
 ] 

Apache Spark commented on SPARK-36063:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/33284

> Optimize OneRowRelation subqueries
> --
>
> Key: SPARK-36063
> URL: https://issues.apache.org/jira/browse/SPARK-36063
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Inline subqueries with OneRowRelation as leaf nodes instead of decorrelating 
> and rewriting them as left outer joins.
> Scalar subquery:
>  ```
>  SELECT (SELECT c1) FROM t1 -> SELECT c1 FROM t1
>  ```
> Lateral subquery:
>  ```
>  SELECT * FROM t1, LATERAL (SELECT c1, c2) -> SELECT c1, c2 , c1, c2 FROM t1
>  ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36078) Complete mappings between numpy literals and Spark data types

2021-07-09 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36078:


 Summary: Complete mappings between numpy literals and Spark data 
types
 Key: SPARK-36078
 URL: https://issues.apache.org/jira/browse/SPARK-36078
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


In [https://github.com/apache/spark/pull/32955], the {{lit}} function defined 
in {{pyspark.pandas.spark.functions}} was adjusted to accept numpy literals as 
input.

However, the mapping between numpy literals and Spark data types is not 
complete.

We ought to fill the gap.
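
A sketch of what a fuller mapping could look like; the table below is my own 
illustration of the idea, not the mapping that pyspark.pandas actually ships.

{code:python}
# Illustrative mapping from numpy scalar types to Spark SQL data types.
import numpy as np
from pyspark.sql import types as T

NUMPY_TO_SPARK = {
    np.int8: T.ByteType(),
    np.int16: T.ShortType(),
    np.int32: T.IntegerType(),
    np.int64: T.LongType(),
    np.float32: T.FloatType(),
    np.float64: T.DoubleType(),
    np.bool_: T.BooleanType(),
    np.str_: T.StringType(),
}

def spark_type_for(value):
    """Return the Spark data type for a numpy scalar covered by the mapping."""
    for np_type, spark_type in NUMPY_TO_SPARK.items():
        if isinstance(value, np_type):
            return spark_type
    raise TypeError(f"no Spark type mapping for {type(value)}")

print(spark_type_for(np.int32(7)))      # IntegerType()
print(spark_type_for(np.float64(1.5)))  # DoubleType()
{code}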



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36063) Optimize OneRowRelation subqueries

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378228#comment-17378228
 ] 

Apache Spark commented on SPARK-36063:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/33284

> Optimize OneRowRelation subqueries
> --
>
> Key: SPARK-36063
> URL: https://issues.apache.org/jira/browse/SPARK-36063
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Inline subqueries with OneRowRelation as leaf nodes instead of decorrelating 
> and rewriting them as left outer joins.
> Scalar subquery:
>  ```
>  SELECT (SELECT c1) FROM t1 -> SELECT c1 FROM t1
>  ```
> Lateral subquery:
>  ```
>  SELECT * FROM t1, LATERAL (SELECT c1, c2) -> SELECT c1, c2 , c1, c2 FROM t1
>  ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36063) Optimize OneRowRelation subqueries

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36063:


Assignee: (was: Apache Spark)

> Optimize OneRowRelation subqueries
> --
>
> Key: SPARK-36063
> URL: https://issues.apache.org/jira/browse/SPARK-36063
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Inline subqueries with OneRowRelation as leaf nodes instead of decorrelating 
> and rewriting them as left outer joins.
> Scalar subquery:
>  ```
>  SELECT (SELECT c1) FROM t1 -> SELECT c1 FROM t1
>  ```
> Lateral subquery:
>  ```
>  SELECT * FROM t1, LATERAL (SELECT c1, c2) -> SELECT c1, c2 , c1, c2 FROM t1
>  ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36063) Optimize OneRowRelation subqueries

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36063:


Assignee: Apache Spark

> Optimize OneRowRelation subqueries
> --
>
> Key: SPARK-36063
> URL: https://issues.apache.org/jira/browse/SPARK-36063
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> Inline subqueries with OneRowRelation as leaf nodes instead of decorrelating 
> and rewriting them as left outer joins.
> Scalar subquery:
>  ```
>  SELECT (SELECT c1) FROM t1 -> SELECT c1 FROM t1
>  ```
> Lateral subquery:
>  ```
>  SELECT * FROM t1, LATERAL (SELECT c1, c2) -> SELECT c1, c2 , c1, c2 FROM t1
>  ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35866) Improve error message quality

2021-07-09 Thread Karen Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Feng updated SPARK-35866:
---
Description: 
In the SPIP: Standardize Exception Messages in Spark, there are three major 
improvements proposed:
 # Group error messages in dedicated files: SPARK-33539
 # Establish an error message guideline for developers SPARK-35140
 # Improve error message quality

Based on the guideline, we can start improving the error messages in the 
dedicated files. To make auditing easy, we should use the 
[SparkThrowable|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/SparkThrowable.java]
 framework; then, the error messages can be centralized in a [single JSON 
file|https://github.com/apache/spark/blob/master/core/src/main/resources/error/error-classes.json].

  was:
In the SPIP: Standardize Exception Messages in Spark, there are three major 
improvements proposed:
 # Group error messages in dedicated files: 
[SPARK-33539|https://issues.apache.org/jira/browse/SPARK-33539]
 # Establish an error message guideline for developers 
[SPARK-35140|https://issues.apache.org/jira/browse/SPARK-35140]
 # Improve error message quality

Based on the guideline, we can start improving the error messages in the 
dedicated files.


> Improve error message quality
> -
>
> Key: SPARK-35866
> URL: https://issues.apache.org/jira/browse/SPARK-35866
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, there are three major 
> improvements proposed:
>  # Group error messages in dedicated files: SPARK-33539
>  # Establish an error message guideline for developers SPARK-35140
>  # Improve error message quality
> Based on the guideline, we can start improving the error messages in the 
> dedicated files. To make auditing easy, we should use the 
> [SparkThrowable|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/SparkThrowable.java]
>  framework; then, the error messages can be centralized in a [single JSON 
> file|https://github.com/apache/spark/blob/master/core/src/main/resources/error/error-classes.json].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36003) Implement unary operator `invert` of integral ps.Series/Index

2021-07-09 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36003:
-
Summary: Implement unary operator `invert` of integral ps.Series/Index  
(was: Implement unary operator `invert` of numeric ps.Series/Index)

> Implement unary operator `invert` of integral ps.Series/Index
> -
>
> Key: SPARK-36003
> URL: https://issues.apache.org/jira/browse/SPARK-36003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> >>> ~ps.Series([1, 2, 3])
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve '(NOT `0`)' due to data 
> type mismatch: argument 1 requires boolean type, however, '`0`' is of bigint 
> type.;
> 'Project [unresolvedalias(NOT 0#1L, 
> Some(org.apache.spark.sql.Column$$Lambda$1365/2097273578@53165e1))]
> +- Project [__index_level_0__#0L, 0#1L, monotonically_increasing_id() AS 
> __natural_order__#4L]
>  +- LogicalRDD [__index_level_0__#0L, 0#1L], false
> {code}
>  
>  Currently, unary operator `invert` of numeric ps.Series/Index is not 
> supported. We ought to implement that following pandas' behaviors.
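
For reference, a small sketch of the pandas behavior being targeted; on versions 
without the fix, the pandas-on-Spark line raises the AnalysisException shown 
above.

{code:python}
# Target behavior sketch: pandas applies bitwise NOT to integer Series, and
# pandas-on-Spark should follow; without the fix the ps.Series call fails.
import pandas as pd
import pyspark.pandas as ps

print(~pd.Series([1, 2, 3]))   # pandas: -2, -3, -4 (bitwise NOT)

print(~ps.Series([1, 2, 3]))   # should match pandas once `invert` is implemented
{code}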



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36003) Implement unary operator `invert` of integral ps.Series/Index

2021-07-09 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36003:
-
Description: 
 
{code:java}
>>> ~ps.Series([1, 2, 3])
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(NOT `0`)' due to data 
type mismatch: argument 1 requires boolean type, however, '`0`' is of bigint 
type.;
'Project [unresolvedalias(NOT 0#1L, 
Some(org.apache.spark.sql.Column$$Lambda$1365/2097273578@53165e1))]
+- Project [__index_level_0__#0L, 0#1L, monotonically_increasing_id() AS 
__natural_order__#4L]
 +- LogicalRDD [__index_level_0__#0L, 0#1L], false
{code}
 

 Currently, unary operator `invert` of integral ps.Series/Index is not 
supported. We ought to implement that following pandas' behaviors.

  was:
 
{code:java}
>>> ~ps.Series([1, 2, 3])
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(NOT `0`)' due to data 
type mismatch: argument 1 requires boolean type, however, '`0`' is of bigint 
type.;
'Project [unresolvedalias(NOT 0#1L, 
Some(org.apache.spark.sql.Column$$Lambda$1365/2097273578@53165e1))]
+- Project [__index_level_0__#0L, 0#1L, monotonically_increasing_id() AS 
__natural_order__#4L]
 +- LogicalRDD [__index_level_0__#0L, 0#1L], false
{code}
 

 Currently, unary operator `invert` of numeric ps.Series/Index is not 
supported. We ought to implement that following pandas' behaviors.


> Implement unary operator `invert` of integral ps.Series/Index
> -
>
> Key: SPARK-36003
> URL: https://issues.apache.org/jira/browse/SPARK-36003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> >>> ~ps.Series([1, 2, 3])
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve '(NOT `0`)' due to data 
> type mismatch: argument 1 requires boolean type, however, '`0`' is of bigint 
> type.;
> 'Project [unresolvedalias(NOT 0#1L, 
> Some(org.apache.spark.sql.Column$$Lambda$1365/2097273578@53165e1))]
> +- Project [__index_level_0__#0L, 0#1L, monotonically_increasing_id() AS 
> __natural_order__#4L]
>  +- LogicalRDD [__index_level_0__#0L, 0#1L], false
> {code}
>  
>  Currently, unary operator `invert` of integral ps.Series/Index is not 
> supported. We ought to implement that following pandas' behaviors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36003) Implement unary operator `invert` of integral ps.Series/Index

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378343#comment-17378343
 ] 

Apache Spark commented on SPARK-36003:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33285

> Implement unary operator `invert` of integral ps.Series/Index
> -
>
> Key: SPARK-36003
> URL: https://issues.apache.org/jira/browse/SPARK-36003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> >>> ~ps.Series([1, 2, 3])
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve '(NOT `0`)' due to data 
> type mismatch: argument 1 requires boolean type, however, '`0`' is of bigint 
> type.;
> 'Project [unresolvedalias(NOT 0#1L, 
> Some(org.apache.spark.sql.Column$$Lambda$1365/2097273578@53165e1))]
> +- Project [__index_level_0__#0L, 0#1L, monotonically_increasing_id() AS 
> __natural_order__#4L]
>  +- LogicalRDD [__index_level_0__#0L, 0#1L], false
> {code}
>  
>  Currently, unary operator `invert` of integral ps.Series/Index is not 
> supported. We ought to implement that following pandas' behaviors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36003) Implement unary operator `invert` of integral ps.Series/Index

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36003:


Assignee: Apache Spark

> Implement unary operator `invert` of integral ps.Series/Index
> -
>
> Key: SPARK-36003
> URL: https://issues.apache.org/jira/browse/SPARK-36003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
>  
> {code:java}
> >>> ~ps.Series([1, 2, 3])
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve '(NOT `0`)' due to data 
> type mismatch: argument 1 requires boolean type, however, '`0`' is of bigint 
> type.;
> 'Project [unresolvedalias(NOT 0#1L, 
> Some(org.apache.spark.sql.Column$$Lambda$1365/2097273578@53165e1))]
> +- Project [__index_level_0__#0L, 0#1L, monotonically_increasing_id() AS 
> __natural_order__#4L]
>  +- LogicalRDD [__index_level_0__#0L, 0#1L], false
> {code}
>  
>  Currently, unary operator `invert` of integral ps.Series/Index is not 
> supported. We ought to implement that following pandas' behaviors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36003) Implement unary operator `invert` of integral ps.Series/Index

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36003:


Assignee: (was: Apache Spark)

> Implement unary operator `invert` of integral ps.Series/Index
> -
>
> Key: SPARK-36003
> URL: https://issues.apache.org/jira/browse/SPARK-36003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> >>> ~ps.Series([1, 2, 3])
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve '(NOT `0`)' due to data 
> type mismatch: argument 1 requires boolean type, however, '`0`' is of bigint 
> type.;
> 'Project [unresolvedalias(NOT 0#1L, 
> Some(org.apache.spark.sql.Column$$Lambda$1365/2097273578@53165e1))]
> +- Project [__index_level_0__#0L, 0#1L, monotonically_increasing_id() AS 
> __natural_order__#4L]
>  +- LogicalRDD [__index_level_0__#0L, 0#1L], false
> {code}
>  
>  Currently, unary operator `invert` of integral ps.Series/Index is not 
> supported. We ought to implement that following pandas' behaviors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36079) Filter estimate should always be non-negative

2021-07-09 Thread Karen Feng (Jira)
Karen Feng created SPARK-36079:
--

 Summary: Filter estimate should always be non-negative
 Key: SPARK-36079
 URL: https://issues.apache.org/jira/browse/SPARK-36079
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Karen Feng


It's possible for a column's statistics to have a higher `nullCount` than the 
table's `rowCount`. In this case, the filter estimates come back outside of the 
reasonable range (between 0 and 1).
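
A small sketch of the invariant (my own illustration, not Spark's actual 
estimation code): whatever the statistics say, a filter selectivity should be 
clamped into [0, 1] so downstream row-count estimates never go negative.

{code:python}
# Sketch of the invariant: clamp selectivity into [0, 1] even when column stats
# are inconsistent with the table's row count.
def not_null_selectivity(row_count: int, null_count: int) -> float:
    """Estimated fraction of rows kept by an IS NOT NULL filter."""
    if row_count <= 0:
        return 1.0
    raw = (row_count - null_count) / row_count  # negative when null_count > row_count
    return min(1.0, max(0.0, raw))              # clamp into the valid range

# Inconsistent stats: more nulls recorded than rows in the table.
print(not_null_selectivity(row_count=100, null_count=150))  # 0.0 rather than -0.5
{code}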



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36079) Filter estimate should always be non-negative

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36079:


Assignee: (was: Apache Spark)

> Filter estimate should always be non-negative
> -
>
> Key: SPARK-36079
> URL: https://issues.apache.org/jira/browse/SPARK-36079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> It's possible for a column's statistics to have a higher `nullCount` than the 
> table's `rowCount`. In this case, the filter estimates come back outside of 
> the reasonable range (between 0 and 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36079) Filter estimate should always be non-negative

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378345#comment-17378345
 ] 

Apache Spark commented on SPARK-36079:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/33286

> Filter estimate should always be non-negative
> -
>
> Key: SPARK-36079
> URL: https://issues.apache.org/jira/browse/SPARK-36079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> It's possible for a column's statistics to have a higher `nullCount` than the 
> table's `rowCount`. In this case, the filter estimates come back outside of 
> the reasonable range (between 0 and 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36079) Filter estimate should always be non-negative

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36079:


Assignee: Apache Spark

> Filter estimate should always be non-negative
> -
>
> Key: SPARK-36079
> URL: https://issues.apache.org/jira/browse/SPARK-36079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Assignee: Apache Spark
>Priority: Major
>
> It's possible for a column's statistics to have a higher `nullCount` than the 
> table's `rowCount`. In this case, the filter estimates come back outside of 
> the reasonable range (between 0 and 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36079) Filter estimate should always be non-negative

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378346#comment-17378346
 ] 

Apache Spark commented on SPARK-36079:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/33286

> Filter estimate should always be non-negative
> -
>
> Key: SPARK-36079
> URL: https://issues.apache.org/jira/browse/SPARK-36079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> It's possible for a column's statistics to have a higher `nullCount` than the 
> table's `rowCount`. In this case, the filter estimates come back outside of 
> the reasonable range (between 0 and 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36079) Null-based filter estimates should always be non-negative

2021-07-09 Thread Karen Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Feng updated SPARK-36079:
---
Summary: Null-based filter estimates should always be non-negative  (was: 
Filter estimate should always be non-negative)

> Null-based filter estimates should always be non-negative
> -
>
> Key: SPARK-36079
> URL: https://issues.apache.org/jira/browse/SPARK-36079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> It's possible for a column's statistics to have a higher `nullCount` than the 
> table's `rowCount`. In this case, the filter estimates come back outside of 
> the reasonable range (between 0 and 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36080) Broadcast join outer join stream side

2021-07-09 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36080:
---

 Summary: Broadcast join outer join stream side
 Key: SPARK-36080
 URL: https://issues.apache.org/jira/browse/SPARK-36080
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36066) UTF8String trimAll() only can trim space but not ({@literal <=} ASCII 32)

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36066:


Assignee: (was: Apache Spark)

> UTF8String trimAll() only can trim space but not ({@literal <=} ASCII 32)
> -
>
> Key: SPARK-36066
> URL: https://issues.apache.org/jira/browse/SPARK-36066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.1.2
>Reporter: liukai
>Priority: Major
>
> In this method, Character.isWhitespace() is used for the check, but 
> Character.isWhitespace() does not match the method's definition of trimming 
> all characters whose code point is {@literal <=} ASCII 32.
>  
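
To illustrate the mismatch, here is a plain-Python illustration (not the 
UTF8String code itself) of characters that fall at or below ASCII 32 but are not 
classified as whitespace:

{code:python}
# Illustration of the gap: several control characters with code points <= 32 are
# not whitespace, so a whitespace-based trim check misses them.
def should_trim(ch: str) -> bool:
    return ord(ch) <= 32   # what trimAll() is expected to trim

for ch in ["\x00", "\x07", "\t", " "]:
    print(repr(ch), "isspace:", ch.isspace(), "<= ASCII 32:", should_trim(ch))
# "\x00" and "\x07" are <= ASCII 32 but not whitespace, so they would be missed.
{code}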



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36066) UTF8String trimAll() only can trim space but not ({@literal <=} ASCII 32)

2021-07-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36066:


Assignee: Apache Spark

> UTF8String trimAll() only can trim space but not ({@literal <=} ASCII 32)
> -
>
> Key: SPARK-36066
> URL: https://issues.apache.org/jira/browse/SPARK-36066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.1.2
>Reporter: liukai
>Assignee: Apache Spark
>Priority: Major
>
> In this method, Character.isWhitespace() is used for the check, but 
> Character.isWhitespace() does not match the method's definition of trimming 
> all characters whose code point is {@literal <=} ASCII 32.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36066) UTF8String trimAll() only can trim space but not ({@literal <=} ASCII 32)

2021-07-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378405#comment-17378405
 ] 

Apache Spark commented on SPARK-36066:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33287

> UTF8String trimAll() only can trim space but not ({@literal <=} ASCII 32)
> -
>
> Key: SPARK-36066
> URL: https://issues.apache.org/jira/browse/SPARK-36066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.1.2
>Reporter: liukai
>Priority: Major
>
> In this method, Character.isWhitespace() is used for the check, but 
> Character.isWhitespace() does not match the method's definition of trimming 
> all characters whose code point is {@literal <=} ASCII 32.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org