[jira] [Created] (SPARK-42648) Upgrade versions-maven-plugin to 2.15.0
Yang Jie created SPARK-42648:
Summary: Upgrade versions-maven-plugin to 2.15.0
Key: SPARK-42648
URL: https://issues.apache.org/jira/browse/SPARK-42648
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie

https://github.com/mojohaus/versions/releases/tag/2.15.0

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42647: Assignee: (was: Apache Spark)

> Remove aliases from deprecated numpy data types
> ---
>
> Key: SPARK-42647
> URL: https://issues.apache.org/jira/browse/SPARK-42647
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.4.0, 3.4.1
> Reporter: Aimilios Tsouvelekakis
> Priority: Major
>
> Numpy has started changing the aliases of some of its data types. This means that users with the latest version of numpy will face either warnings or errors, depending on the type they are using. This affects all users on numpy > 1.20.0. One of the types was fixed back in September with this [pull request|https://github.com/apache/spark/pull/37817].
> The problem can be split into two cases:
> [numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed. As of numpy 1.25.0 they give a warning.
> [numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using aliases of builtin types like np.int is deprecated since numpy 1.20.0 and removed since numpy 1.24.0.
> The changes are needed so pyspark can be compatible with the latest numpy and avoid
> * attribute errors on data types deprecated in version 1.20.0: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> * warnings on data types deprecated in version 1.24.0: [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations]
>
> From my research I see the following: the only functional changes are in the conversion.py file. The rest of the changes are inside tests, in the user guide, or in docstrings describing specific functions. Since I am not an expert in these tests, I defer to the reviewers and to people with more experience in the pyspark code.
> These types are aliases for classic python types, so they should work with all numpy versions [1|https://numpy.org/devdocs/release/1.20.0-notes.html], [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python]. The error or warning comes from the call to numpy.
>
> For the affected versions I chose to include 3.3 and onwards, but I see that 3.2 is also still in the 18-month maintenance cadence, as it was released in October 2021.
>
> The pull request: [https://github.com/apache/spark/pull/40220]
> Best Regards,
> Aimilios
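The alias deprecations described in this issue can be illustrated with a short sketch. The replacement mapping below is an assumption drawn from the numpy 1.20.0 and 1.24.0 release notes linked above (not from the Spark patch itself), and the array construction shows why the fix is a plain rename: the builtin types work on every numpy version.

```python
import numpy as np

# Deprecated alias -> replacement, per the numpy 1.20.0 and 1.24.0
# release notes: builtin-type aliases map to plain Python builtins,
# the 0-bit-size aliases map to the trailing-underscore scalar types.
replacements = {
    "np.int": int, "np.float": float, "np.bool": bool,
    "np.object": object, "np.str": str,
    "np.int0": np.intp, "np.uint0": np.uintp,
    "np.bool8": np.bool_, "np.object0": np.object_,
    "np.str0": np.str_, "np.bytes0": np.bytes_, "np.void0": np.void,
}

# Using the builtin type directly works across numpy versions,
# whereas np.int raises AttributeError on numpy >= 1.24.0:
a = np.array([1, 2, 3], dtype=int)
print(a.dtype)  # platform-dependent integer dtype
```

The same one-line substitution is what the linked pull requests apply throughout the pyspark code and docstrings.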
[jira] [Commented] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695520#comment-17695520 ] Apache Spark commented on SPARK-42647: -- User 'aimtsou' has created a pull request for this issue: https://github.com/apache/spark/pull/40220
[jira] [Assigned] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42647: Assignee: Apache Spark
[jira] [Resolved] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42639. --- Fix Version/s: 3.4.1 Resolution: Fixed Issue resolved by pull request 40242 [https://github.com/apache/spark/pull/40242]

> Add createDataFrame/createDataset to SparkSession
> ---
>
> Key: SPARK-42639
> URL: https://issues.apache.org/jira/browse/SPARK-42639
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Herman van Hövell
> Assignee: Herman van Hövell
> Priority: Major
> Fix For: 3.4.1
>
> Add createDataFrame/createDataset to SparkSession
[jira] [Updated] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aimilios Tsouvelekakis updated SPARK-42647: --- Description: Numpy has started changing the aliases of some of its data types. This means that users with the latest version of numpy will face either warnings or errors, depending on the type they are using. This affects all users on numpy > 1.20.0. One of the types was fixed back in September with this [pull request|https://github.com/apache/spark/pull/37817].
The problem can be split into two cases:
[numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed. As of numpy 1.25.0 they give a warning.
[numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using aliases of builtin types like np.int is deprecated since numpy 1.20.0 and removed since numpy 1.24.0.
The changes are needed so pyspark can be compatible with the latest numpy and avoid
* attribute errors on data types deprecated in version 1.20.0: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
* warnings on data types deprecated in version 1.24.0: [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations]
From my research I see the following: the only functional changes are in the conversion.py file. The rest of the changes are inside tests, in the user guide, or in docstrings describing specific functions. Since I am not an expert in these tests, I defer to the reviewers and to people with more experience in the pyspark code.
These types are aliases for classic python types, so they should work with all numpy versions [1|https://numpy.org/devdocs/release/1.20.0-notes.html], [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python]. The error or warning comes from the call to numpy.
For the affected versions I chose to include 3.3 and onwards, but I see that 3.2 is also still in the 18-month maintenance cadence, as it was released in October 2021.
The pull request: [https://github.com/apache/spark/pull/40220]
Best Regards,
Aimilios
[jira] [Created] (SPARK-42647) Remove aliases from deprecated numpy data types
Aimilios Tsouvelekakis created SPARK-42647: -- Summary: Remove aliases from deprecated numpy data types Key: SPARK-42647 URL: https://issues.apache.org/jira/browse/SPARK-42647 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.4.0, 3.4.1 Reporter: Aimilios Tsouvelekakis
[jira] [Commented] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695498#comment-17695498 ] Dongjoon Hyun commented on SPARK-42553: --- Since RC2 tag is created, I changed the Fixed Version from 3.4.0 to 3.4.1 for now. We can adjust it later according to the RC2 result. > NonReserved keyword "interval" can't be column name > --- > > Key: SPARK-42553 > URL: https://issues.apache.org/jira/browse/SPARK-42553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.2.3, 3.3.2 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Assignee: jiang13021 >Priority: Major > Fix For: 3.4.1 > > > INTERVAL is a Non-Reserved keyword in spark. "Non-Reserved keywords" have a > special meaning in particular contexts and can be used as identifiers in > other contexts. So by design, interval can be used as a column name. > {code:java} > scala> spark.sql("select interval from mytable") > org.apache.spark.sql.catalyst.parser.ParseException: > at least one time unit should be given for interval literal(line 1, pos 7)== > SQL == > select interval from mytable > ---^^^ at > org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$parseIntervalLiteral$1(AstBuilder.scala:2481) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.parseIntervalLiteral(AstBuilder.scala:2466) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitInterval$1(AstBuilder.scala:2432) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:2431) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalContext.accept(SqlBaseParser.java:17308) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitIntervalLiteral(SqlBaseBaseVisitor.java:1581) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalLiteralContext.accept(SqlBaseParser.java:16929) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitConstantDefault(SqlBaseBaseVisitor.java:1511) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:15905) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitValueExpressionDefault(SqlBaseBaseVisitor.java:1392) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:15298) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPredicated$1(AstBuilder.scala:1548) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:1547) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$PredicatedContext.accept(SqlBaseParser.java:14745) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > 
org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitExpression(SqlBaseBaseVisitor.java:1343) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExpressionContext.accept(SqlBaseParser.java:14606) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitNamedExpression$1(AstBuilder.scala:1434) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:1433) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:57) > at >
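Until the parser fix lands, the usual escape for this parse error is to backtick-quote the identifier. The helper below is a hypothetical sketch (`quote_ident` and the `mytable` name are illustrative, not part of any Spark API) of how a non-reserved keyword such as `interval` can be passed safely into a Spark SQL string.

```python
# Hypothetical helper (not a Spark API): backtick-quote an identifier
# so that a non-reserved keyword like "interval" parses as a column
# name rather than the start of an interval literal.
def quote_ident(name: str) -> str:
    # Spark SQL escapes a literal backtick inside a quoted
    # identifier by doubling it.
    return "`" + name.replace("`", "``") + "`"

# "SELECT interval FROM mytable" raises the ParseException quoted
# above, but the backtick-quoted form is accepted by the parser:
query = f"SELECT {quote_ident('interval')} FROM mytable"
print(query)  # SELECT `interval` FROM mytable
```

This sidesteps the bug from the user side; the actual fix in the linked pull request changes the grammar so the unquoted form also works.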
[jira] [Updated] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42553: -- Fix Version/s: 3.4.1 (was: 3.4.0)
[jira] [Updated] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42644: -- Fix Version/s: 3.4.1 (was: 3.4.0)

> Add `hive` dependency to `connect` module
> ---
>
> Key: SPARK-42644
> URL: https://issues.apache.org/jira/browse/SPARK-42644
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Fix For: 3.4.1
[jira] [Resolved] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42644. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40246 [https://github.com/apache/spark/pull/40246]
[jira] [Assigned] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42644: - Assignee: Dongjoon Hyun
[jira] [Resolved] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42553. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40195 [https://github.com/apache/spark/pull/40195]
[jira] [Assigned] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42553: Assignee: jiang13021 > NonReserved keyword "interval" can't be column name > --- > > Key: SPARK-42553 > URL: https://issues.apache.org/jira/browse/SPARK-42553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.2.3, 3.3.2 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Assignee: jiang13021 >Priority: Major > > INTERVAL is a Non-Reserved keyword in spark. "Non-Reserved keywords" have a > special meaning in particular contexts and can be used as identifiers in > other contexts. So by design, interval can be used as a column name. > {code:java} > scala> spark.sql("select interval from mytable") > org.apache.spark.sql.catalyst.parser.ParseException: > at least one time unit should be given for interval literal(line 1, pos 7)== > SQL == > select interval from mytable > ---^^^ at > org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$parseIntervalLiteral$1(AstBuilder.scala:2481) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.parseIntervalLiteral(AstBuilder.scala:2466) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitInterval$1(AstBuilder.scala:2432) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:2431) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalContext.accept(SqlBaseParser.java:17308) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitIntervalLiteral(SqlBaseBaseVisitor.java:1581) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalLiteralContext.accept(SqlBaseParser.java:16929) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitConstantDefault(SqlBaseBaseVisitor.java:1511) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:15905) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitValueExpressionDefault(SqlBaseBaseVisitor.java:1392) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:15298) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPredicated$1(AstBuilder.scala:1548) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:1547) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$PredicatedContext.accept(SqlBaseParser.java:14745) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitExpression(SqlBaseBaseVisitor.java:1343) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExpressionContext.accept(SqlBaseParser.java:14606) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitNamedExpression$1(AstBuilder.scala:1434) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:1433) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$NamedExpressionContext.accept(SqlBaseParser.java:14124) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at
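The traces above show the parser consuming a bare `interval` as the start of an interval literal rather than as a column reference. Independent of the fix, the standard workaround is to quote the identifier (backticks in Spark SQL: SELECT `interval` FROM mytable). The sketch below illustrates the same keyword-vs-identifier quoting idea with Python's stdlib sqlite3, used here only because it needs no Spark session; the reserved word (`order`) and double-quote quoting are SQLite conventions, an analogy rather than Spark behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Quoting lets a keyword be used as a column name
conn.execute('CREATE TABLE mytable ("order" INTEGER)')
conn.execute("INSERT INTO mytable VALUES (7)")

# Unquoted, the parser reads the keyword, not the column -- analogous to
# Spark's "at least one time unit should be given for interval literal"
try:
    conn.execute("SELECT order FROM mytable")
except sqlite3.OperationalError as e:
    print("parse error:", e)

# Quoted, the same name parses as an ordinary identifier
print(conn.execute('SELECT "order" FROM mytable').fetchone())
```

In Spark SQL the quoting mechanism is backticks rather than double quotes, but the principle is identical: quoting forces the token to be read as an identifier.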
[jira] [Updated] (SPARK-42642) Make Python the first code example tab in the Spark documentation
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Summary: Make Python the first code example tab in the Spark documentation (was: Make Python the first code example tab) > Make Python the first code example tab in the Spark documentation > - > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language, so it should be the > default language in code examples. This change makes Python the first code > example tab consistently across the documentation, where applicable. > This is continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > where these two pages were updated: > [https://spark.apache.org/docs/latest/sql-getting-started.html] > [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] > > Pages being updated now: > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-migration-guide.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > 
[https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > > [https://spark.apache.org/docs/latest/quick-start.html] > > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > > [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] > [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html] > [https://spark.apache.org/docs/latest/sql-data-sources-csv.html] > [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html] > [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html] > [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] > [https://spark.apache.org/docs/latest/sql-data-sources-json.html] > [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] > sql-data-sources-protobuf.html > [https://spark.apache.org/docs/latest/sql-data-sources-text.html] > [https://spark.apache.org/docs/latest/sql-migration-guide.html] > [https://spark.apache.org/docs/latest/sql-performance-tuning.html] > [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] > > [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > > [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] > 
[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples so this makes Python the first code example tab consistently across the documentation, where applicable. This is continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 where these two pages were updated: [https://spark.apache.org/docs/latest/sql-getting-started.html] [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] Pages being updated now: [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] 
[https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/quick-start.html] [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html] [https://spark.apache.org/docs/latest/sql-data-sources-csv.html] [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html] [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html] [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] [https://spark.apache.org/docs/latest/sql-data-sources-json.html] [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] sql-data-sources-protobuf.html [https://spark.apache.org/docs/latest/sql-data-sources-text.html] [https://spark.apache.org/docs/latest/sql-migration-guide.html] [https://spark.apache.org/docs/latest/sql-performance-tuning.html] [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] was: Python is the most approachable and most popular language so it should be the default language in code examples so this makes Python the first code example tab consistently across the documentation, where applicable. 
This is continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 where these two pages were updated: [https://spark.apache.org/docs/latest/sql-getting-started.html] [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] Pages being updated now: [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-data-types.html]
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples so this makes Python the first code example tab consistently across the documentation, where applicable. This is continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 where these two pages were updated: [https://spark.apache.org/docs/latest/sql-getting-started.html] [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] Pages being updated now: [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] 
[https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/quick-start.html] [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html] [https://spark.apache.org/docs/latest/sql-data-sources-csv.html] [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html] [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html] [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] [https://spark.apache.org/docs/latest/sql-data-sources-json.html] [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] sql-data-sources-protobuf.md [https://spark.apache.org/docs/latest/sql-data-sources-text.html] [https://spark.apache.org/docs/latest/sql-migration-guide.html] [https://spark.apache.org/docs/latest/sql-performance-tuning.html] [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] was: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. 
Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html]
[jira] [Commented] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695465#comment-17695465 ] Apache Spark commented on SPARK-42646: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40247 > Upgrade cyclonedx from 2.7.3 to 2.7.5 > > > Key: SPARK-42646 > URL: https://issues.apache.org/jira/browse/SPARK-42646 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > > !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png!
[jira] [Assigned] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42646: Assignee: Apache Spark > Upgrade cyclonedx from 2.7.3 to 2.7.5 > > > Key: SPARK-42646 > URL: https://issues.apache.org/jira/browse/SPARK-42646 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > > !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png!
[jira] [Assigned] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42646: Assignee: (was: Apache Spark) > Upgrade cyclonedx from 2.7.3 to 2.7.5 > > > Key: SPARK-42646 > URL: https://issues.apache.org/jira/browse/SPARK-42646 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > > !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png!
[jira] [Assigned] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42644: Assignee: (was: Apache Spark) > Add `hive` dependency to `connect` module > - > > Key: SPARK-42644 > URL: https://issues.apache.org/jira/browse/SPARK-42644 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Assigned] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42644: Assignee: Apache Spark > Add `hive` dependency to `connect` module > - > > Key: SPARK-42644 > URL: https://issues.apache.org/jira/browse/SPARK-42644 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695464#comment-17695464 ] Apache Spark commented on SPARK-42644: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40246 > Add `hive` dependency to `connect` module > - > > Key: SPARK-42644 > URL: https://issues.apache.org/jira/browse/SPARK-42644 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Updated] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-42646: Description: !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png! (was: !image-2023-03-02-13-07-01-579.png!) > Upgrade cyclonedx from 2.7.3 to 2.7.5 > > > Key: SPARK-42646 > URL: https://issues.apache.org/jira/browse/SPARK-42646 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > > !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png!
[jira] [Created] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
BingKun Pan created SPARK-42646: --- Summary: Upgrade cyclonedx from 2.7.3 to 2.7.5 Key: SPARK-42646 URL: https://issues.apache.org/jira/browse/SPARK-42646 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan !image-2023-03-02-13-07-01-579.png!
[jira] [Created] (SPARK-42645) Introduce feature to allow for function caching across input rows.
Michael Tong created SPARK-42645: Summary: Introduce feature to allow for function caching across input rows. Key: SPARK-42645 URL: https://issues.apache.org/jira/browse/SPARK-42645 Project: Spark Issue Type: Wish Components: Optimizer Affects Versions: 3.3.2 Reporter: Michael Tong Introduce the ability to make functions cacheable across input rows. I imagine this working similarly to Python's [functools.cache|https://docs.python.org/3/library/functools.html#functools.cache], where you could add a decorator to expensive functions that you know will regularly encounter repeated values as the input data is read. This feature could significantly speed up many real-world jobs that apply expensive functions to data that naturally has repeated column values. An example would be parsing user agent fields from internet traffic logs partitioned by user id. Even though the data is not sorted by user agent, a sample of 10k consecutive rows would contain far fewer than 10k unique values, because popular user agents appear in a large fraction of traffic and the user agent of a user's first event is likely shared by all of that user's subsequent events. Currently, an approximation of this can be hacked together in Python via pandas_udfs: pandas_udfs read input in batches of 10k rows by default, so you can use a caching UDF whose cache empties every 10k rows. At my current job I have noticed that this trick can significantly speed up queries where custom UDFs are the bottleneck.
An example of this is
{code:python}
@F.pandas_udf(T.StringType())
def parse_user_agent_field(user_agent_series):
    @functools.cache
    def parse_user_agent_field_helper(user_agent):
        # parse the user agent and return the relevant field
        return None
    return user_agent_series.apply(parse_user_agent_field_helper)
{code}
It would be nice if there were official support for this behavior for both built-in functions and UDFs. If there were, I'd imagine it looking something like
{code:python}
# using the PySpark DataFrame API
df = df.withColumn(output_col, F.cache(F.function)(input_col))
{code}
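The batching trick described above can be seen in miniature with plain functools.cache: when a column has few distinct values, only the distinct values pay the parsing cost. A minimal stdlib-only sketch (the user-agent strings and the split-based "parse" are made up for illustration, standing in for a genuinely expensive parser):

```python
import functools

@functools.cache
def parse_browser(user_agent: str) -> str:
    # stand-in for an expensive user-agent parse; extracts the product token
    return user_agent.split("/", 1)[0]

# 3,000 rows but only 3 distinct user agents, as in partitioned traffic logs
rows = ["Chrome/110.0", "Safari/16.3", "Mobile/15E148"] * 1000
browsers = [parse_browser(ua) for ua in rows]

info = parse_browser.cache_info()
print(info.misses, "parses for", len(rows), "rows")  # 3 parses for 3000 rows
```

The cache never empties here; the pandas_udf version above gets an implicit bound because each 10k-row batch rebuilds the inner cache, which is what keeps memory in check on high-cardinality data.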
[jira] [Created] (SPARK-42644) Add `hive` dependency to `connect` module
Dongjoon Hyun created SPARK-42644: - Summary: Add `hive` dependency to `connect` module Key: SPARK-42644 URL: https://issues.apache.org/jira/browse/SPARK-42644 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 3.4.0 Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-42521) Add NULL values for INSERT commands with user-specified lists of fewer columns than the target table
[ https://issues.apache.org/jira/browse/SPARK-42521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-42521: -- Assignee: Daniel > Add NULL values for INSERT commands with user-specified lists of fewer > columns than the target table > > > Key: SPARK-42521 > URL: https://issues.apache.org/jira/browse/SPARK-42521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major >
[jira] [Resolved] (SPARK-42521) Add NULL values for INSERT commands with user-specified lists of fewer columns than the target table
[ https://issues.apache.org/jira/browse/SPARK-42521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-42521. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40229 [https://github.com/apache/spark/pull/40229] > Add NULL values for INSERT commands with user-specified lists of fewer > columns than the target table > > > Key: SPARK-42521 > URL: https://issues.apache.org/jira/browse/SPARK-42521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.4.0 > >
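The behavior SPARK-42521 adds, where an INSERT whose column list names fewer columns than the target table fills the omitted columns with NULL, mirrors what most SQL engines already do. A quick illustration using Python's stdlib sqlite3 (an analogy for the semantics; the Spark change applies this to Spark's own INSERT path, and the table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (a INTEGER, b TEXT, c REAL)")

# The column list names only `a`; the engine fills b and c with NULL
conn.execute("INSERT INTO target (a) VALUES (42)")

print(conn.execute("SELECT a, b, c FROM target").fetchone())  # (42, None, None)
```

Columns omitted from the list take their defaults, and with no DEFAULT clause declared, that default is NULL, which is the same rule the Spark sub-task implements.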
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] 
[https://spark.apache.org/docs/latest/mllib-feature-extraction.html] [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] [https://spark.apache.org/docs/latest/quick-start.html] was: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] 
[https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: >
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Attachment: Screenshot 2023-03-01 at 8.10.22 PM.png > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. > Pages being updated: > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > [https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > 
[https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Attachment: Screenshot 2023-03-01 at 8.10.08 PM.png > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. > Pages being updated: > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > [https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > 
[https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] 
[https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] was: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] 
[https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. > Pages being updated: >
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] 
[https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] was: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. 
> Pages being updated: > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > [https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > [https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > > -- This message was sent by Atlassian Jira
[jira] [Commented] (SPARK-42643) Implement `spark.udf.registerJavaFunction`
[ https://issues.apache.org/jira/browse/SPARK-42643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695429#comment-17695429 ] Apache Spark commented on SPARK-42643: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/40244 > Implement `spark.udf.registerJavaFunction` > -- > > Key: SPARK-42643 > URL: https://issues.apache.org/jira/browse/SPARK-42643 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `spark.udf.registerJavaFunction`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42643) Implement `spark.udf.registerJavaFunction`
[ https://issues.apache.org/jira/browse/SPARK-42643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42643: Assignee: Apache Spark > Implement `spark.udf.registerJavaFunction` > -- > > Key: SPARK-42643 > URL: https://issues.apache.org/jira/browse/SPARK-42643 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Implement `spark.udf.registerJavaFunction`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42643) Implement `spark.udf.registerJavaFunction`
[ https://issues.apache.org/jira/browse/SPARK-42643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42643: Assignee: (was: Apache Spark) > Implement `spark.udf.registerJavaFunction` > -- > > Key: SPARK-42643 > URL: https://issues.apache.org/jira/browse/SPARK-42643 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `spark.udf.registerJavaFunction`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695428#comment-17695428 ] Apache Spark commented on SPARK-41823: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40245 > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in 
_handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42643) Implement `spark.udf.registerJavaFunction`
Xinrong Meng created SPARK-42643: Summary: Implement `spark.udf.registerJavaFunction` Key: SPARK-42643 URL: https://issues.apache.org/jira/browse/SPARK-42643 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `spark.udf.registerJavaFunction`.
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Summary: Make Python the first code example tab (was: Make Python the first code example tab - ) > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42642) Make Python the first code example tab -
Allan Folting created SPARK-42642: - Summary: Make Python the first code example tab - Key: SPARK-42642 URL: https://issues.apache.org/jira/browse/SPARK-42642 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 3.5.0 Reporter: Allan Folting Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable.
[jira] [Updated] (SPARK-39316) Merge PromotePrecision and CheckOverflow into decimal binary arithmetic
[ https://issues.apache.org/jira/browse/SPARK-39316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-39316: -- Description:
Fix a bug in `TypeCoercion`, for example:
{code:java}
SELECT CAST(1 AS DECIMAL(28, 2))
UNION ALL
SELECT CAST(1 AS DECIMAL(18, 2)) / CAST(1 AS DECIMAL(18, 2));
{code}
The union result data type is not correct according to the formula:
|| Operation || Result Precision || Result Scale ||
| e1 union e2 | max(s1, s2) + max(p1-s1, p2-s2) | max(s1, s2) |
{code:java}
-- before
-- query schema
decimal(28,2)
-- query output
1.00
1.00

-- after
-- query schema
decimal(38,20)
-- query output
1.00000000000000000000
1.00000000000000000000
{code}

was:
Merge {{PromotePrecision}} into {{dataType}}, for example, {{Add}}:
{code:java}
override def dataType: DataType = (left, right) match {
  case (DecimalType.Expression(p1, s1), DecimalType.Expression(p2, s2)) =>
    val resultScale = max(s1, s2)
    if (allowPrecisionLoss) {
      DecimalType.adjustPrecisionScale(max(p1 - s1, p2 - s2) + resultScale + 1, resultScale)
    } else {
      DecimalType.bounded(max(p1 - s1, p2 - s2) + resultScale + 1, resultScale)
    }
  case _ => super.dataType
}
{code}
Merge {{CheckOverflow}}, for example, {{Add}} eval:
{code:java}
dataType match {
  case decimalType: DecimalType =>
    val value = numeric.plus(input1, input2)
    checkOverflow(value.asInstanceOf[Decimal], decimalType)
  ...
}
{code}

> Merge PromotePrecision and CheckOverflow into decimal binary arithmetic
> ---
>
> Key: SPARK-39316
> URL: https://issues.apache.org/jira/browse/SPARK-39316
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.4.0
>
> Fix a bug in `TypeCoercion`, for example:
> {code:java}
> SELECT CAST(1 AS DECIMAL(28, 2))
> UNION ALL
> SELECT CAST(1 AS DECIMAL(18, 2)) / CAST(1 AS DECIMAL(18, 2));
> {code}
> The union result data type is not correct according to the formula:
> || Operation || Result Precision || Result Scale ||
> | e1 union e2 | max(s1, s2) + max(p1-s1, p2-s2) | max(s1, s2) |
> {code:java}
> -- before
> -- query schema
> decimal(28,2)
> -- query output
> 1.00
> 1.00
>
> -- after
> -- query schema
> decimal(38,20)
> -- query output
> 1.00000000000000000000
> 1.00000000000000000000
> {code}
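The decimal result-type rules referenced in SPARK-39316 can be sanity-checked with a short standalone sketch. This is our own minimal reimplementation of the formulas (not Spark's actual code), assuming Spark's defaults: a maximum precision of 38 and spark.sql.decimalOperations.allowPrecisionLoss=true; the helper names are ours.

```python
# Sketch of Spark's decimal result-type arithmetic (assumed defaults, not Spark source).
MAX_PRECISION = 38
MIN_ADJUSTED_SCALE = 6

def adjust(precision, scale):
    # Mirrors DecimalType.adjustPrecisionScale: when nominal precision exceeds 38,
    # give up fractional digits first, but keep at least min(scale, 6) of them.
    if precision <= MAX_PRECISION:
        return precision, scale
    int_digits = precision - scale
    min_scale = min(scale, MIN_ADJUSTED_SCALE)
    adjusted_scale = max(MAX_PRECISION - int_digits, min_scale)
    return MAX_PRECISION, adjusted_scale

def divide(p1, s1, p2, s2):
    # e1 / e2 -> scale max(6, s1 + p2 + 1), precision p1 - s1 + s2 + scale, then adjusted.
    scale = max(MIN_ADJUSTED_SCALE, s1 + p2 + 1)
    return adjust(p1 - s1 + s2 + scale, scale)

def union(p1, s1, p2, s2):
    # e1 UNION e2 -> the wider type from the table above, simply capped at 38.
    scale = max(s1, s2)
    precision = scale + max(p1 - s1, p2 - s2)
    return min(precision, MAX_PRECISION), min(scale, MAX_PRECISION)

# DECIMAL(18, 2) / DECIMAL(18, 2): nominal (39, 21) adjusts down to (38, 20).
print(divide(18, 2, 18, 2))
# DECIMAL(28, 2) UNION the division result (38, 20): capped at (38, 20),
# matching the corrected query schema in the issue.
print(union(28, 2, 38, 20))
```

Running the two checks reproduces the decimal(38,20) schema shown in the "after" output of the issue, and union(28, 2, 18, 2) gives (28, 2), matching the "before" schema where the division's widened type was not yet accounted for.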
[jira] [Assigned] (SPARK-42641) Upgrade buf to v1.15.0
[ https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42641: Assignee: (was: Apache Spark) > Upgrade buf to v1.15.0 > -- > > Key: SPARK-42641 > URL: https://issues.apache.org/jira/browse/SPARK-42641 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42641) Upgrade buf to v1.15.0
[ https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42641: Assignee: Apache Spark > Upgrade buf to v1.15.0 > -- > > Key: SPARK-42641 > URL: https://issues.apache.org/jira/browse/SPARK-42641 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42641) Upgrade buf to v1.15.0
[ https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695404#comment-17695404 ] Apache Spark commented on SPARK-42641: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40243 > Upgrade buf to v1.15.0 > -- > > Key: SPARK-42641 > URL: https://issues.apache.org/jira/browse/SPARK-42641 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42641) Upgrade buf to v1.15.0
Ruifeng Zheng created SPARK-42641: - Summary: Upgrade buf to v1.15.0 Key: SPARK-42641 URL: https://issues.apache.org/jira/browse/SPARK-42641 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-42631: -- Epic Link: SPARK-42554 > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Assignee: Tom van Bussel >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42631. --- Fix Version/s: 3.4.1 Assignee: Tom van Bussel Resolution: Fixed > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Assignee: Tom van Bussel >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42640) Remove stale entries from the excluding rules for CompabilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42640: Assignee: Apache Spark (was: Rui Wang) > Remove stale entries from the excluding rules for CompabilitySuite > -- > > Key: SPARK-42640 > URL: https://issues.apache.org/jira/browse/SPARK-42640 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42640) Remove stale entries from the excluding rules for CompabilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42640: Assignee: Rui Wang (was: Apache Spark) > Remove stale entries from the excluding rules for CompabilitySuite > -- > > Key: SPARK-42640 > URL: https://issues.apache.org/jira/browse/SPARK-42640 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42640) Remove stale entries from the excluding rules for CompabilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695387#comment-17695387 ] Apache Spark commented on SPARK-42640: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40241 > Remove stale entries from the excluding rules for CompabilitySuite > -- > > Key: SPARK-42640 > URL: https://issues.apache.org/jira/browse/SPARK-42640 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695386#comment-17695386 ] Apache Spark commented on SPARK-42639: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40242 > Add createDataFrame/createDataset to SparkSession > - > > Key: SPARK-42639 > URL: https://issues.apache.org/jira/browse/SPARK-42639 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42639: Assignee: Apache Spark (was: Herman van Hövell) > Add createDataFrame/createDataset to SparkSession > - > > Key: SPARK-42639 > URL: https://issues.apache.org/jira/browse/SPARK-42639 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > > Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42639: Assignee: Herman van Hövell (was: Apache Spark) > Add createDataFrame/createDataset to SparkSession > - > > Key: SPARK-42639 > URL: https://issues.apache.org/jira/browse/SPARK-42639 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42640) Remove stale entries from the excluding rules for CompabilitySuite
Rui Wang created SPARK-42640: Summary: Remove stale entries from the excluding rules for CompabilitySuite Key: SPARK-42640 URL: https://issues.apache.org/jira/browse/SPARK-42640 Project: Spark Issue Type: Task Components: Connect Affects Versions: 3.4.1 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
Herman van Hövell created SPARK-42639: - Summary: Add createDataFrame/createDataset to SparkSession Key: SPARK-42639 URL: https://issues.apache.org/jira/browse/SPARK-42639 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-42639: - Assignee: Herman van Hövell > Add createDataFrame/createDataset to SparkSession > - > > Key: SPARK-42639 > URL: https://issues.apache.org/jira/browse/SPARK-42639 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42493. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40087 [https://github.com/apache/spark/pull/40087] > Spark SQL, DataFrames and Datasets Guide - make Python the first code example > tab > - > > Key: SPARK-42493 > URL: https://issues.apache.org/jira/browse/SPARK-42493 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Assignee: Allan Folting >Priority: Major > Fix For: 3.5.0 > > > Python is the easiest approachable and most popular language so it should be > the primary language in examples etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42493: Assignee: Allan Folting > Spark SQL, DataFrames and Datasets Guide - make Python the first code example > tab > - > > Key: SPARK-42493 > URL: https://issues.apache.org/jira/browse/SPARK-42493 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Assignee: Allan Folting >Priority: Major > > Python is the easiest approachable and most popular language so it should be > the primary language in examples etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42613. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40212 [https://github.com/apache/spark/pull/40212] > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > Fix For: 3.5.0 > > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
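The precedence described in SPARK-42613 — an explicit OMP_NUM_THREADS wins, otherwise default to `spark.task.cpus` rather than `spark.executor.cores` — can be sketched as a small helper. This is a hypothetical illustration (the function name and dict-based conf are invented here); the actual change lives in Spark's PythonRunner:

```python
import os

def resolve_omp_num_threads(conf, env=None):
    """Pick the OpenMP thread count for a Python worker process.

    An explicitly set OMP_NUM_THREADS always wins. Otherwise fall back to
    spark.task.cpus (the CPUs one task may use), not spark.executor.cores,
    which is shared by all tasks running concurrently on the executor.
    """
    env = os.environ if env is None else env
    if "OMP_NUM_THREADS" in env:
        return int(env["OMP_NUM_THREADS"])
    return int(conf.get("spark.task.cpus", "1"))
```

With `spark.task.cpus=2` and `spark.executor.cores=8`, the old default would let each of four concurrent tasks spawn eight OpenMP threads; the new default caps each at two.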
[jira] [Assigned] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42613: Assignee: John Zhuge > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42632) Fix scala paths in tests
[ https://issues.apache.org/jira/browse/SPARK-42632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42632. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40235 [https://github.com/apache/spark/pull/40235] > Fix scala paths in tests > > > Key: SPARK-42632 > URL: https://issues.apache.org/jira/browse/SPARK-42632 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > The jar resolution in the connect client tests can resolve the jar for the > wrong scala version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42637) Add SparkSession.stop
[ https://issues.apache.org/jira/browse/SPARK-42637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42637. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40239 [https://github.com/apache/spark/pull/40239] > Add SparkSession.stop > - > > Key: SPARK-42637 > URL: https://issues.apache.org/jira/browse/SPARK-42637 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Add SparkSession.stop() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42458: Assignee: Takuya Ueshin > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42458. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40240 [https://github.com/apache/spark/pull/40240] > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
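The failing doctest above passes a flat DDL string ("age INT, name STRING") as the schema, which sidesteps type inference on the `None` value. As a rough illustration of what accepting such a string involves, here is a toy parser that splits a flat DDL schema into (name, type) pairs — hypothetical code only; the real implementation delegates to Spark's DDL parser and handles nested and complex types:

```python
def parse_ddl_fields(ddl):
    """Split a flat DDL schema string like 'age INT, name STRING' into
    (name, type) pairs. Toy sketch: no nested types, quoting, or comments."""
    fields = []
    for part in ddl.split(","):
        # First whitespace-separated token is the column name, the rest is the type.
        name, typ = part.strip().split(None, 1)
        fields.append((name, typ.strip().upper()))
    return fields
```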
[jira] [Assigned] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42458: Assignee: Apache Spark > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695351#comment-17695351 ] Apache Spark commented on SPARK-42458: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40240 > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42458: Assignee: (was: Apache Spark) > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40159) Aggregate should be group only after collapse project to aggregate
[ https://issues.apache.org/jira/browse/SPARK-40159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695327#comment-17695327 ] Ritika Maheshwari commented on SPARK-40159: --- This issue seems to have been resolved by SPARK-38489 > Aggregate should be group only after collapse project to aggregate > -- > > Key: SPARK-40159 > URL: https://issues.apache.org/jira/browse/SPARK-40159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > > CollapseProject rule will merge project expressions into AggregateExpressions > in aggregate, which will make the *aggregate.groupOnly* to false. > {code} > val df = testData.distinct().select('key + 1, ('key + 1).cast("long")) > df.queryExecution.optimizedPlan.collect { > case a: Aggregate => a > }.foreach(agg => assert(agg.groupOnly === true)) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42638) current_user() is blocked from VALUES, but current_timestamp() is not
Serge Rielau created SPARK-42638: Summary: current_user() is blocked from VALUES, but current_timestamp() is not Key: SPARK-42638 URL: https://issues.apache.org/jira/browse/SPARK-42638 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Serge Rielau VALUES(current_user()); returns: cannot evaluate expression current_user() in inline table definition.; line 1 pos 8 The same statement with current_timestamp() works. It appears current_user() is recognized as non-deterministic. But it is constant within the statement, just like current_timestamp(). PS: It's not clear why we block non-deterministic functions to begin with. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
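The distinction the report draws — "non-deterministic" yet constant within a statement — is the usual semantics for functions like current_user() and current_timestamp(): they are snapshotted once when the statement starts, so every reference inside the statement agrees. A minimal sketch of that snapshotting (hypothetical class, not Spark's analyzer):

```python
import datetime

class StatementContext:
    """Per-statement snapshot: statement-constant functions are evaluated
    once at statement start, so all references within the statement see
    the same value, even though two statements may see different ones."""
    def __init__(self, user):
        self.user = user
        self.timestamp = datetime.datetime.now(datetime.timezone.utc)

    def current_user(self):
        return self.user

    def current_timestamp(self):
        return self.timestamp

ctx = StatementContext(user="serge")
# Every call within one "statement" returns the identical snapshot.
```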
[jira] [Commented] (SPARK-42633) Use the actual schema in a LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-42633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695280#comment-17695280 ] Apache Spark commented on SPARK-42633: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40238 > Use the actual schema in a LocalRelation > > > Key: SPARK-42633 > URL: https://issues.apache.org/jira/browse/SPARK-42633 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Make the LocalRelation proto take an actual schema message instead of a > string. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42637) Add SparkSession.stop
[ https://issues.apache.org/jira/browse/SPARK-42637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695279#comment-17695279 ] Apache Spark commented on SPARK-42637: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40239 > Add SparkSession.stop > - > > Key: SPARK-42637 > URL: https://issues.apache.org/jira/browse/SPARK-42637 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Add SparkSession.stop() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42635: Assignee: (was: Apache Spark) > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. 
> {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking > overflow. Note that we do have overflow checking in adding the amount to the > timestamp, so the behavior is inconsistent. > This can cause counter-intuitive results like this: > {code:scala} > scala> spark.sql("select timestampadd(quarter, 1431655764, > timestamp'1970-01-01')").show > +--+ > |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')| > +--+ > | 1969-09-01 00:00:00| > +--+{code} > 3. Adding sub-month units (week, day, hour, minute, second, millisecond, > microsecond) silently ignores Long overflow during unit conversion. > This is similar to the previous problem: > {code:scala} > scala> spark.sql("select timestampadd(day, 106751992, > timestamp'1970-01-01')").show(false) > +-+ > |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')| > +-+ > |-290308-12-22 15:58:10.448384| > +-+{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
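The DST discontinuity described in SPARK-42635 can be reproduced with the Python standard library's zoneinfo: adding real elapsed time (via UTC) to 2011-03-12 03:00:00 in America/Los_Angeles lands on 03:59:59 for 24*3600 - 1 seconds but on 04:00:00 for 24*3600 seconds, whereas naive wall-clock arithmetic — which is what the buggy day/time-in-day split effectively computes — yields 03:00:00, an hour earlier. A sketch (assumes the tz database is available):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

LA = ZoneInfo("America/Los_Angeles")
start = datetime(2011, 3, 12, 3, 0, tzinfo=LA)  # the day before the DST jump

def add_elapsed(ts, seconds):
    """Add real elapsed time: go through UTC so the DST transition is honored."""
    return (ts.astimezone(timezone.utc) + timedelta(seconds=seconds)).astimezone(LA)

almost_a_day = add_elapsed(start, 24 * 3600 - 1)  # 2011-03-13 03:59:59 PDT
full_day = add_elapsed(start, 24 * 3600)          # 2011-03-13 04:00:00 PDT

# Naive wall-clock arithmetic (same naive date, timezone ignored during the
# addition) gives 03:00:00 — before the (24*3600 - 1) result above.
wall_clock = start + timedelta(seconds=24 * 3600)
```

Only 23 * 3600 seconds of real time elapse between 03:00:00 on the two days, which is why the monotonic (elapsed-time) answer for a full 24 * 3600 seconds is 04:00:00, not 03:00:00.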
[jira] [Commented] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695278#comment-17695278 ] Apache Spark commented on SPARK-42635: -- User 'chenhao-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40237 > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. 
> The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. > {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking > overflow. Note that we do have overflow checking in adding the amount to the > timestamp, so the behavior is inconsistent. > This can cause counter-intuitive results like this: > {code:scala} > scala> spark.sql("select timestampadd(quarter, 1431655764, > timestamp'1970-01-01')").show > +--+ > |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')| > +--+ > | 1969-09-01 00:00:00| > +--+{code} > 3. Adding sub-month units (week, day, hour, minute, second, millisecond, > microsecond) silently ignores Long overflow during unit conversion. > This is similar to the previous problem: > {code:scala} > scala> spark.sql("select timestampadd(day, 106751992, > timestamp'1970-01-01')").show(false) > +-+ > |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')| > +-+ > |-290308-12-22 15:58:10.448384| > +-+{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42635: Assignee: Apache Spark > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Assignee: Apache Spark >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. 
> The root cause is
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]:
> {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking in adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42637) Add SparkSession.stop
Herman van Hövell created SPARK-42637: - Summary: Add SparkSession.stop Key: SPARK-42637 URL: https://issues.apache.org/jira/browse/SPARK-42637 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Assignee: Herman van Hövell Add SparkSession.stop() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42636) Audit annotation usage
Herman van Hövell created SPARK-42636:
-----------------------------------------

             Summary: Audit annotation usage
                 Key: SPARK-42636
                 URL: https://issues.apache.org/jira/browse/SPARK-42636
             Project: Spark
          Issue Type: New Feature
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Herman van Hövell

Annotation usage is not entirely consistent in the client. We should probably remove all Stable annotations and add a few DeveloperApi ones.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42634) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenhao Li updated SPARK-42634:
-------------------------------
    Description:     (was: # When the time is close to a daylight saving time transition, the result may be discontinuous and not monotonic.
We currently have:
{{scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------------+
|timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------------+
|                                                     2011-03-13 03:59:59|
+------------------------------------------------------------------------+
scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------+
|timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------+
|                                               2011-03-13 03:00:00|
+------------------------------------------------------------------+}}
In the second query, adding one more second sets the time back one hour instead. Plus, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to 2011-03-13 03:00:00, instead of 24 * 3600 seconds, due to the daylight saving time transition.
The root cause of the problem is that the Spark code at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790 wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day and time-in-day split before looking at the timezone.
2. Adding month, quarter, and year silently ignores Int overflow during unit conversion.
The root cause is https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246. quantity is multiplied by 3 or MONTHS_PER_YEAR without checking overflow. Note that we do have overflow checking in adding the amount to the timestamp, so the behavior is inconsistent.
This can cause counter-intuitive results like this:
{{scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
+------------------------------------------------------------------+
|timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
+------------------------------------------------------------------+
|                                               1969-09-01 00:00:00|
+------------------------------------------------------------------+}}
3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.
This is similar to the previous problem:
{{scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
+-------------------------------------------------------------+
|timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
+-------------------------------------------------------------+
|-290308-12-22 15:58:10.448384                                |
+-------------------------------------------------------------+}}
)

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42634
>                 URL: https://issues.apache.org/jira/browse/SPARK-42634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-42634) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenhao Li closed SPARK-42634.
------------------------------

This is a duplicate, created by mistake.

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42634
>                 URL: https://issues.apache.org/jira/browse/SPARK-42634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Priority: Major
>
> # When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
> We currently have:
> {{scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> |                                                     2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> |                                               2011-03-13 03:00:00|
> +------------------------------------------------------------------+}}
> In the second query, adding one more second sets the time back one hour
> instead. Plus, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to
> 2011-03-13 03:00:00, instead of 24 * 3600 seconds, due to the daylight saving
> time transition.
> The root cause of the problem is that the Spark code at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790
> wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day
> and time-in-day split before looking at the timezone.
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
> The root cause is
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246.
> quantity is multiplied by 3 or MONTHS_PER_YEAR without checking overflow.
> Note that we do have overflow checking in adding the amount to the timestamp,
> so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {{scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+}}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {{scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+}}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42634) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenhao Li resolved SPARK-42634.
--------------------------------
    Resolution: Fixed

duplicate

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42634
>                 URL: https://issues.apache.org/jira/browse/SPARK-42634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Priority: Major
>
> # When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
> We currently have:
> {{scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> |                                                     2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> |                                               2011-03-13 03:00:00|
> +------------------------------------------------------------------+}}
> In the second query, adding one more second sets the time back one hour
> instead. Plus, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to
> 2011-03-13 03:00:00, instead of 24 * 3600 seconds, due to the daylight saving
> time transition.
> The root cause of the problem is that the Spark code at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790
> wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day
> and time-in-day split before looking at the timezone.
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
> The root cause is
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246.
> quantity is multiplied by 3 or MONTHS_PER_YEAR without checking overflow.
> Note that we do have overflow checking in adding the amount to the timestamp,
> so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {{scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+}}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {{scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+}}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenhao Li updated SPARK-42635: --- Component/s: SQL (was: Spark Core) > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. 
> {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking in adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenhao Li updated SPARK-42635: --- Component/s: (was: SQL) > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. 
> {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking in adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41843) Implement SparkSession.udf
[ https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-41843. --- Fix Version/s: 3.4.0 Resolution: Fixed > Implement SparkSession.udf > -- > > Key: SPARK-41843 > URL: https://issues.apache.org/jira/browse/SPARK-41843 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 2331, in pyspark.sql.connect.functions.call_udf > Failed example: > _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) > AttributeError: 'SparkSession' object has no attribute 'udf'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38735) Test the error class: INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-38735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695259#comment-17695259 ]

Apache Spark commented on SPARK-38735:
--------------------------------------

User 'the8thC' has created a pull request for this issue:
https://github.com/apache/spark/pull/40236

> Test the error class: INTERNAL_ERROR
> ------------------------------------
>
>                 Key: SPARK-38735
>                 URL: https://issues.apache.org/jira/browse/SPARK-38735
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Priority: Minor
>              Labels: starter
>
> Add tests for the error class *INTERNAL_ERROR* to QueryExecutionErrorsSuite.
> The tests should cover the exceptions thrown in QueryExecutionErrors:
> {code:scala}
>   def logicalHintOperatorNotRemovedDuringAnalysisError(): Throwable = {
>     new SparkIllegalStateException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(
>         "Internal error: logical hint operator should have been removed during analysis"))
>   }
>
>   def cannotEvaluateExpressionError(expression: Expression): Throwable = {
>     new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(s"Cannot evaluate expression: $expression"))
>   }
>
>   def cannotGenerateCodeForExpressionError(expression: Expression): Throwable = {
>     new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(s"Cannot generate code for expression: $expression"))
>   }
>
>   def cannotTerminateGeneratorError(generator: UnresolvedGenerator): Throwable = {
>     new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(s"Cannot terminate expression: $generator"))
>   }
>
>   def methodNotDeclaredError(name: String): Throwable = {
>     new SparkNoSuchMethodException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(
>         s"""A method named "$name" is not declared in any enclosing class nor any supertype"""))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*:
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
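The three checks the ticket asks for (error class, entire message, sqlState when defined) follow a simple pattern. As a hypothetical illustration, here it is transliterated into Python; Spark's real suites are Scala, and the names below are made up for the sketch, not Spark's API:

```python
class SparkError(Exception):
    """Stand-in for Spark's error-class-carrying exceptions (illustrative)."""
    def __init__(self, error_class, message, sql_state=None):
        super().__init__(message)
        self.error_class = error_class
        self.sql_state = sql_state

def check_error(thunk, expected_class, expected_message, expected_sql_state=None):
    """Run thunk; verify the error class, the entire message, and sqlState if given."""
    try:
        thunk()
    except SparkError as e:
        assert e.error_class == expected_class, e.error_class
        assert str(e) == expected_message, str(e)
        if expected_sql_state is not None:
            assert e.sql_state == expected_sql_state, e.sql_state
        return
    raise AssertionError("expected an error, but none was raised")
```

Checking the entire message (rather than a substring) is what catches drift in the `messageParameters` formatting shown in the Scala snippets above.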
[jira] [Assigned] (SPARK-38735) Test the error class: INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-38735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38735: Assignee: (was: Apache Spark) > Test the error class: INTERNAL_ERROR > > > Key: SPARK-38735 > URL: https://issues.apache.org/jira/browse/SPARK-38735 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add tests for the error class *INTERNAL_ERROR* to QueryExecutionErrorsSuite. > The test should cover the exception throw in QueryExecutionErrors: > {code:scala} > def logicalHintOperatorNotRemovedDuringAnalysisError(): Throwable = { > new SparkIllegalStateException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > "Internal error: logical hint operator should have been removed > during analysis")) > } > def cannotEvaluateExpressionError(expression: Expression): Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot evaluate expression: $expression")) > } > def cannotGenerateCodeForExpressionError(expression: Expression): Throwable > = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot generate code for expression: > $expression")) > } > def cannotTerminateGeneratorError(generator: UnresolvedGenerator): > Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot terminate expression: $generator")) > } > def methodNotDeclaredError(name: String): Throwable = { > new SparkNoSuchMethodException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > s"""A method named "$name" is not declared in any enclosing class nor > any supertype""")) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > 
https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38735) Test the error class: INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-38735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695261#comment-17695261 ] Apache Spark commented on SPARK-38735: -- User 'the8thC' has created a pull request for this issue: https://github.com/apache/spark/pull/40236 > Test the error class: INTERNAL_ERROR > > > Key: SPARK-38735 > URL: https://issues.apache.org/jira/browse/SPARK-38735 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add tests for the error class *INTERNAL_ERROR* to QueryExecutionErrorsSuite. > The test should cover the exception throw in QueryExecutionErrors: > {code:scala} > def logicalHintOperatorNotRemovedDuringAnalysisError(): Throwable = { > new SparkIllegalStateException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > "Internal error: logical hint operator should have been removed > during analysis")) > } > def cannotEvaluateExpressionError(expression: Expression): Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot evaluate expression: $expression")) > } > def cannotGenerateCodeForExpressionError(expression: Expression): Throwable > = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot generate code for expression: > $expression")) > } > def cannotTerminateGeneratorError(generator: UnresolvedGenerator): > Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot terminate expression: $generator")) > } > def methodNotDeclaredError(name: String): Throwable = { > new SparkNoSuchMethodException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > s"""A method named "$name" is not declared in any enclosing class nor > any supertype""")) > } > {code} > For example, here is a test for the error class 
*UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38735) Test the error class: INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-38735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38735: Assignee: Apache Spark > Test the error class: INTERNAL_ERROR > > > Key: SPARK-38735 > URL: https://issues.apache.org/jira/browse/SPARK-38735 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Add tests for the error class *INTERNAL_ERROR* to QueryExecutionErrorsSuite. > The test should cover the exception throw in QueryExecutionErrors: > {code:scala} > def logicalHintOperatorNotRemovedDuringAnalysisError(): Throwable = { > new SparkIllegalStateException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > "Internal error: logical hint operator should have been removed > during analysis")) > } > def cannotEvaluateExpressionError(expression: Expression): Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot evaluate expression: $expression")) > } > def cannotGenerateCodeForExpressionError(expression: Expression): Throwable > = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot generate code for expression: > $expression")) > } > def cannotTerminateGeneratorError(generator: UnresolvedGenerator): > Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot terminate expression: $generator")) > } > def methodNotDeclaredError(name: String): Throwable = { > new SparkNoSuchMethodException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > s"""A method named "$name" is not declared in any enclosing class nor > any supertype""")) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > 
https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenhao Li updated SPARK-42635: --- Description: # When the time is close to a daylight saving time transition, the result may be discontinuous and non-monotonic. We currently have:
{code:scala}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
+-------------------------------------------------------------------------+
|timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
+-------------------------------------------------------------------------+
|                                                      2011-03-13 03:59:59|
+-------------------------------------------------------------------------+

scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
+-------------------------------------------------------------------+
|timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
+-------------------------------------------------------------------+
|                                                2011-03-13 03:00:00|
+-------------------------------------------------------------------+
{code}
In the second query, adding one more second sets the time back one hour instead. Moreover, there are only {{23 * 3600}} seconds from {{2011-03-12 03:00:00}} to {{2011-03-13 03:00:00}}, not {{24 * 3600}} seconds, due to the daylight saving time transition. The root cause is that the Spark code at [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and performs the day and time-in-day split before consulting the time zone. 2. Adding month, quarter, and year silently ignores Int overflow during unit conversion. The root cause is [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]: {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking for overflow. Note that we do check for overflow when adding the amount to the timestamp, so the behavior is inconsistent.
This can cause counter-intuitive results like this:
{code:scala}
scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
+------------------------------------------------------------------+
|timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
+------------------------------------------------------------------+
|                                               1969-09-01 00:00:00|
+------------------------------------------------------------------+
{code}
3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion. This is similar to the previous problem:
{code:scala}
scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
+-------------------------------------------------------------+
|timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
+-------------------------------------------------------------+
|-290308-12-22 15:58:10.448384                                |
+-------------------------------------------------------------+
{code}
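The 1969-09-01 result in the quarter example follows from 32-bit wraparound: 1431655764 quarters is 4294967292 months, which exceeds Int.MaxValue and wraps to -4, i.e. four months before the epoch. A sketch of the arithmetic, with plain Python standing in for Scala's silent Int multiplication:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

quantity = 1431655764          # quarters
months = quantity * 3          # 4294967292, well above INT_MAX

# Emulate Scala Int multiplication, which wraps around silently
wrapped = (months - INT_MIN) % 2**32 + INT_MIN
print(wrapped)  # -4: four months before 1970-01-01 is 1969-09-01
```

Using Math.multiplyExact (or Scala's multiplyExact equivalent) during the unit conversion would raise instead of wrapping, matching the overflow check already done when the amount is added to the timestamp.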
[jira] [Commented] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695254#comment-17695254 ] Apache Spark commented on SPARK-42631: -- User 'tomvanbussel' has created a pull request for this issue: https://github.com/apache/spark/pull/40234 > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Priority: Major >
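The jump to year -290308 in the sub-month example is the same wraparound at 64 bits: 106751992 days is 9223372108800000000 microseconds, just past Long.MaxValue, so the converted amount wraps to a huge negative number. A sketch of the arithmetic, again emulating Scala's silent wraparound in plain Python:

```python
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1
MICROS_PER_DAY = 86_400 * 1_000_000

days = 106751992
micros = days * MICROS_PER_DAY   # 9223372108800000000 > LONG_MAX

# Emulate Scala Long multiplication, which wraps around silently
wrapped = (micros - LONG_MIN) % 2**64 + LONG_MIN
print(wrapped < 0)  # True: the amount actually added is hugely negative,
                    # which is why the result lands far before the epoch
```

As with the Int case, an exact (overflow-checked) multiplication during unit conversion would surface an error instead of producing a nonsensical timestamp.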
[jira] [Assigned] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42631: Assignee: Apache Spark > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42631: Assignee: (was: Apache Spark) > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Priority: Major >
[jira] [Commented] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695252#comment-17695252 ] Apache Spark commented on SPARK-42631: -- User 'tomvanbussel' has created a pull request for this issue: https://github.com/apache/spark/pull/40234 > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Priority: Major >
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
> Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major >
[jira] [Updated] (SPARK-39993) Spark on Kubernetes doesn't filter data by date
[ https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanna Liashchuk updated SPARK-39993: Affects Version/s: (was: 3.3.2) > Spark on Kubernetes doesn't filter data by date > --- > > Key: SPARK-39993 > URL: https://issues.apache.org/jira/browse/SPARK-39993 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 > Environment: Kubernetes v1.23.6 > Spark 3.2.2 > Java 1.8.0_312 > Python 3.9.13 > Aws dependencies: > aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar >Reporter: Hanna Liashchuk >Priority: Major > Labels: kubernetes > > I'm creating a Dataset with a date-typed column and saving it to S3. When I read it back and use a where() clause, I've noticed it doesn't return the data even though it is there. > Below is the code snippet I'm running:
> {code:python}
> from pyspark.sql.functions import col, lit
> ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date",
> col("date").cast("date"))
> ds.where("date = '2022-01-01'").show()
> ds.write.mode("overwrite").parquet("s3a://bucket/test")
> df = spark.read.format("parquet").load("s3a://bucket/test")
> df.where("date = '2022-01-01'").show()
> {code}
> The first show() returns data, while the second one does not. > I've noticed that it is related to the Kubernetes master, as the same code snippet works fine with master "local". > UPD: if the column is used as a partition column and has the type "date", there is no filtering problem. > >