[jira] [Created] (SPARK-42648) Upgrade versions-maven-plugin to 2.15.0
Yang Jie created SPARK-42648:
Summary: Upgrade versions-maven-plugin to 2.15.0
Key: SPARK-42648
URL: https://issues.apache.org/jira/browse/SPARK-42648
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie

https://github.com/mojohaus/versions/releases/tag/2.15.0

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42647: Assignee: (was: Apache Spark)

> Remove aliases from deprecated numpy data types
> ---
>
> Key: SPARK-42647
> URL: https://issues.apache.org/jira/browse/SPARK-42647
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.4.0, 3.4.1
> Reporter: Aimilios Tsouvelekakis
> Priority: Major
>
> Numpy has started changing the aliases of some of its data types. This means that users with the latest version of numpy will face either warnings or errors, depending on the type they are using. This affects all users on numpy > 1.20.0. One of the types was fixed back in September with this [pull request|https://github.com/apache/spark/pull/37817].
> The problem can be split into two cases:
> [numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed. As of numpy 1.25.0 they give a warning.
> [numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using aliases of builtin types like np.int is deprecated since numpy 1.20.0 and removed since numpy 1.24.0.
> The changes are needed so pyspark can be compatible with the latest numpy and avoid
> * attribute errors on data types deprecated in version 1.20.0: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
> * warnings on data types deprecated in version 1.24.0: [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations]
>
> From my research I see the following: the only functional changes are in the conversion.py file. The rest of the changes are inside tests, in the user guide, or in docstrings describing specific functions. Since I am not an expert in these tests, I defer to the reviewers and to people with more experience in the pyspark code.
> These types are aliases for classic python types, so they should work with all numpy versions [1|https://numpy.org/devdocs/release/1.20.0-notes.html], [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python]. The error or warning comes from the call to numpy.
>
> For the affected versions I chose to include 3.3 and onwards, but I see that 3.2 is also still in the 18-month maintenance cadence, as it was released in October 2021.
>
> The pull request: [https://github.com/apache/spark/pull/40220]
> Best Regards,
> Aimilios
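The alias deprecations described in this issue can be illustrated with a short sketch. The replacement mapping below is an assumption drawn from the numpy 1.20.0 and 1.24.0 release notes linked above (not from the Spark patch itself), and the array construction shows why the fix is a plain rename: the builtin types work on every numpy version.

```python
import numpy as np

# Deprecated alias -> replacement, per the numpy 1.20.0 and 1.24.0
# release notes: builtin-type aliases map to plain Python builtins,
# the 0-bit-size aliases map to the trailing-underscore scalar types.
replacements = {
    "np.int": int, "np.float": float, "np.bool": bool,
    "np.object": object, "np.str": str,
    "np.int0": np.intp, "np.uint0": np.uintp,
    "np.bool8": np.bool_, "np.object0": np.object_,
    "np.str0": np.str_, "np.bytes0": np.bytes_, "np.void0": np.void,
}

# Using the builtin type directly works across numpy versions,
# whereas np.int raises AttributeError on numpy >= 1.24.0:
a = np.array([1, 2, 3], dtype=int)
print(a.dtype)  # platform-dependent integer dtype
```

The same one-line substitution is what the linked pull requests apply throughout the pyspark code and docstrings.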
[jira] [Commented] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695520#comment-17695520 ] Apache Spark commented on SPARK-42647: -- User 'aimtsou' has created a pull request for this issue: https://github.com/apache/spark/pull/40220
[jira] [Assigned] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42647: Assignee: Apache Spark
[jira] [Resolved] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42639. --- Fix Version/s: 3.4.1 Resolution: Fixed Issue resolved by pull request 40242 [https://github.com/apache/spark/pull/40242]

> Add createDataFrame/createDataset to SparkSession
> ---
>
> Key: SPARK-42639
> URL: https://issues.apache.org/jira/browse/SPARK-42639
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Herman van Hövell
> Assignee: Herman van Hövell
> Priority: Major
> Fix For: 3.4.1
>
> Add createDataFrame/createDataset to SparkSession
[jira] [Updated] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aimilios Tsouvelekakis updated SPARK-42647: --- Description: Numpy has started changing the aliases of some of its data types. This means that users with the latest version of numpy will face either warnings or errors, depending on the type they are using. This affects all users on numpy > 1.20.0. One of the types was fixed back in September with this [pull request|https://github.com/apache/spark/pull/37817].
The problem can be split into two cases:
[numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0) as well as np.bool8 are now deprecated and will eventually be removed. As of numpy 1.25.0 they give a warning.
[numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using aliases of builtin types like np.int is deprecated since numpy 1.20.0 and removed since numpy 1.24.0.
The changes are needed so pyspark can be compatible with the latest numpy and avoid
* attribute errors on data types deprecated in version 1.20.0: [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
* warnings on data types deprecated in version 1.24.0: [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations]
From my research I see the following: the only functional changes are in the conversion.py file. The rest of the changes are inside tests, in the user guide, or in docstrings describing specific functions. Since I am not an expert in these tests, I defer to the reviewers and to people with more experience in the pyspark code.
These types are aliases for classic python types, so they should work with all numpy versions [1|https://numpy.org/devdocs/release/1.20.0-notes.html], [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python]. The error or warning comes from the call to numpy.
For the affected versions I chose to include 3.3 and onwards, but I see that 3.2 is also still in the 18-month maintenance cadence, as it was released in October 2021.
The pull request: [https://github.com/apache/spark/pull/40220]
Best Regards,
Aimilios
[jira] [Created] (SPARK-42647) Remove aliases from deprecated numpy data types
Aimilios Tsouvelekakis created SPARK-42647: -- Summary: Remove aliases from deprecated numpy data types Key: SPARK-42647 URL: https://issues.apache.org/jira/browse/SPARK-42647 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.4.0, 3.4.1 Reporter: Aimilios Tsouvelekakis
[jira] [Commented] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695498#comment-17695498 ] Dongjoon Hyun commented on SPARK-42553: --- Since RC2 tag is created, I changed the Fixed Version from 3.4.0 to 3.4.1 for now. We can adjust it later according to the RC2 result. > NonReserved keyword "interval" can't be column name > --- > > Key: SPARK-42553 > URL: https://issues.apache.org/jira/browse/SPARK-42553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.2.3, 3.3.2 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Assignee: jiang13021 >Priority: Major > Fix For: 3.4.1 > > > INTERVAL is a Non-Reserved keyword in spark. "Non-Reserved keywords" have a > special meaning in particular contexts and can be used as identifiers in > other contexts. So by design, interval can be used as a column name. > {code:java} > scala> spark.sql("select interval from mytable") > org.apache.spark.sql.catalyst.parser.ParseException: > at least one time unit should be given for interval literal(line 1, pos 7)== > SQL == > select interval from mytable > ---^^^ at > org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$parseIntervalLiteral$1(AstBuilder.scala:2481) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.parseIntervalLiteral(AstBuilder.scala:2466) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitInterval$1(AstBuilder.scala:2432) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:2431) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalContext.accept(SqlBaseParser.java:17308) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitIntervalLiteral(SqlBaseBaseVisitor.java:1581) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalLiteralContext.accept(SqlBaseParser.java:16929) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitConstantDefault(SqlBaseBaseVisitor.java:1511) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:15905) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitValueExpressionDefault(SqlBaseBaseVisitor.java:1392) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:15298) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPredicated$1(AstBuilder.scala:1548) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:1547) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$PredicatedContext.accept(SqlBaseParser.java:14745) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > 
org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitExpression(SqlBaseBaseVisitor.java:1343) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExpressionContext.accept(SqlBaseParser.java:14606) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitNamedExpression$1(AstBuilder.scala:1434) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:1433) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:57) > at >
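Until the parser fix lands, the usual escape for this parse error is to backtick-quote the identifier. The helper below is a hypothetical sketch (`quote_ident` and the `mytable` name are illustrative, not part of any Spark API) of how a non-reserved keyword such as `interval` can be passed safely into a Spark SQL string.

```python
# Hypothetical helper (not a Spark API): backtick-quote an identifier
# so that a non-reserved keyword like "interval" parses as a column
# name rather than the start of an interval literal.
def quote_ident(name: str) -> str:
    # Spark SQL escapes a literal backtick inside a quoted
    # identifier by doubling it.
    return "`" + name.replace("`", "``") + "`"

# "SELECT interval FROM mytable" raises the ParseException quoted
# above, but the backtick-quoted form is accepted by the parser:
query = f"SELECT {quote_ident('interval')} FROM mytable"
print(query)  # SELECT `interval` FROM mytable
```

This sidesteps the bug from the user side; the actual fix in the linked pull request changes the grammar so the unquoted form also works.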
[jira] [Updated] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42553: -- Fix Version/s: 3.4.1 (was: 3.4.0)
[jira] [Updated] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42644: -- Fix Version/s: 3.4.1 (was: 3.4.0)

> Add `hive` dependency to `connect` module
> ---
>
> Key: SPARK-42644
> URL: https://issues.apache.org/jira/browse/SPARK-42644
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Fix For: 3.4.1
[jira] [Resolved] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42644. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40246 [https://github.com/apache/spark/pull/40246]
[jira] [Assigned] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42644: - Assignee: Dongjoon Hyun
[jira] [Resolved] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42553. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40195 [https://github.com/apache/spark/pull/40195]
[jira] [Assigned] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42553: Assignee: jiang13021 > NonReserved keyword "interval" can't be column name > --- > > Key: SPARK-42553 > URL: https://issues.apache.org/jira/browse/SPARK-42553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.2.3, 3.3.2 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Assignee: jiang13021 >Priority: Major > > INTERVAL is a Non-Reserved keyword in spark. "Non-Reserved keywords" have a > special meaning in particular contexts and can be used as identifiers in > other contexts. So by design, interval can be used as a column name. > {code:java} > scala> spark.sql("select interval from mytable") > org.apache.spark.sql.catalyst.parser.ParseException: > at least one time unit should be given for interval literal(line 1, pos 7)== > SQL == > select interval from mytable > ---^^^ at > org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$parseIntervalLiteral$1(AstBuilder.scala:2481) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.parseIntervalLiteral(AstBuilder.scala:2466) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitInterval$1(AstBuilder.scala:2432) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:2431) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalContext.accept(SqlBaseParser.java:17308) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitIntervalLiteral(SqlBaseBaseVisitor.java:1581) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalLiteralContext.accept(SqlBaseParser.java:16929) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitConstantDefault(SqlBaseBaseVisitor.java:1511) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:15905) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitValueExpressionDefault(SqlBaseBaseVisitor.java:1392) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:15298) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPredicated$1(AstBuilder.scala:1548) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:1547) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$PredicatedContext.accept(SqlBaseParser.java:14745) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitExpression(SqlBaseBaseVisitor.java:1343) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExpressionContext.accept(SqlBaseParser.java:14606) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitNamedExpression$1(AstBuilder.scala:1434) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:1433) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$NamedExpressionContext.accept(SqlBaseParser.java:14124) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at
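The traces above show the parser consuming a bare `interval` as the start of an interval literal rather than as a column reference. Independent of the fix, the standard workaround is to quote the identifier (backticks in Spark SQL: SELECT `interval` FROM mytable). The sketch below illustrates the same keyword-vs-identifier quoting idea with Python's stdlib sqlite3, used here only because it needs no Spark session; the reserved word (`order`) and double-quote quoting are SQLite conventions, an analogy rather than Spark behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Quoting lets a keyword be used as a column name
conn.execute('CREATE TABLE mytable ("order" INTEGER)')
conn.execute("INSERT INTO mytable VALUES (7)")

# Unquoted, the parser reads the keyword, not the column -- analogous to
# Spark's "at least one time unit should be given for interval literal"
try:
    conn.execute("SELECT order FROM mytable")
except sqlite3.OperationalError as e:
    print("parse error:", e)

# Quoted, the same name parses as an ordinary identifier
print(conn.execute('SELECT "order" FROM mytable').fetchone())
```

In Spark SQL the quoting mechanism is backticks rather than double quotes, but the principle is identical: quoting forces the token to be read as an identifier.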
[jira] [Updated] (SPARK-42642) Make Python the first code example tab in the Spark documentation
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Summary: Make Python the first code example tab in the Spark documentation (was: Make Python the first code example tab) > Make Python the first code example tab in the Spark documentation > - > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language, so it should be the > default language in code examples. This change makes Python the first code > example tab consistently across the documentation, where applicable. > This is continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > where these two pages were updated: > [https://spark.apache.org/docs/latest/sql-getting-started.html] > [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] > > Pages being updated now: > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-migration-guide.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > 
[https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > > [https://spark.apache.org/docs/latest/quick-start.html] > > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > > [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] > [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html] > [https://spark.apache.org/docs/latest/sql-data-sources-csv.html] > [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html] > [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html] > [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] > [https://spark.apache.org/docs/latest/sql-data-sources-json.html] > [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] > sql-data-sources-protobuf.html > [https://spark.apache.org/docs/latest/sql-data-sources-text.html] > [https://spark.apache.org/docs/latest/sql-migration-guide.html] > [https://spark.apache.org/docs/latest/sql-performance-tuning.html] > [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] > > [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > > [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] > 
[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples so this makes Python the first code example tab consistently across the documentation, where applicable. This is continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 where these two pages were updated: [https://spark.apache.org/docs/latest/sql-getting-started.html] [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] Pages being updated now: [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] 
[https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/quick-start.html] [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html] [https://spark.apache.org/docs/latest/sql-data-sources-csv.html] [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html] [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html] [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] [https://spark.apache.org/docs/latest/sql-data-sources-json.html] [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] sql-data-sources-protobuf.html [https://spark.apache.org/docs/latest/sql-data-sources-text.html] [https://spark.apache.org/docs/latest/sql-migration-guide.html] [https://spark.apache.org/docs/latest/sql-performance-tuning.html] [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] was: Python is the most approachable and most popular language so it should be the default language in code examples so this makes Python the first code example tab consistently across the documentation, where applicable. 
This is continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 where these two pages were updated: [https://spark.apache.org/docs/latest/sql-getting-started.html] [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] Pages being updated now: [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-data-types.html]
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples so this makes Python the first code example tab consistently across the documentation, where applicable. This is continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 where these two pages were updated: [https://spark.apache.org/docs/latest/sql-getting-started.html] [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] Pages being updated now: [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] 
[https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/quick-start.html] [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html] [https://spark.apache.org/docs/latest/sql-data-sources-csv.html] [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html] [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html] [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] [https://spark.apache.org/docs/latest/sql-data-sources-json.html] [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] sql-data-sources-protobuf.md [https://spark.apache.org/docs/latest/sql-data-sources-text.html] [https://spark.apache.org/docs/latest/sql-migration-guide.html] [https://spark.apache.org/docs/latest/sql-performance-tuning.html] [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] was: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. 
Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html]
[jira] [Commented] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695465#comment-17695465 ] Apache Spark commented on SPARK-42646: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40247 > Upgrade cyclonedx from 2.7.3 to 2.7.5 > > > Key: SPARK-42646 > URL: https://issues.apache.org/jira/browse/SPARK-42646 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > > !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png!
[jira] [Assigned] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42646: Assignee: Apache Spark > Upgrade cyclonedx from 2.7.3 to 2.7.5 > > > Key: SPARK-42646 > URL: https://issues.apache.org/jira/browse/SPARK-42646 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > > !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png!
[jira] [Assigned] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42646: Assignee: (was: Apache Spark) > Upgrade cyclonedx from 2.7.3 to 2.7.5 > > > Key: SPARK-42646 > URL: https://issues.apache.org/jira/browse/SPARK-42646 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > > !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png!
[jira] [Assigned] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42644: Assignee: (was: Apache Spark) > Add `hive` dependency to `connect` module > - > > Key: SPARK-42644 > URL: https://issues.apache.org/jira/browse/SPARK-42644 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Assigned] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42644: Assignee: Apache Spark > Add `hive` dependency to `connect` module > - > > Key: SPARK-42644 > URL: https://issues.apache.org/jira/browse/SPARK-42644 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42644) Add `hive` dependency to `connect` module
[ https://issues.apache.org/jira/browse/SPARK-42644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695464#comment-17695464 ] Apache Spark commented on SPARK-42644: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40246 > Add `hive` dependency to `connect` module > - > > Key: SPARK-42644 > URL: https://issues.apache.org/jira/browse/SPARK-42644 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Updated] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-42646: Description: !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png! (was: !image-2023-03-02-13-07-01-579.png!) > Upgrade cyclonedx from 2.7.3 to 2.7.5 > > > Key: SPARK-42646 > URL: https://issues.apache.org/jira/browse/SPARK-42646 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > > !https://user-images.githubusercontent.com/15246973/222338040-d7c8d595-be0b-40bb-af49-6b260dc0c425.png!
[jira] [Created] (SPARK-42646) Upgrade cyclonedx from 2.7.3 to 2.7.5
BingKun Pan created SPARK-42646: --- Summary: Upgrade cyclonedx from 2.7.3 to 2.7.5 Key: SPARK-42646 URL: https://issues.apache.org/jira/browse/SPARK-42646 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan !image-2023-03-02-13-07-01-579.png!
[jira] [Created] (SPARK-42645) Introduce feature to allow for function caching across input rows.
Michael Tong created SPARK-42645: Summary: Introduce feature to allow for function caching across input rows. Key: SPARK-42645 URL: https://issues.apache.org/jira/browse/SPARK-42645 Project: Spark Issue Type: Wish Components: Optimizer Affects Versions: 3.3.2 Reporter: Michael Tong Introduce the ability to make functions cacheable across input rows. I imagine this working similarly to Python's [functools.cache|https://docs.python.org/3/library/functools.html#functools.cache], where you could add a decorator to expensive functions that you know will regularly encounter repeated values as the input data is read. This feature could significantly speed up many real-world jobs that apply expensive functions to data that naturally has repeated column values. An example would be parsing user agent fields from internet traffic logs partitioned by user id. Even though the data is not sorted by user agent, a sample of 10k consecutive rows would contain far fewer than 10k unique values, because popular user agents appear in a large fraction of traffic and the user agent of a user's first event is likely shared by all of that user's subsequent events. Currently, an approximation of this can be hacked together in Python via pandas_udfs: pandas_udfs read input in batches of 10k rows by default, so you can use a caching UDF whose cache empties every 10k rows. At my current job I have noticed that this trick can significantly speed up queries where custom UDFs are the bottleneck.
An example of this is
{code:python}
@F.pandas_udf(T.StringType())
def parse_user_agent_field(user_agent_series):
    @functools.cache
    def parse_user_agent_field_helper(user_agent):
        # parse the user agent and return the relevant field
        return None
    return user_agent_series.apply(parse_user_agent_field_helper)
{code}
It would be nice if there were official support for this behavior for both built-in functions and UDFs. If there were, I'd imagine it looking something like
{code:python}
# using the PySpark DataFrame API
df = df.withColumn(output_col, F.cache(F.function)(input_col))
{code}
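The batching trick described above can be seen in miniature with plain functools.cache: when a column has few distinct values, only the distinct values pay the parsing cost. A minimal stdlib-only sketch (the user-agent strings and the split-based "parse" are made up for illustration, standing in for a genuinely expensive parser):

```python
import functools

@functools.cache
def parse_browser(user_agent: str) -> str:
    # stand-in for an expensive user-agent parse; extracts the product token
    return user_agent.split("/", 1)[0]

# 3,000 rows but only 3 distinct user agents, as in partitioned traffic logs
rows = ["Chrome/110.0", "Safari/16.3", "Mobile/15E148"] * 1000
browsers = [parse_browser(ua) for ua in rows]

info = parse_browser.cache_info()
print(info.misses, "parses for", len(rows), "rows")  # 3 parses for 3000 rows
```

The cache never empties here; the pandas_udf version above gets an implicit bound because each 10k-row batch rebuilds the inner cache, which is what keeps memory in check on high-cardinality data.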
[jira] [Created] (SPARK-42644) Add `hive` dependency to `connect` module
Dongjoon Hyun created SPARK-42644: - Summary: Add `hive` dependency to `connect` module Key: SPARK-42644 URL: https://issues.apache.org/jira/browse/SPARK-42644 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 3.4.0 Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-42521) Add NULL values for INSERT commands with user-specified lists of fewer columns than the target table
[ https://issues.apache.org/jira/browse/SPARK-42521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-42521: -- Assignee: Daniel > Add NULL values for INSERT commands with user-specified lists of fewer > columns than the target table > > > Key: SPARK-42521 > URL: https://issues.apache.org/jira/browse/SPARK-42521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major >
[jira] [Resolved] (SPARK-42521) Add NULL values for INSERT commands with user-specified lists of fewer columns than the target table
[ https://issues.apache.org/jira/browse/SPARK-42521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-42521. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40229 [https://github.com/apache/spark/pull/40229] > Add NULL values for INSERT commands with user-specified lists of fewer > columns than the target table > > > Key: SPARK-42521 > URL: https://issues.apache.org/jira/browse/SPARK-42521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.4.0 > >
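The behavior SPARK-42521 adds, where an INSERT whose column list names fewer columns than the target table fills the omitted columns with NULL, mirrors what most SQL engines already do. A quick illustration using Python's stdlib sqlite3 (an analogy for the semantics; the Spark change applies this to Spark's own INSERT path, and the table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (a INTEGER, b TEXT, c REAL)")

# The column list names only `a`; the engine fills b and c with NULL
conn.execute("INSERT INTO target (a) VALUES (42)")

print(conn.execute("SELECT a, b, c FROM target").fetchone())  # (42, None, None)
```

Columns omitted from the list take their defaults, and with no DEFAULT clause declared, that default is NULL, which is the same rule the Spark sub-task implements.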
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/ml-migration-guide.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] 
[https://spark.apache.org/docs/latest/mllib-feature-extraction.html] [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] [https://spark.apache.org/docs/latest/quick-start.html] was: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] 
[https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: >
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Attachment: Screenshot 2023-03-01 at 8.10.22 PM.png > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. > Pages being updated: > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > [https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > 
[https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Attachment: Screenshot 2023-03-01 at 8.10.08 PM.png > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. > Pages being updated: > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > [https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > 
[https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] 
[https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] was: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] 
[https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. > Pages being updated: >
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Description: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. Pages being updated: [https://spark.apache.org/docs/latest/rdd-programming-guide.html] [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] [https://spark.apache.org/docs/latest/streaming-programming-guide.html] [https://spark.apache.org/docs/latest/ml-statistics.html] [https://spark.apache.org/docs/latest/ml-datasource.html] [https://spark.apache.org/docs/latest/ml-pipeline.html] [https://spark.apache.org/docs/latest/ml-features.html] [https://spark.apache.org/docs/latest/ml-classification-regression.html] [https://spark.apache.org/docs/latest/ml-clustering.html] [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/ml-tuning.html] [https://spark.apache.org/docs/latest/mllib-data-types.html] [https://spark.apache.org/docs/latest/mllib-statistics.html] [https://spark.apache.org/docs/latest/mllib-linear-methods.html] [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] [https://spark.apache.org/docs/latest/mllib-decision-tree.html] [https://spark.apache.org/docs/latest/mllib-ensembles.html] [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] [https://spark.apache.org/docs/latest/mllib-clustering.html] [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] 
[https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] was: Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable. > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. 
> Pages being updated: > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > [https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > [https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > > -- This message was sent by Atlassian Jira
[jira] [Commented] (SPARK-42643) Implement `spark.udf.registerJavaFunction`
[ https://issues.apache.org/jira/browse/SPARK-42643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695429#comment-17695429 ] Apache Spark commented on SPARK-42643: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/40244 > Implement `spark.udf.registerJavaFunction` > -- > > Key: SPARK-42643 > URL: https://issues.apache.org/jira/browse/SPARK-42643 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `spark.udf.registerJavaFunction`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42643) Implement `spark.udf.registerJavaFunction`
[ https://issues.apache.org/jira/browse/SPARK-42643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42643: Assignee: Apache Spark > Implement `spark.udf.registerJavaFunction` > -- > > Key: SPARK-42643 > URL: https://issues.apache.org/jira/browse/SPARK-42643 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Implement `spark.udf.registerJavaFunction`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42643) Implement `spark.udf.registerJavaFunction`
[ https://issues.apache.org/jira/browse/SPARK-42643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42643: Assignee: (was: Apache Spark) > Implement `spark.udf.registerJavaFunction` > -- > > Key: SPARK-42643 > URL: https://issues.apache.org/jira/browse/SPARK-42643 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `spark.udf.registerJavaFunction`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695428#comment-17695428 ] Apache Spark commented on SPARK-41823: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40245 > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in 
_handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42643) Implement `spark.udf.registerJavaFunction`
Xinrong Meng created SPARK-42643: Summary: Implement `spark.udf.registerJavaFunction` Key: SPARK-42643 URL: https://issues.apache.org/jira/browse/SPARK-42643 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `spark.udf.registerJavaFunction`.
[jira] [Updated] (SPARK-42642) Make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Folting updated SPARK-42642: -- Summary: Make Python the first code example tab (was: Make Python the first code example tab - ) > Make Python the first code example tab > -- > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Priority: Major > > Python is the most approachable and most popular language so it should be the > default language in code examples. > Continuing the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > Making Python the first code example tab consistently across the > documentation, where applicable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42642) Make Python the first code example tab -
Allan Folting created SPARK-42642: - Summary: Make Python the first code example tab - Key: SPARK-42642 URL: https://issues.apache.org/jira/browse/SPARK-42642 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 3.5.0 Reporter: Allan Folting Python is the most approachable and most popular language so it should be the default language in code examples. Continuing the work started with: https://issues.apache.org/jira/browse/SPARK-42493 Making Python the first code example tab consistently across the documentation, where applicable.
[jira] [Updated] (SPARK-39316) Merge PromotePrecision and CheckOverflow into decimal binary arithmetic
[ https://issues.apache.org/jira/browse/SPARK-39316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-39316: -- Description:
Fix a bug in `TypeCoercion`, for example:
{code:java}
SELECT CAST(1 AS DECIMAL(28, 2))
UNION ALL
SELECT CAST(1 AS DECIMAL(18, 2)) / CAST(1 AS DECIMAL(18, 2));
{code}
The union result data type is not correct according to the formula:
|| Operation || Result Precision || Result Scale ||
| e1 union e2 | max(s1, s2) + max(p1-s1, p2-s2) | max(s1, s2) |
{code:java}
-- before
-- query schema
decimal(28,2)
-- query output
1.00
1.00

-- after
-- query schema
decimal(38,20)
-- query output
1.00000000000000000000
1.00000000000000000000
{code}

was:
Merge {{PromotePrecision}} into {{dataType}}, for example, {{Add}}:
{code:java}
override def dataType: DataType = (left, right) match {
  case (DecimalType.Expression(p1, s1), DecimalType.Expression(p2, s2)) =>
    val resultScale = max(s1, s2)
    if (allowPrecisionLoss) {
      DecimalType.adjustPrecisionScale(max(p1 - s1, p2 - s2) + resultScale + 1, resultScale)
    } else {
      DecimalType.bounded(max(p1 - s1, p2 - s2) + resultScale + 1, resultScale)
    }
  case _ => super.dataType
}
{code}
Merge {{CheckOverflow}}, for example, {{Add}} eval:
{code:java}
dataType match {
  case decimalType: DecimalType =>
    val value = numeric.plus(input1, input2)
    checkOverflow(value.asInstanceOf[Decimal], decimalType)
  ...
}
{code}

> Merge PromotePrecision and CheckOverflow into decimal binary arithmetic
> ---
>
> Key: SPARK-39316
> URL: https://issues.apache.org/jira/browse/SPARK-39316
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.4.0
>
> Fix a bug in `TypeCoercion`, for example:
> {code:java}
> SELECT CAST(1 AS DECIMAL(28, 2))
> UNION ALL
> SELECT CAST(1 AS DECIMAL(18, 2)) / CAST(1 AS DECIMAL(18, 2));
> {code}
> The union result data type is not correct according to the formula:
> || Operation || Result Precision || Result Scale ||
> | e1 union e2 | max(s1, s2) + max(p1-s1, p2-s2) | max(s1, s2) |
> {code:java}
> -- before
> -- query schema
> decimal(28,2)
> -- query output
> 1.00
> 1.00
>
> -- after
> -- query schema
> decimal(38,20)
> -- query output
> 1.00000000000000000000
> 1.00000000000000000000
> {code}
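The decimal result-type rules referenced in SPARK-39316 can be sanity-checked with a short standalone sketch. This is our own minimal reimplementation of the formulas (not Spark's actual code), assuming Spark's defaults: a maximum precision of 38 and spark.sql.decimalOperations.allowPrecisionLoss=true; the helper names are ours.

```python
# Sketch of Spark's decimal result-type arithmetic (assumed defaults, not Spark source).
MAX_PRECISION = 38
MIN_ADJUSTED_SCALE = 6

def adjust(precision, scale):
    # Mirrors DecimalType.adjustPrecisionScale: when nominal precision exceeds 38,
    # give up fractional digits first, but keep at least min(scale, 6) of them.
    if precision <= MAX_PRECISION:
        return precision, scale
    int_digits = precision - scale
    min_scale = min(scale, MIN_ADJUSTED_SCALE)
    adjusted_scale = max(MAX_PRECISION - int_digits, min_scale)
    return MAX_PRECISION, adjusted_scale

def divide(p1, s1, p2, s2):
    # e1 / e2 -> scale max(6, s1 + p2 + 1), precision p1 - s1 + s2 + scale, then adjusted.
    scale = max(MIN_ADJUSTED_SCALE, s1 + p2 + 1)
    return adjust(p1 - s1 + s2 + scale, scale)

def union(p1, s1, p2, s2):
    # e1 UNION e2 -> the wider type from the table above, simply capped at 38.
    scale = max(s1, s2)
    precision = scale + max(p1 - s1, p2 - s2)
    return min(precision, MAX_PRECISION), min(scale, MAX_PRECISION)

# DECIMAL(18, 2) / DECIMAL(18, 2): nominal (39, 21) adjusts down to (38, 20).
print(divide(18, 2, 18, 2))
# DECIMAL(28, 2) UNION the division result (38, 20): capped at (38, 20),
# matching the corrected query schema in the issue.
print(union(28, 2, 38, 20))
```

Running the two checks reproduces the decimal(38,20) schema shown in the "after" output of the issue, and union(28, 2, 18, 2) gives (28, 2), matching the "before" schema where the division's widened type was not yet accounted for.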
[jira] [Assigned] (SPARK-42641) Upgrade buf to v1.15.0
[ https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42641: Assignee: (was: Apache Spark) > Upgrade buf to v1.15.0 > -- > > Key: SPARK-42641 > URL: https://issues.apache.org/jira/browse/SPARK-42641 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42641) Upgrade buf to v1.15.0
[ https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42641: Assignee: Apache Spark > Upgrade buf to v1.15.0 > -- > > Key: SPARK-42641 > URL: https://issues.apache.org/jira/browse/SPARK-42641 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42641) Upgrade buf to v1.15.0
[ https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695404#comment-17695404 ] Apache Spark commented on SPARK-42641: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40243 > Upgrade buf to v1.15.0 > -- > > Key: SPARK-42641 > URL: https://issues.apache.org/jira/browse/SPARK-42641 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42641) Upgrade buf to v1.15.0
Ruifeng Zheng created SPARK-42641: - Summary: Upgrade buf to v1.15.0 Key: SPARK-42641 URL: https://issues.apache.org/jira/browse/SPARK-42641 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-42631: -- Epic Link: SPARK-42554 > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Assignee: Tom van Bussel >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42631. --- Fix Version/s: 3.4.1 Assignee: Tom van Bussel Resolution: Fixed > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Assignee: Tom van Bussel >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42640) Remove stale entries from the excluding rules for CompabilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42640: Assignee: Apache Spark (was: Rui Wang) > Remove stale entries from the excluding rules for CompabilitySuite > -- > > Key: SPARK-42640 > URL: https://issues.apache.org/jira/browse/SPARK-42640 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42640) Remove stale entries from the excluding rules for CompabilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42640: Assignee: Rui Wang (was: Apache Spark) > Remove stale entries from the excluding rules for CompabilitySuite > -- > > Key: SPARK-42640 > URL: https://issues.apache.org/jira/browse/SPARK-42640 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42640) Remove stale entries from the excluding rules for CompabilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695387#comment-17695387 ] Apache Spark commented on SPARK-42640: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40241 > Remove stale entries from the excluding rules for CompabilitySuite > -- > > Key: SPARK-42640 > URL: https://issues.apache.org/jira/browse/SPARK-42640 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695386#comment-17695386 ] Apache Spark commented on SPARK-42639: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40242 > Add createDataFrame/createDataset to SparkSession > - > > Key: SPARK-42639 > URL: https://issues.apache.org/jira/browse/SPARK-42639 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42639: Assignee: Apache Spark (was: Herman van Hövell) > Add createDataFrame/createDataset to SparkSession > - > > Key: SPARK-42639 > URL: https://issues.apache.org/jira/browse/SPARK-42639 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > > Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42639: Assignee: Herman van Hövell (was: Apache Spark) > Add createDataFrame/createDataset to SparkSession > - > > Key: SPARK-42639 > URL: https://issues.apache.org/jira/browse/SPARK-42639 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42640) Remove stale entries from the excluding rules for CompabilitySuite
Rui Wang created SPARK-42640: Summary: Remove stale entries from the excluding rules for CompabilitySuite Key: SPARK-42640 URL: https://issues.apache.org/jira/browse/SPARK-42640 Project: Spark Issue Type: Task Components: Connect Affects Versions: 3.4.1 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
Herman van Hövell created SPARK-42639: - Summary: Add createDataFrame/createDataset to SparkSession Key: SPARK-42639 URL: https://issues.apache.org/jira/browse/SPARK-42639 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42639) Add createDataFrame/createDataset to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-42639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-42639: - Assignee: Herman van Hövell > Add createDataFrame/createDataset to SparkSession > - > > Key: SPARK-42639 > URL: https://issues.apache.org/jira/browse/SPARK-42639 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Add createDataFrame/createDataset to SparkSession -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42493. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40087 [https://github.com/apache/spark/pull/40087] > Spark SQL, DataFrames and Datasets Guide - make Python the first code example > tab > - > > Key: SPARK-42493 > URL: https://issues.apache.org/jira/browse/SPARK-42493 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Assignee: Allan Folting >Priority: Major > Fix For: 3.5.0 > > > Python is the easiest approachable and most popular language so it should be > the primary language in examples etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first code example tab
[ https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42493: Assignee: Allan Folting > Spark SQL, DataFrames and Datasets Guide - make Python the first code example > tab > - > > Key: SPARK-42493 > URL: https://issues.apache.org/jira/browse/SPARK-42493 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Assignee: Allan Folting >Priority: Major > > Python is the easiest approachable and most popular language so it should be > the primary language in examples etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42613. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40212 [https://github.com/apache/spark/pull/40212] > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > Fix For: 3.5.0 > > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
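The precedence described in SPARK-42613 — an explicit OMP_NUM_THREADS wins, otherwise default to `spark.task.cpus` rather than `spark.executor.cores` — can be sketched as a small helper. This is a hypothetical illustration (the function name and dict-based conf are invented here); the actual change lives in Spark's PythonRunner:

```python
import os

def resolve_omp_num_threads(conf, env=None):
    """Pick the OpenMP thread count for a Python worker process.

    An explicitly set OMP_NUM_THREADS always wins. Otherwise fall back to
    spark.task.cpus (the CPUs one task may use), not spark.executor.cores,
    which is shared by all tasks running concurrently on the executor.
    """
    env = os.environ if env is None else env
    if "OMP_NUM_THREADS" in env:
        return int(env["OMP_NUM_THREADS"])
    return int(conf.get("spark.task.cpus", "1"))
```

With `spark.task.cpus=2` and `spark.executor.cores=8`, the old default would let each of four concurrent tasks spawn eight OpenMP threads; the new default caps each at two.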
[jira] [Assigned] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42613: Assignee: John Zhuge > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42632) Fix scala paths in tests
[ https://issues.apache.org/jira/browse/SPARK-42632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42632. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40235 [https://github.com/apache/spark/pull/40235] > Fix scala paths in tests > > > Key: SPARK-42632 > URL: https://issues.apache.org/jira/browse/SPARK-42632 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > The jar resolution in the connect client tests can resolve the jar for the > wrong scala version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42637) Add SparkSession.stop
[ https://issues.apache.org/jira/browse/SPARK-42637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42637. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40239 [https://github.com/apache/spark/pull/40239] > Add SparkSession.stop > - > > Key: SPARK-42637 > URL: https://issues.apache.org/jira/browse/SPARK-42637 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Add SparkSession.stop() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42458: Assignee: Takuya Ueshin > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42458. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40240 [https://github.com/apache/spark/pull/40240] > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
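The failing doctest above passes a flat DDL string ("age INT, name STRING") as the schema, which sidesteps type inference on the `None` value. As a rough illustration of what accepting such a string involves, here is a toy parser that splits a flat DDL schema into (name, type) pairs — hypothetical code only; the real implementation delegates to Spark's DDL parser and handles nested and complex types:

```python
def parse_ddl_fields(ddl):
    """Split a flat DDL schema string like 'age INT, name STRING' into
    (name, type) pairs. Toy sketch: no nested types, quoting, or comments."""
    fields = []
    for part in ddl.split(","):
        # First whitespace-separated token is the column name, the rest is the type.
        name, typ = part.strip().split(None, 1)
        fields.append((name, typ.strip().upper()))
    return fields
```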
[jira] [Assigned] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42458: Assignee: Apache Spark > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695351#comment-17695351 ] Apache Spark commented on SPARK-42458: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40240 > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42458) createDataFrame should support DDL string as schema
[ https://issues.apache.org/jira/browse/SPARK-42458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42458: Assignee: (was: Apache Spark) > createDataFrame should support DDL string as schema > --- > > Key: SPARK-42458 > URL: https://issues.apache.org/jira/browse/SPARK-42458 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > > {code:python} > File "/.../python/pyspark/sql/connect/readwriter.py", line 393, in > pyspark.sql.connect.readwriter.DataFrameWriter.option > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with 'nullValue' option set to > 'Hyukjin Kwon'. > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > df.write.option("nullValue", "Hyukjin > Kwon").mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame. > spark.read.schema(df.schema).format('csv').load(d).show() > Exception raised: > Traceback (most recent call last): > File "/.../lib/python3.9/doctest.py", line 1334, in __run > exec(compile(example.source, filename, "single", > File " pyspark.sql.connect.readwriter.DataFrameWriter.option[2]>", line 3, in > > df = spark.createDataFrame([(100, None)], "age INT, name STRING") > File "/.../python/pyspark/sql/connect/session.py", line 312, in > createDataFrame > raise ValueError( > ValueError: Some of types cannot be determined after inferring, a > StructType Schema is required in this case > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40159) Aggregate should be group only after collapse project to aggregate
[ https://issues.apache.org/jira/browse/SPARK-40159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695327#comment-17695327 ] Ritika Maheshwari commented on SPARK-40159: --- This issue seems to have been resolved by SPARK-38489 > Aggregate should be group only after collapse project to aggregate > -- > > Key: SPARK-40159 > URL: https://issues.apache.org/jira/browse/SPARK-40159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > > CollapseProject rule will merge project expressions into AggregateExpressions > in aggregate, which will make the *aggregate.groupOnly* to false. > {code} > val df = testData.distinct().select('key + 1, ('key + 1).cast("long")) > df.queryExecution.optimizedPlan.collect { > case a: Aggregate => a > }.foreach(agg => assert(agg.groupOnly === true)) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42638) current_user() is blocked from VALUES, but current_timestamp() is not
Serge Rielau created SPARK-42638: Summary: current_user() is blocked from VALUES, but current_timestamp() is not Key: SPARK-42638 URL: https://issues.apache.org/jira/browse/SPARK-42638 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Serge Rielau VALUES(current_user()); returns: cannot evaluate expression current_user() in inline table definition.; line 1 pos 8 The same statement with current_timestamp() works. It appears current_user() is recognized as non-deterministic. But it is constant within the statement, just like current_timestamp(). PS: It's not clear why we block non-deterministic functions to begin with. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
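The distinction the report draws — "non-deterministic" yet constant within a statement — is the usual semantics for functions like current_user() and current_timestamp(): they are snapshotted once when the statement starts, so every reference inside the statement agrees. A minimal sketch of that snapshotting (hypothetical class, not Spark's analyzer):

```python
import datetime

class StatementContext:
    """Per-statement snapshot: statement-constant functions are evaluated
    once at statement start, so all references within the statement see
    the same value, even though two statements may see different ones."""
    def __init__(self, user):
        self.user = user
        self.timestamp = datetime.datetime.now(datetime.timezone.utc)

    def current_user(self):
        return self.user

    def current_timestamp(self):
        return self.timestamp

ctx = StatementContext(user="serge")
# Every call within one "statement" returns the identical snapshot.
```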
[jira] [Commented] (SPARK-42633) Use the actual schema in a LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-42633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695280#comment-17695280 ] Apache Spark commented on SPARK-42633: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40238 > Use the actual schema in a LocalRelation > > > Key: SPARK-42633 > URL: https://issues.apache.org/jira/browse/SPARK-42633 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Make the LocalRelation proto take an actual schema message instead of a > string. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42637) Add SparkSession.stop
[ https://issues.apache.org/jira/browse/SPARK-42637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695279#comment-17695279 ] Apache Spark commented on SPARK-42637: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40239 > Add SparkSession.stop > - > > Key: SPARK-42637 > URL: https://issues.apache.org/jira/browse/SPARK-42637 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Add SparkSession.stop() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42635: Assignee: (was: Apache Spark) > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. 
> {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking > overflow. Note that we do have overflow checking in adding the amount to the > timestamp, so the behavior is inconsistent. > This can cause counter-intuitive results like this: > {code:scala} > scala> spark.sql("select timestampadd(quarter, 1431655764, > timestamp'1970-01-01')").show > +--+ > |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')| > +--+ > | 1969-09-01 00:00:00| > +--+{code} > 3. Adding sub-month units (week, day, hour, minute, second, millisecond, > microsecond) silently ignores Long overflow during unit conversion. > This is similar to the previous problem: > {code:scala} > scala> spark.sql("select timestampadd(day, 106751992, > timestamp'1970-01-01')").show(false) > +-+ > |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')| > +-+ > |-290308-12-22 15:58:10.448384| > +-+{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
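The DST discontinuity described in SPARK-42635 can be reproduced with the Python standard library's zoneinfo: adding real elapsed time (via UTC) to 2011-03-12 03:00:00 in America/Los_Angeles lands on 03:59:59 for 24*3600 - 1 seconds but on 04:00:00 for 24*3600 seconds, whereas naive wall-clock arithmetic — which is what the buggy day/time-in-day split effectively computes — yields 03:00:00, an hour earlier. A sketch (assumes the tz database is available):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

LA = ZoneInfo("America/Los_Angeles")
start = datetime(2011, 3, 12, 3, 0, tzinfo=LA)  # the day before the DST jump

def add_elapsed(ts, seconds):
    """Add real elapsed time: go through UTC so the DST transition is honored."""
    return (ts.astimezone(timezone.utc) + timedelta(seconds=seconds)).astimezone(LA)

almost_a_day = add_elapsed(start, 24 * 3600 - 1)  # 2011-03-13 03:59:59 PDT
full_day = add_elapsed(start, 24 * 3600)          # 2011-03-13 04:00:00 PDT

# Naive wall-clock arithmetic (same naive date, timezone ignored during the
# addition) gives 03:00:00 — before the (24*3600 - 1) result above.
wall_clock = start + timedelta(seconds=24 * 3600)
```

Only 23 * 3600 seconds of real time elapse between 03:00:00 on the two days, which is why the monotonic (elapsed-time) answer for a full 24 * 3600 seconds is 04:00:00, not 03:00:00.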
[jira] [Commented] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695278#comment-17695278 ] Apache Spark commented on SPARK-42635: -- User 'chenhao-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40237 > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. 
> The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. > {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking > overflow. Note that we do have overflow checking in adding the amount to the > timestamp, so the behavior is inconsistent. > This can cause counter-intuitive results like this: > {code:scala} > scala> spark.sql("select timestampadd(quarter, 1431655764, > timestamp'1970-01-01')").show > +--+ > |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')| > +--+ > | 1969-09-01 00:00:00| > +--+{code} > 3. Adding sub-month units (week, day, hour, minute, second, millisecond, > microsecond) silently ignores Long overflow during unit conversion. > This is similar to the previous problem: > {code:scala} > scala> spark.sql("select timestampadd(day, 106751992, > timestamp'1970-01-01')").show(false) > +-+ > |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')| > +-+ > |-290308-12-22 15:58:10.448384| > +-+{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42635: Assignee: Apache Spark > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Assignee: Apache Spark >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. 
> The root cause is
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]:
> {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking in adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42637) Add SparkSession.stop
Herman van Hövell created SPARK-42637: - Summary: Add SparkSession.stop Key: SPARK-42637 URL: https://issues.apache.org/jira/browse/SPARK-42637 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Assignee: Herman van Hövell Add SparkSession.stop() -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42636) Audit annotation usage
Herman van Hövell created SPARK-42636:
-----------------------------------------

             Summary: Audit annotation usage
                 Key: SPARK-42636
                 URL: https://issues.apache.org/jira/browse/SPARK-42636
             Project: Spark
          Issue Type: New Feature
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Herman van Hövell

Annotation usage is not entirely consistent in the client. We should probably remove all Stable annotations and add a few DeveloperApi ones.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42634) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenhao Li updated SPARK-42634:
-------------------------------
    Description:     (was: # When the time is close to a daylight saving time transition, the result may be discontinuous and not monotonic.
We currently have:
{{scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------------+
|timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------------+
|                                                     2011-03-13 03:59:59|
+------------------------------------------------------------------------+
scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------+
|timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------+
|                                               2011-03-13 03:00:00|
+------------------------------------------------------------------+}}
In the second query, adding one more second sets the time back one hour instead. Plus, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to 2011-03-13 03:00:00, instead of 24 * 3600 seconds, due to the daylight saving time transition.
The root cause of the problem is that the Spark code at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790 wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day and time-in-day split before looking at the timezone.
2. Adding month, quarter, and year silently ignores Int overflow during unit conversion.
The root cause is https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246. quantity is multiplied by 3 or MONTHS_PER_YEAR without checking overflow. Note that we do have overflow checking in adding the amount to the timestamp, so the behavior is inconsistent.
This can cause counter-intuitive results like this:
{{scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
+------------------------------------------------------------------+
|timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
+------------------------------------------------------------------+
|                                               1969-09-01 00:00:00|
+------------------------------------------------------------------+}}
3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.
This is similar to the previous problem:
{{scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
+-------------------------------------------------------------+
|timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
+-------------------------------------------------------------+
|-290308-12-22 15:58:10.448384                                |
+-------------------------------------------------------------+}}
)

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42634
>                 URL: https://issues.apache.org/jira/browse/SPARK-42634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-42634) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenhao Li closed SPARK-42634.
------------------------------

This is a duplicate, created by mistake.

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42634
>                 URL: https://issues.apache.org/jira/browse/SPARK-42634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Priority: Major
>
> # When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
> We currently have:
> {{scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> |                                                     2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> |                                               2011-03-13 03:00:00|
> +------------------------------------------------------------------+}}
> In the second query, adding one more second sets the time back one hour
> instead. Plus, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to
> 2011-03-13 03:00:00, instead of 24 * 3600 seconds, due to the daylight saving
> time transition.
> The root cause of the problem is that the Spark code at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790
> wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day
> and time-in-day split before looking at the timezone.
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
> The root cause is
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246.
> quantity is multiplied by 3 or MONTHS_PER_YEAR without checking overflow.
> Note that we do have overflow checking in adding the amount to the timestamp,
> so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {{scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+}}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {{scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+}}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42634) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenhao Li resolved SPARK-42634.
--------------------------------
    Resolution: Fixed

duplicate

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42634
>                 URL: https://issues.apache.org/jira/browse/SPARK-42634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Priority: Major
>
> # When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
> We currently have:
> {{scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> |                                                     2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> |                                               2011-03-13 03:00:00|
> +------------------------------------------------------------------+}}
> In the second query, adding one more second sets the time back one hour
> instead. Plus, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to
> 2011-03-13 03:00:00, instead of 24 * 3600 seconds, due to the daylight saving
> time transition.
> The root cause of the problem is that the Spark code at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790
> wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day
> and time-in-day split before looking at the timezone.
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
> The root cause is
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246.
> quantity is multiplied by 3 or MONTHS_PER_YEAR without checking overflow.
> Note that we do have overflow checking in adding the amount to the timestamp,
> so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {{scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+}}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {{scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+}}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenhao Li updated SPARK-42635: --- Component/s: SQL (was: Spark Core) > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. 
> {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking in adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenhao Li updated SPARK-42635: --- Component/s: (was: SQL) > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major > > # When the time is close to daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second will set the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} seconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. 
> {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking in adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41843) Implement SparkSession.udf
[ https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-41843. --- Fix Version/s: 3.4.0 Resolution: Fixed > Implement SparkSession.udf > -- > > Key: SPARK-41843 > URL: https://issues.apache.org/jira/browse/SPARK-41843 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 2331, in pyspark.sql.connect.functions.call_udf > Failed example: > _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) > AttributeError: 'SparkSession' object has no attribute 'udf'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38735) Test the error class: INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-38735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695259#comment-17695259 ]

Apache Spark commented on SPARK-38735:
--------------------------------------

User 'the8thC' has created a pull request for this issue:
https://github.com/apache/spark/pull/40236

> Test the error class: INTERNAL_ERROR
> ------------------------------------
>
>                 Key: SPARK-38735
>                 URL: https://issues.apache.org/jira/browse/SPARK-38735
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Priority: Minor
>              Labels: starter
>
> Add tests for the error class *INTERNAL_ERROR* to QueryExecutionErrorsSuite.
> The tests should cover the exceptions thrown in QueryExecutionErrors:
> {code:scala}
>   def logicalHintOperatorNotRemovedDuringAnalysisError(): Throwable = {
>     new SparkIllegalStateException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(
>         "Internal error: logical hint operator should have been removed during analysis"))
>   }
>
>   def cannotEvaluateExpressionError(expression: Expression): Throwable = {
>     new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(s"Cannot evaluate expression: $expression"))
>   }
>
>   def cannotGenerateCodeForExpressionError(expression: Expression): Throwable = {
>     new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(s"Cannot generate code for expression: $expression"))
>   }
>
>   def cannotTerminateGeneratorError(generator: UnresolvedGenerator): Throwable = {
>     new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(s"Cannot terminate expression: $generator"))
>   }
>
>   def methodNotDeclaredError(name: String): Throwable = {
>     new SparkNoSuchMethodException(errorClass = "INTERNAL_ERROR",
>       messageParameters = Array(
>         s"""A method named "$name" is not declared in any enclosing class nor any supertype"""))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*:
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
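The three checks the ticket asks for (error class, entire message, sqlState when defined) follow a simple pattern. As a hypothetical illustration, here it is transliterated into Python; Spark's real suites are Scala, and the names below are made up for the sketch, not Spark's API:

```python
class SparkError(Exception):
    """Stand-in for Spark's error-class-carrying exceptions (illustrative)."""
    def __init__(self, error_class, message, sql_state=None):
        super().__init__(message)
        self.error_class = error_class
        self.sql_state = sql_state

def check_error(thunk, expected_class, expected_message, expected_sql_state=None):
    """Run thunk; verify the error class, the entire message, and sqlState if given."""
    try:
        thunk()
    except SparkError as e:
        assert e.error_class == expected_class, e.error_class
        assert str(e) == expected_message, str(e)
        if expected_sql_state is not None:
            assert e.sql_state == expected_sql_state, e.sql_state
        return
    raise AssertionError("expected an error, but none was raised")
```

Checking the entire message (rather than a substring) is what catches drift in the `messageParameters` formatting shown in the Scala snippets above.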
[jira] [Assigned] (SPARK-38735) Test the error class: INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-38735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38735: Assignee: (was: Apache Spark) > Test the error class: INTERNAL_ERROR > > > Key: SPARK-38735 > URL: https://issues.apache.org/jira/browse/SPARK-38735 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add tests for the error class *INTERNAL_ERROR* to QueryExecutionErrorsSuite. > The test should cover the exception throw in QueryExecutionErrors: > {code:scala} > def logicalHintOperatorNotRemovedDuringAnalysisError(): Throwable = { > new SparkIllegalStateException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > "Internal error: logical hint operator should have been removed > during analysis")) > } > def cannotEvaluateExpressionError(expression: Expression): Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot evaluate expression: $expression")) > } > def cannotGenerateCodeForExpressionError(expression: Expression): Throwable > = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot generate code for expression: > $expression")) > } > def cannotTerminateGeneratorError(generator: UnresolvedGenerator): > Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot terminate expression: $generator")) > } > def methodNotDeclaredError(name: String): Throwable = { > new SparkNoSuchMethodException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > s"""A method named "$name" is not declared in any enclosing class nor > any supertype""")) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > 
https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38735) Test the error class: INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-38735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695261#comment-17695261 ] Apache Spark commented on SPARK-38735: -- User 'the8thC' has created a pull request for this issue: https://github.com/apache/spark/pull/40236 > Test the error class: INTERNAL_ERROR > > > Key: SPARK-38735 > URL: https://issues.apache.org/jira/browse/SPARK-38735 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add tests for the error class *INTERNAL_ERROR* to QueryExecutionErrorsSuite. > The test should cover the exception throw in QueryExecutionErrors: > {code:scala} > def logicalHintOperatorNotRemovedDuringAnalysisError(): Throwable = { > new SparkIllegalStateException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > "Internal error: logical hint operator should have been removed > during analysis")) > } > def cannotEvaluateExpressionError(expression: Expression): Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot evaluate expression: $expression")) > } > def cannotGenerateCodeForExpressionError(expression: Expression): Throwable > = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot generate code for expression: > $expression")) > } > def cannotTerminateGeneratorError(generator: UnresolvedGenerator): > Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot terminate expression: $generator")) > } > def methodNotDeclaredError(name: String): Throwable = { > new SparkNoSuchMethodException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > s"""A method named "$name" is not declared in any enclosing class nor > any supertype""")) > } > {code} > For example, here is a test for the error class 
*UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38735) Test the error class: INTERNAL_ERROR
[ https://issues.apache.org/jira/browse/SPARK-38735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38735: Assignee: Apache Spark > Test the error class: INTERNAL_ERROR > > > Key: SPARK-38735 > URL: https://issues.apache.org/jira/browse/SPARK-38735 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Add tests for the error class *INTERNAL_ERROR* to QueryExecutionErrorsSuite. > The test should cover the exception throw in QueryExecutionErrors: > {code:scala} > def logicalHintOperatorNotRemovedDuringAnalysisError(): Throwable = { > new SparkIllegalStateException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > "Internal error: logical hint operator should have been removed > during analysis")) > } > def cannotEvaluateExpressionError(expression: Expression): Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot evaluate expression: $expression")) > } > def cannotGenerateCodeForExpressionError(expression: Expression): Throwable > = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot generate code for expression: > $expression")) > } > def cannotTerminateGeneratorError(generator: UnresolvedGenerator): > Throwable = { > new SparkUnsupportedOperationException(errorClass = "INTERNAL_ERROR", > messageParameters = Array(s"Cannot terminate expression: $generator")) > } > def methodNotDeclaredError(name: String): Throwable = { > new SparkNoSuchMethodException(errorClass = "INTERNAL_ERROR", > messageParameters = Array( > s"""A method named "$name" is not declared in any enclosing class nor > any supertype""")) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > 
https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenhao Li updated SPARK-42635: --- Description: # When the time is close to a daylight saving time transition, the result may be discontinuous and non-monotonic. We currently have:
{code:scala}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
+-------------------------------------------------------------------------+
|timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
+-------------------------------------------------------------------------+
|                                                      2011-03-13 03:59:59|
+-------------------------------------------------------------------------+

scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
+-------------------------------------------------------------------+
|timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
+-------------------------------------------------------------------+
|                                                2011-03-13 03:00:00|
+-------------------------------------------------------------------+
{code}
In the second query, adding one more second sets the time back one hour instead. Moreover, there are only {{23 * 3600}} seconds from {{2011-03-12 03:00:00}} to {{2011-03-13 03:00:00}}, not {{24 * 3600}} seconds, due to the daylight saving time transition. The root cause is that the Spark code at [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and performs the day and time-in-day split before consulting the time zone. 2. Adding month, quarter, and year silently ignores Int overflow during unit conversion. The root cause is [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]: {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking for overflow. Note that we do check for overflow when adding the amount to the timestamp, so the behavior is inconsistent.
This can cause counter-intuitive results like this:
{code:scala}
scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
+------------------------------------------------------------------+
|timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
+------------------------------------------------------------------+
|                                               1969-09-01 00:00:00|
+------------------------------------------------------------------+
{code}
3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion. This is similar to the previous problem:
{code:scala}
scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
+-------------------------------------------------------------+
|timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
+-------------------------------------------------------------+
|-290308-12-22 15:58:10.448384                                |
+-------------------------------------------------------------+
{code}
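The 1969-09-01 result in the quarter example follows from 32-bit wraparound: 1431655764 quarters is 4294967292 months, which exceeds Int.MaxValue and wraps to -4, i.e. four months before the epoch. A sketch of the arithmetic, with plain Python standing in for Scala's silent Int multiplication:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

quantity = 1431655764          # quarters
months = quantity * 3          # 4294967292, well above INT_MAX

# Emulate Scala Int multiplication, which wraps around silently
wrapped = (months - INT_MIN) % 2**32 + INT_MIN
print(wrapped)  # -4: four months before 1970-01-01 is 1969-09-01
```

Using Math.multiplyExact (or Scala's multiplyExact equivalent) during the unit conversion would raise instead of wrapping, matching the overflow check already done when the amount is added to the timestamp.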
[jira] [Commented] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695254#comment-17695254 ] Apache Spark commented on SPARK-42631: -- User 'tomvanbussel' has created a pull request for this issue: https://github.com/apache/spark/pull/40234 > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Priority: Major >
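The jump to year -290308 in the sub-month example is the same wraparound at 64 bits: 106751992 days is 9223372108800000000 microseconds, just past Long.MaxValue, so the converted amount wraps to a huge negative number. A sketch of the arithmetic, again emulating Scala's silent wraparound in plain Python:

```python
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1
MICROS_PER_DAY = 86_400 * 1_000_000

days = 106751992
micros = days * MICROS_PER_DAY   # 9223372108800000000 > LONG_MAX

# Emulate Scala Long multiplication, which wraps around silently
wrapped = (micros - LONG_MIN) % 2**64 + LONG_MIN
print(wrapped < 0)  # True: the amount actually added is hugely negative,
                    # which is why the result lands far before the epoch
```

As with the Int case, an exact (overflow-checked) multiplication during unit conversion would surface an error instead of producing a nonsensical timestamp.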
[jira] [Assigned] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42631: Assignee: Apache Spark > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42631: Assignee: (was: Apache Spark) > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Priority: Major >
[jira] [Commented] (SPARK-42631) Support custom extensions in Spark Connect Scala client
[ https://issues.apache.org/jira/browse/SPARK-42631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695252#comment-17695252 ] Apache Spark commented on SPARK-42631: -- User 'tomvanbussel' has created a pull request for this issue: https://github.com/apache/spark/pull/40234 > Support custom extensions in Spark Connect Scala client > --- > > Key: SPARK-42631 > URL: https://issues.apache.org/jira/browse/SPARK-42631 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Tom van Bussel >Priority: Major >
[jira] [Updated] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
> Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Priority: Major >
[jira] [Updated] (SPARK-39993) Spark on Kubernetes doesn't filter data by date
[ https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanna Liashchuk updated SPARK-39993: Affects Version/s: (was: 3.3.2) > Spark on Kubernetes doesn't filter data by date > --- > > Key: SPARK-39993 > URL: https://issues.apache.org/jira/browse/SPARK-39993 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 > Environment: Kubernetes v1.23.6 > Spark 3.2.2 > Java 1.8.0_312 > Python 3.9.13 > Aws dependencies: > aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar >Reporter: Hanna Liashchuk >Priority: Major > Labels: kubernetes > > I'm creating a Dataset with a date-typed column and saving it to S3. When I read it back and use a where() clause, I've noticed it doesn't return the data even though it is there. > Below is the code snippet I'm running:
> {code:python}
> from pyspark.sql.functions import col, lit
> ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date",
> col("date").cast("date"))
> ds.where("date = '2022-01-01'").show()
> ds.write.mode("overwrite").parquet("s3a://bucket/test")
> df = spark.read.format("parquet").load("s3a://bucket/test")
> df.where("date = '2022-01-01'").show()
> {code}
> The first show() returns data, while the second one does not. > I've noticed that it is related to the Kubernetes master, as the same code snippet works fine with master "local". > UPD: if the column is used as a partition column and has the type "date", there is no filtering problem. > >