[jira] [Assigned] (SPARK-39217) Makes DPP support the pruning side has Union
[ https://issues.apache.org/jira/browse/SPARK-39217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39217:
------------------------------------

    Assignee: Apache Spark

> Makes DPP support the pruning side has Union
> --------------------------------------------
>
> Key: SPARK-39217
> URL: https://issues.apache.org/jira/browse/SPARK-39217
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Assignee: Apache Spark
> Priority: Major
>
> Support the following case:
> {noformat}
> SELECT f.store_id,
>        f.date_id,
>        s.state_province
> FROM (SELECT 4 AS store_id,
>              date_id,
>              product_id
>       FROM fact_sk
>       WHERE date_id >= 1300
>       UNION ALL
>       SELECT store_id,
>              date_id,
>              product_id
>       FROM fact_stats
>       WHERE date_id <= 1000) f
> JOIN dim_store s
>   ON f.store_id = s.store_id
> WHERE s.country IN ('US', 'NL')
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39217) Makes DPP support the pruning side has Union
[ https://issues.apache.org/jira/browse/SPARK-39217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538579#comment-17538579 ]

Apache Spark commented on SPARK-39217:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/36588

> Makes DPP support the pruning side has Union
> --------------------------------------------
>
> Key: SPARK-39217
> URL: https://issues.apache.org/jira/browse/SPARK-39217
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> Support the following case:
> {noformat}
> SELECT f.store_id,
>        f.date_id,
>        s.state_province
> FROM (SELECT 4 AS store_id,
>              date_id,
>              product_id
>       FROM fact_sk
>       WHERE date_id >= 1300
>       UNION ALL
>       SELECT store_id,
>              date_id,
>              product_id
>       FROM fact_stats
>       WHERE date_id <= 1000) f
> JOIN dim_store s
>   ON f.store_id = s.store_id
> WHERE s.country IN ('US', 'NL')
> {noformat}
[jira] [Assigned] (SPARK-39217) Makes DPP support the pruning side has Union
[ https://issues.apache.org/jira/browse/SPARK-39217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39217:
------------------------------------

    Assignee: (was: Apache Spark)

> Makes DPP support the pruning side has Union
> --------------------------------------------
>
> Key: SPARK-39217
> URL: https://issues.apache.org/jira/browse/SPARK-39217
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> Support the following case:
> {noformat}
> SELECT f.store_id,
>        f.date_id,
>        s.state_province
> FROM (SELECT 4 AS store_id,
>              date_id,
>              product_id
>       FROM fact_sk
>       WHERE date_id >= 1300
>       UNION ALL
>       SELECT store_id,
>              date_id,
>              product_id
>       FROM fact_stats
>       WHERE date_id <= 1000) f
> JOIN dim_store s
>   ON f.store_id = s.store_id
> WHERE s.country IN ('US', 'NL')
> {noformat}
[jira] [Resolved] (SPARK-39214) Improve errors related to CAST
[ https://issues.apache.org/jira/browse/SPARK-39214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk resolved SPARK-39214.
------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36553
[https://github.com/apache/spark/pull/36553]

> Improve errors related to CAST
> ------------------------------
>
> Key: SPARK-39214
> URL: https://issues.apache.org/jira/browse/SPARK-39214
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Fix For: 3.4.0
>
> 1. Rename the error classes INVALID_SYNTAX_FOR_CAST and CAST_CAUSES_OVERFLOW to make them more precise and clear.
> 2. Improve the error messages of these error classes (use quotes for SQL config and function names).
[jira] [Created] (SPARK-39217) Makes DPP support the pruning side has Union
Yuming Wang created SPARK-39217:
-----------------------------------

             Summary: Makes DPP support the pruning side has Union
                 Key: SPARK-39217
                 URL: https://issues.apache.org/jira/browse/SPARK-39217
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Yuming Wang

Support the following case:
{noformat}
SELECT f.store_id,
       f.date_id,
       s.state_province
FROM (SELECT 4 AS store_id,
             date_id,
             product_id
      FROM fact_sk
      WHERE date_id >= 1300
      UNION ALL
      SELECT store_id,
             date_id,
             product_id
      FROM fact_stats
      WHERE date_id <= 1000) f
JOIN dim_store s
  ON f.store_id = s.store_id
WHERE s.country IN ('US', 'NL')
{noformat}
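The shape of the requested optimization can be sketched in plain Python (the data and the helper below are hypothetical, not Spark internals): dynamic partition pruning evaluates the dimension-side filter first, then pushes the surviving join keys into every branch of the union so that non-matching partitions are never scanned on either side.

```python
# Hedged sketch: dynamic partition pruning over a UNION ALL, in plain Python.
# Table and column names mirror the query in the issue; the data is made up.

dim_store = [
    {"store_id": 4, "state_province": "CA", "country": "US"},
    {"store_id": 7, "state_province": "ZH", "country": "NL"},
    {"store_id": 9, "state_province": "BY", "country": "DE"},
]

# Fact data partitioned by store_id, as the two union branches.
fact_sk = {4: [{"date_id": 1300}], 9: [{"date_id": 1400}]}
fact_stats = {7: [{"date_id": 900}], 9: [{"date_id": 800}]}

def pruned_scan(partitioned_fact, keep_ids):
    # Only partitions whose key survives the dimension filter are read.
    return [(sid, row) for sid, rows in partitioned_fact.items()
            if sid in keep_ids for row in rows]

# 1. Evaluate the dimension-side filter once.
keep_ids = {s["store_id"] for s in dim_store if s["country"] in ("US", "NL")}

# 2. Push the surviving keys into BOTH branches of the union.
result = pruned_scan(fact_sk, keep_ids) + pruned_scan(fact_stats, keep_ids)
```

With the filter above only store_ids 4 and 7 survive, so the partition for store_id 9 is skipped in both union branches instead of being scanned and joined away later.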
[jira] (SPARK-38615) Provide error context for runtime ANSI failures
[ https://issues.apache.org/jira/browse/SPARK-38615 ]

Gengliang Wang deleted comment on SPARK-38615:
----------------------------------------------

was (Author: gengliang.wang):
[~maxgekk] I am targeting this one in 3.3 as well. Since it is an error message improvement, let's try to finish as much as we can in 3.3. What do you think?

> Provide error context for runtime ANSI failures
> -----------------------------------------------
>
> Key: SPARK-38615
> URL: https://issues.apache.org/jira/browse/SPARK-38615
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Gengliang Wang
> Priority: Major
>
> Currently, there is not enough error context for runtime ANSI failures.
> In the following example, the error message only says that there is a "divide by zero" error, without pointing out which part of the SQL statement caused it.
> {code:java}
> SELECT
>   ss1.ca_county,
>   ss1.d_year,
>   ws2.web_sales / ws1.web_sales web_q1_q2_increase,
>   ss2.store_sales / ss1.store_sales store_q1_q2_increase,
>   ws3.web_sales / ws2.web_sales web_q2_q3_increase,
>   ss3.store_sales / ss2.store_sales store_q2_q3_increase
> FROM
>   ss ss1, ss ss2, ss ss3, ws ws1, ws ws2, ws ws3
> WHERE
>   ss1.d_qoy = 1
>   AND ss1.d_year = 2000
>   AND ss1.ca_county = ss2.ca_county
>   AND ss2.d_qoy = 2
>   AND ss2.d_year = 2000
>   AND ss2.ca_county = ss3.ca_county
>   AND ss3.d_qoy = 3
>   AND ss3.d_year = 2000
>   AND ss1.ca_county = ws1.ca_county
>   AND ws1.d_qoy = 1
>   AND ws1.d_year = 2000
>   AND ws1.ca_county = ws2.ca_county
>   AND ws2.d_qoy = 2
>   AND ws2.d_year = 2000
>   AND ws1.ca_county = ws3.ca_county
>   AND ws3.d_qoy = 3
>   AND ws3.d_year = 2000
>   AND CASE WHEN ws1.web_sales > 0
>       THEN ws2.web_sales / ws1.web_sales
>       ELSE NULL END
>       > CASE WHEN ss1.store_sales > 0
>       THEN ss2.store_sales / ss1.store_sales
>       ELSE NULL END
>   AND CASE WHEN ws2.web_sales > 0
>       THEN ws3.web_sales / ws2.web_sales
>       ELSE NULL END
>       > CASE WHEN ss2.store_sales > 0
>       THEN ss3.store_sales / ss2.store_sales
>       ELSE NULL END
> ORDER BY ss1.ca_county
> {code}
> {code:java}
> org.apache.spark.SparkArithmeticException: divide by zero
>     at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:140)
>     at org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:437)
>     at org.apache.spark.sql.catalyst.expressions.DivModLike.eval$(arithmetic.scala:425)
>     at org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:534)
> {code}
>
> I suggest that we provide details in the error message, including:
> * the problematic expression from the original SQL query, e.g. "ss3.store_sales / ss2.store_sales store_q2_q3_increase"
> * the line number and starting char position of the problematic expression, in case of queries like "select a + b from t1 union select a + b from t2"
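The suggestion in the two bullets above can be sketched in Python (the class and function names are illustrative, not Spark's API): the runtime error carries the originating SQL fragment and its position, so the message points at the failing expression rather than only naming the error kind.

```python
# Hedged sketch: attach the source fragment and its position to a runtime
# arithmetic error. All names here are hypothetical, not Spark's.

class ArithmeticErrorWithContext(ArithmeticError):
    def __init__(self, message, fragment, line, start):
        # Embed the originating expression and its location in the message.
        super().__init__(
            f"{message}\n== SQL (line {line}, position {start}) ==\n{fragment}"
        )
        self.fragment, self.line, self.start = fragment, line, start

def divide(numerator, denominator, fragment, line, start):
    # The evaluator passes along where in the query this division came from.
    if denominator == 0:
        raise ArithmeticErrorWithContext("divide by zero", fragment, line, start)
    return numerator / denominator
```

The extra context costs nothing on the success path; the fragment and position are only rendered when the error is actually raised.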
[jira] [Resolved] (SPARK-39193) Improve the performance of inferring Timestamp type in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-39193.
------------------------------------
    Fix Version/s: 3.3.1
       Resolution: Fixed

Issue resolved by pull request 36562
[https://github.com/apache/spark/pull/36562]

> Improve the performance of inferring Timestamp type in JSON/CSV data source
> ---------------------------------------------------------------------------
>
> Key: SPARK-39193
> URL: https://issues.apache.org/jira/browse/SPARK-39193
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Fix For: 3.3.1
>
> When reading JSON/CSV files with timestamp inference enabled (`.option("inferTimestamp", true)`), the Timestamp conversion will throw and catch exceptions. As we put decent error messages in the exceptions, creating them is actually not cheap: it consumes more than 90% of the type inference time.
> We can use the parsing methods that return optional results instead.
> Before the change, it took 166 seconds to infer a JSON file of 624 MB with timestamp inference enabled.
> After the change, it takes only 16 seconds.
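The approach can be sketched in Python. One caveat up front: Python's `strptime` signals failure only by raising, so the sketch confines the try/except to a single small wrapper; the structural point (which matches the change described above) is that the inference loop branches on an optional result instead of driving type inference with raised, expensively-constructed exceptions.

```python
# Hedged sketch: exception-free type inference via an optional-return parser.
# Function names are illustrative, not Spark's.
from datetime import datetime

def parse_timestamp_opt(text, fmt="%Y-%m-%d %H:%M:%S"):
    # Return a datetime on success, None on failure. Callers never need to
    # catch anything, and no detailed error message is built per bad value.
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

def infer_column_type(values):
    # A column is inferred as timestamp only if every sampled value parses.
    if all(parse_timestamp_opt(v) is not None for v in values):
        return "timestamp"
    return "string"
```

For a mostly non-timestamp column, the exception-per-value approach pays the cost of constructing a rich error object for nearly every row; the optional-return approach pays only a cheap None check.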
[jira] [Updated] (SPARK-39216) Issue with correlated subquery and Union
[ https://issues.apache.org/jira/browse/SPARK-39216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allison Wang updated SPARK-39216:
---------------------------------

    Description:
SPARK-37915 added CollapseProject to the rule CombineUnions, but it shouldn't collapse projects that contain correlated subqueries, since these haven't yet been de-correlated (PullupCorrelatedPredicates).

Here is a simple example to reproduce this issue:
{code:java}
SELECT (SELECT IF(x, 1, 0)) AS a
FROM (SELECT true) t(x)
UNION
SELECT 1 AS a {code}
Exception:
{code:java}
java.lang.IllegalStateException: Couldn't find x#4 in [] {code}

  was:
SPARK-37915 added CollapseProject to the rule CombineUnions, but it shouldn't collapse projects that contain correlated subqueries, since these haven't yet been de-correlated (PullupCorrelatedPredicates).

Here is a simple example to reproduce this issue:
{code:java}
SELECT (SELECT IF(x, 1, 0)) AS a
FROM (SELECT true) t(x)
UNION
SELECT 1 AS a {code}
Exception:
{code:java}
java.lang.IllegalStateException: Couldn't find x#4 in [] {code}

> Issue with correlated subquery and Union
> ----------------------------------------
>
> Key: SPARK-39216
> URL: https://issues.apache.org/jira/browse/SPARK-39216
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Allison Wang
> Priority: Major
>
> SPARK-37915 added CollapseProject to the rule CombineUnions, but it shouldn't collapse projects that contain correlated subqueries, since these haven't yet been de-correlated (PullupCorrelatedPredicates).
> Here is a simple example to reproduce this issue:
> {code:java}
> SELECT (SELECT IF(x, 1, 0)) AS a
> FROM (SELECT true) t(x)
> UNION
> SELECT 1 AS a {code}
> Exception:
> {code:java}
> java.lang.IllegalStateException: Couldn't find x#4 in [] {code}
[jira] [Created] (SPARK-39216) Issue with correlated subquery and Union
Allison Wang created SPARK-39216:
------------------------------------

             Summary: Issue with correlated subquery and Union
                 Key: SPARK-39216
                 URL: https://issues.apache.org/jira/browse/SPARK-39216
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Allison Wang

SPARK-37915 added CollapseProject to the rule CombineUnions, but it shouldn't collapse projects that contain correlated subqueries, since these haven't yet been de-correlated (PullupCorrelatedPredicates).

Here is a simple example to reproduce this issue:
{code:java}
SELECT (SELECT IF(x, 1, 0)) AS a
FROM (SELECT true) t(x)
UNION
SELECT 1 AS a {code}
Exception:
{code:java}
java.lang.IllegalStateException: Couldn't find x#4 in [] {code}
[jira] [Assigned] (SPARK-39215) Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
[ https://issues.apache.org/jira/browse/SPARK-39215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39215:
------------------------------------

    Assignee: Apache Spark

> Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
> -----------------------------------------------------------------
>
> Key: SPARK-39215
> URL: https://issues.apache.org/jira/browse/SPARK-39215
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Hyukjin Kwon
> Assignee: Apache Spark
> Priority: Major
>
> Here:
> https://github.com/apache/spark/blob/master/python/pyspark/sql/utils.py#L296-L302
> It unnecessarily accesses the JVM too often. We can just have a single method to avoid that.
[jira] [Assigned] (SPARK-39215) Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
[ https://issues.apache.org/jira/browse/SPARK-39215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39215:
------------------------------------

    Assignee: (was: Apache Spark)

> Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
> -----------------------------------------------------------------
>
> Key: SPARK-39215
> URL: https://issues.apache.org/jira/browse/SPARK-39215
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> Here:
> https://github.com/apache/spark/blob/master/python/pyspark/sql/utils.py#L296-L302
> It unnecessarily accesses the JVM too often. We can just have a single method to avoid that.
[jira] [Commented] (SPARK-39215) Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
[ https://issues.apache.org/jira/browse/SPARK-39215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538529#comment-17538529 ]

Apache Spark commented on SPARK-39215:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36587

> Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
> -----------------------------------------------------------------
>
> Key: SPARK-39215
> URL: https://issues.apache.org/jira/browse/SPARK-39215
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> Here:
> https://github.com/apache/spark/blob/master/python/pyspark/sql/utils.py#L296-L302
> It unnecessarily accesses the JVM too often. We can just have a single method to avoid that.
[jira] [Created] (SPARK-39215) Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
Hyukjin Kwon created SPARK-39215:
------------------------------------

             Summary: Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred
                 Key: SPARK-39215
                 URL: https://issues.apache.org/jira/browse/SPARK-39215
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.3.0
            Reporter: Hyukjin Kwon

Here:
https://github.com/apache/spark/blob/master/python/pyspark/sql/utils.py#L296-L302

It unnecessarily accesses the JVM too often. We can just have a single method to avoid that.
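The idea can be sketched with a stand-in for the Py4J bridge (all names below are hypothetical, not PySpark's): every Python-to-JVM method call is a cross-process round trip, so collapsing several conf lookups into one JVM-side method cuts the round-trip count without changing the answer.

```python
# Hedged sketch: count cross-boundary calls before and after consolidation.
# FakeJvmBridge stands in for the Py4J gateway; each method call on it
# represents one Python <-> JVM round trip.

class FakeJvmBridge:
    def __init__(self):
        self.round_trips = 0
        self._conf = {"spark.sql.timestampType": "TIMESTAMP_NTZ",
                      "spark.sql.ansi.enabled": "false"}

    def get_conf(self, key):
        self.round_trips += 1          # one round trip per conf lookup
        return self._conf[key]

    def is_timestamp_ntz_preferred(self):
        # One consolidated JVM-side check replaces several conf lookups.
        self.round_trips += 1
        return self._conf["spark.sql.timestampType"] == "TIMESTAMP_NTZ"

# Before: each conf access is a separate cross-process call.
jvm = FakeJvmBridge()
preferred = (jvm.get_conf("spark.sql.timestampType") == "TIMESTAMP_NTZ"
             and jvm.get_conf("spark.sql.ansi.enabled") is not None)
calls_before = jvm.round_trips

# After: a single method on the JVM side answers the whole question.
jvm = FakeJvmBridge()
preferred_single = jvm.is_timestamp_ntz_preferred()
calls_after = jvm.round_trips
```

The same answer is computed either way; only the number of bridge crossings changes, which is exactly what matters for Py4J latency.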
[jira] [Resolved] (SPARK-39054) GroupByTest failed due to axis Length mismatch
[ https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39054.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36581
[https://github.com/apache/spark/pull/36581]

> GroupByTest failed due to axis Length mismatch
> ----------------------------------------------
>
> Key: SPARK-39054
> URL: https://issues.apache.org/jira/browse/SPARK-39054
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Yikun Jiang
> Assignee: Apache Spark
> Priority: Major
> Fix For: 3.4.0
>
> {code:java}
> An error occurred while calling o27083.getResult.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
>     at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
>     at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
>     at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>     at py4j.Gateway.invoke(Gateway.java:282)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>     at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in stage 808.0 (TID 650) (localhost executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
>     process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
>     serializer.dump_stream(out_iter, outfile)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 343, in dump_stream
>     return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 84, in dump_stream
>     for batch in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 336, in init_stream_yield_batches
>     for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, in mapper
>     return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, in <lambda>
>     return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, in wrapped
>     result = f(pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in wrapper
>     return f(*args, **kwargs)
>   File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in rename_output
>     pdf.columns = return_schema.names
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 5588, in __setattr__
>     return object.__setattr__(self, name, value)
>   File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 769, in _set_axis
>     self._mgr.set_axis(axis, labels)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", line 214, in set_axis
>     self._validate_set_axis(axis, new_labels)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", line 69, in _validate_set_axis
>     raise ValueError(
> ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements {code}
>
> GroupByTest.test_apply_with_new_dataframe_without_shortcut
[jira] [Assigned] (SPARK-39054) GroupByTest failed due to axis Length mismatch
[ https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39054:
------------------------------------

    Assignee: Apache Spark

> GroupByTest failed due to axis Length mismatch
> ----------------------------------------------
>
> Key: SPARK-39054
> URL: https://issues.apache.org/jira/browse/SPARK-39054
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Yikun Jiang
> Assignee: Apache Spark
> Priority: Major
>
> {code:java}
> An error occurred while calling o27083.getResult.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
>     at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
>     at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
>     at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>     at py4j.Gateway.invoke(Gateway.java:282)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>     at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in stage 808.0 (TID 650) (localhost executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
>     process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
>     serializer.dump_stream(out_iter, outfile)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 343, in dump_stream
>     return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 84, in dump_stream
>     for batch in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 336, in init_stream_yield_batches
>     for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, in mapper
>     return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, in <lambda>
>     return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, in wrapped
>     result = f(pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in wrapper
>     return f(*args, **kwargs)
>   File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in rename_output
>     pdf.columns = return_schema.names
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 5588, in __setattr__
>     return object.__setattr__(self, name, value)
>   File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 769, in _set_axis
>     self._mgr.set_axis(axis, labels)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", line 214, in set_axis
>     self._validate_set_axis(axis, new_labels)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", line 69, in _validate_set_axis
>     raise ValueError(
> ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements {code}
>
> GroupByTest.test_apply_with_new_dataframe_without_shortcut
[jira] [Assigned] (SPARK-39192) make pandas-on-spark's kurt consistent with pandas
[ https://issues.apache.org/jira/browse/SPARK-39192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39192:
------------------------------------

    Assignee: zhengruifeng

> make pandas-on-spark's kurt consistent with pandas
> --------------------------------------------------
>
> Key: SPARK-39192
> URL: https://issues.apache.org/jira/browse/SPARK-39192
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Minor
[jira] [Resolved] (SPARK-39192) make pandas-on-spark's kurt consistent with pandas
[ https://issues.apache.org/jira/browse/SPARK-39192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39192.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36560
[https://github.com/apache/spark/pull/36560]

> make pandas-on-spark's kurt consistent with pandas
> --------------------------------------------------
>
> Key: SPARK-39192
> URL: https://issues.apache.org/jira/browse/SPARK-39192
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Resolved] (SPARK-39143) Support CSV file scans with DEFAULT values
[ https://issues.apache.org/jira/browse/SPARK-39143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39143.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36501
[https://github.com/apache/spark/pull/36501]

> Support CSV file scans with DEFAULT values
> ------------------------------------------
>
> Key: SPARK-39143
> URL: https://issues.apache.org/jira/browse/SPARK-39143
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Daniel
> Assignee: Daniel
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-39143) Support CSV file scans with DEFAULT values
[ https://issues.apache.org/jira/browse/SPARK-39143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39143:
------------------------------------

    Assignee: Daniel

> Support CSV file scans with DEFAULT values
> ------------------------------------------
>
> Key: SPARK-39143
> URL: https://issues.apache.org/jira/browse/SPARK-39143
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Daniel
> Assignee: Daniel
> Priority: Major
[jira] [Resolved] (SPARK-39104) Null Pointer Exception on unpersist call
[ https://issues.apache.org/jira/browse/SPARK-39104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-39104.
----------------------------------
    Fix Version/s: 3.3.1
                   3.2.2
                   3.4.0
       Resolution: Fixed

Issue resolved by pull request 36496
[https://github.com/apache/spark/pull/36496]

> Null Pointer Exception on unpersist call
> ----------------------------------------
>
> Key: SPARK-39104
> URL: https://issues.apache.org/jira/browse/SPARK-39104
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Reporter: Denis
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.3.1, 3.2.2, 3.4.0
>
> DataFrame.unpersist call fails with NPE:
> {code:java}
> java.lang.NullPointerException
>     at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedRDDLoaded(InMemoryRelation.scala:247)
>     at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedColumnBuffersLoaded(InMemoryRelation.scala:241)
>     at org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8(CacheManager.scala:189)
>     at org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8$adapted(CacheManager.scala:176)
>     at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>     at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>     at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>     at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>     at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>     at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>     at org.apache.spark.sql.execution.CacheManager.recacheByCondition(CacheManager.scala:219)
>     at org.apache.spark.sql.execution.CacheManager.uncacheQuery(CacheManager.scala:176)
>     at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3220)
>     at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3231){code}
> Looks like synchronization is required for org.apache.spark.sql.execution.columnar.CachedRDDBuilder#isCachedColumnBuffersLoaded:
> {code:java}
> def isCachedColumnBuffersLoaded: Boolean = {
>   _cachedColumnBuffers != null && isCachedRDDLoaded
> }
>
> def isCachedRDDLoaded: Boolean = {
>   _cachedColumnBuffersAreLoaded || {
>     val bmMaster = SparkEnv.get.blockManager.master
>     val rddLoaded = _cachedColumnBuffers.partitions.forall { partition =>
>       bmMaster.getBlockStatus(RDDBlockId(_cachedColumnBuffers.id, partition.index), false)
>         .exists { case (_, blockStatus) => blockStatus.isCached }
>     }
>     if (rddLoaded) {
>       _cachedColumnBuffersAreLoaded = rddLoaded
>     }
>     rddLoaded
>   }
> } {code}
> isCachedRDDLoaded relies on the _cachedColumnBuffers != null check, while the field can be changed concurrently from another thread.
[jira] [Assigned] (SPARK-39104) Null Pointer Exception on unpersist call
[ https://issues.apache.org/jira/browse/SPARK-39104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-39104: Assignee: Cheng Pan > Null Pointer Exeption on unpersist call > --- > > Key: SPARK-39104 > URL: https://issues.apache.org/jira/browse/SPARK-39104 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Denis >Assignee: Cheng Pan >Priority: Major > > DataFrame.unpesist call fails wth NPE > > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedRDDLoaded(InMemoryRelation.scala:247) > at > org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedColumnBuffersLoaded(InMemoryRelation.scala:241) > at > org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8(CacheManager.scala:189) > at > org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8$adapted(CacheManager.scala:176) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:395) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:395) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > org.apache.spark.sql.execution.CacheManager.recacheByCondition(CacheManager.scala:219) > at > 
org.apache.spark.sql.execution.CacheManager.uncacheQuery(CacheManager.scala:176) > at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3220) > at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3231){code} > Looks like synchronization is required for > org.apache.spark.sql.execution.columnar.CachedRDDBuilder#isCachedColumnBuffersLoaded > > {code:java} > def isCachedColumnBuffersLoaded: Boolean = { > _cachedColumnBuffers != null && isCachedRDDLoaded > } > def isCachedRDDLoaded: Boolean = { > _cachedColumnBuffersAreLoaded || { > val bmMaster = SparkEnv.get.blockManager.master > val rddLoaded = _cachedColumnBuffers.partitions.forall { partition => > bmMaster.getBlockStatus(RDDBlockId(_cachedColumnBuffers.id, > partition.index), false) > .exists { case(_, blockStatus) => blockStatus.isCached } > } > if (rddLoaded) { > _cachedColumnBuffersAreLoaded = rddLoaded > } > rddLoaded > } > } {code} > isCachedRDDLoaded relies on the _cachedColumnBuffers != null check, while that field can > be changed concurrently from another thread. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
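The hazard the reporter describes is a classic check-then-act race: the null check and the subsequent dereference are two separate steps, so another thread can clear the field in between. A minimal Python sketch of the pattern and its fix (the class and field names here merely stand in for `CachedRDDBuilder` and `_cachedColumnBuffers`; this is not Spark code):

```python
import threading

class CachedBuffers:
    """Toy stand-in for CachedRDDBuilder, illustrating the race only."""

    def __init__(self):
        self._buffers = ["block-0", "block-1"]  # stands in for _cachedColumnBuffers
        self._lock = threading.Lock()

    def clear(self):
        # Another thread (e.g. an unpersist call) may null the field at any time.
        with self._lock:
            self._buffers = None

    def is_loaded_unsafe(self):
        # Mirrors the reported pattern: check and use are separate steps, so
        # clear() can run between them and the second step blows up.
        if self._buffers is not None:
            return len(self._buffers) > 0
        return False

    def is_loaded_safe(self):
        # Take one consistent snapshot under the lock, then use the snapshot.
        with self._lock:
            buffers = self._buffers
        return buffers is not None and len(buffers) > 0

c = CachedBuffers()
assert c.is_loaded_safe()
c.clear()
assert not c.is_loaded_safe()
```

Reading the field once into a local (or holding a lock across check and use) is the standard remedy; which of the two the Spark fix adopts is up to the linked pull request.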
[jira] [Commented] (SPARK-39213) Create ANY_VALUE aggregate function
[ https://issues.apache.org/jira/browse/SPARK-39213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538447#comment-17538447 ] Apache Spark commented on SPARK-39213: -- User 'vli-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36584 > Create ANY_VALUE aggregate function > --- > > Key: SPARK-39213 > URL: https://issues.apache.org/jira/browse/SPARK-39213 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Priority: Major > > This is a feature request to add an {{ANY_VALUE}} aggregate function. This > would consume input values and quickly return any arbitrary element. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39213) Create ANY_VALUE aggregate function
[ https://issues.apache.org/jira/browse/SPARK-39213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39213: Assignee: (was: Apache Spark) > Create ANY_VALUE aggregate function > --- > > Key: SPARK-39213 > URL: https://issues.apache.org/jira/browse/SPARK-39213 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Priority: Major > > This is a feature request to add an {{ANY_VALUE}} aggregate function. This > would consume input values and quickly return any arbitrary element. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39213) Create ANY_VALUE aggregate function
[ https://issues.apache.org/jira/browse/SPARK-39213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39213: Assignee: Apache Spark > Create ANY_VALUE aggregate function > --- > > Key: SPARK-39213 > URL: https://issues.apache.org/jira/browse/SPARK-39213 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Assignee: Apache Spark >Priority: Major > > This is a feature request to add an {{ANY_VALUE}} aggregate function. This > would consume input values and quickly return any arbitrary element. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39213) Create ANY_VALUE aggregate function
[ https://issues.apache.org/jira/browse/SPARK-39213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538445#comment-17538445 ] Apache Spark commented on SPARK-39213: -- User 'vli-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36584 > Create ANY_VALUE aggregate function > --- > > Key: SPARK-39213 > URL: https://issues.apache.org/jira/browse/SPARK-39213 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Vitalii Li >Priority: Major > > This is a feature request to add an {{ANY_VALUE}} aggregate function. This > would consume input values and quickly return any arbitrary element. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
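For readers unfamiliar with the proposed semantics: ANY_VALUE may return any one element of a group, which lets an engine short-circuit instead of scanning deterministically. A toy Python model of those semantics (illustrative only, not the Spark implementation; "first non-null seen" is simply one valid choice of "any value"):

```python
def any_value(values, ignore_nulls=True):
    """Return an arbitrary element of `values`; here, the first non-None one."""
    for v in values:
        if v is not None or not ignore_nulls:
            return v
    return None  # empty (or all-null) group

# Group some rows by key, then pick any value per group.
rows = [("US", 10), ("US", 20), ("NL", 7)]
groups = {}
for country, amount in rows:
    groups.setdefault(country, []).append(amount)

result = {k: any_value(vs) for k, vs in groups.items()}
assert result == {"US": 10, "NL": 7}  # {"US": 20, "NL": 7} would be equally valid
```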
[jira] [Assigned] (SPARK-39212) Use double quotes for values of SQL configs/DS options in error messages
[ https://issues.apache.org/jira/browse/SPARK-39212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39212: Assignee: Max Gekk (was: Apache Spark) > Use double quotes for values of SQL configs/DS options in error messages > > > Key: SPARK-39212 > URL: https://issues.apache.org/jira/browse/SPARK-39212 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > All SQL configs/DS option values should be printed in SQL style in error > messages, and wrapped by double quotes. For example, the value true of the > config spark.sql.ansi.enabled should be highlighted as "true" to make it more > visible in error messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39212) Use double quotes for values of SQL configs/DS options in error messages
[ https://issues.apache.org/jira/browse/SPARK-39212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538379#comment-17538379 ] Apache Spark commented on SPARK-39212: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36579 > Use double quotes for values of SQL configs/DS options in error messages > > > Key: SPARK-39212 > URL: https://issues.apache.org/jira/browse/SPARK-39212 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > All SQL configs/DS option values should be printed in SQL style in error > messages, and wrapped by double quotes. For example, the value true of the > config spark.sql.ansi.enabled should be highlighted as "true" to make it more > visible in error messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39212) Use double quotes for values of SQL configs/DS options in error messages
[ https://issues.apache.org/jira/browse/SPARK-39212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39212: Assignee: Apache Spark (was: Max Gekk) > Use double quotes for values of SQL configs/DS options in error messages > > > Key: SPARK-39212 > URL: https://issues.apache.org/jira/browse/SPARK-39212 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > All SQL configs/DS option values should be printed in SQL style in error > messages, and wrapped by double quotes. For example, the value true of the > config spark.sql.ansi.enabled should be highlighted as "true" to make it more > visible in error messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
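The convention SPARK-39212 describes can be sketched as a small formatting helper. The helper name below is hypothetical (not Spark's actual API); the point is only that config/option *values* get wrapped in double quotes when interpolated into error text:

```python
def to_sql_value(value) -> str:
    """Render a config/option value SQL-style, wrapped in double quotes."""
    if isinstance(value, bool):
        # SQL-style booleans are lowercase: True -> "true"
        return '"true"' if value else '"false"'
    return f'"{value}"'

msg = (
    "Cannot evaluate the expression: set the config spark.sql.ansi.enabled to "
    + to_sql_value(False) + " to fall back to the legacy behavior."
)
assert '"false"' in msg
```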
[jira] [Assigned] (SPARK-39214) Improve errors related to CAST
[ https://issues.apache.org/jira/browse/SPARK-39214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39214: Assignee: Apache Spark (was: Max Gekk) > Improve errors related to CAST > -- > > Key: SPARK-39214 > URL: https://issues.apache.org/jira/browse/SPARK-39214 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > 1. Rename the error classes INVALID_SYNTAX_FOR_CAST and CAST_CAUSES_OVERFLOW > to make them more precise and clear. > 2. Improve error messages of the error classes (use quotes for SQL config and > function names). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39214) Improve errors related to CAST
[ https://issues.apache.org/jira/browse/SPARK-39214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538378#comment-17538378 ] Apache Spark commented on SPARK-39214: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36553 > Improve errors related to CAST > -- > > Key: SPARK-39214 > URL: https://issues.apache.org/jira/browse/SPARK-39214 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > 1. Rename the error classes INVALID_SYNTAX_FOR_CAST and CAST_CAUSES_OVERFLOW > to make them more precise and clear. > 2. Improve error messages of the error classes (use quotes for SQL config and > function names). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39214) Improve errors related to CAST
[ https://issues.apache.org/jira/browse/SPARK-39214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39214: Assignee: Max Gekk (was: Apache Spark) > Improve errors related to CAST > -- > > Key: SPARK-39214 > URL: https://issues.apache.org/jira/browse/SPARK-39214 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > 1. Rename the error classes INVALID_SYNTAX_FOR_CAST and CAST_CAUSES_OVERFLOW > to make them more precise and clear. > 2. Improve error messages of the error classes (use quotes for SQL config and > function names). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39214) Improve errors related to CAST
Max Gekk created SPARK-39214: Summary: Improve errors related to CAST Key: SPARK-39214 URL: https://issues.apache.org/jira/browse/SPARK-39214 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk 1. Rename the error classes INVALID_SYNTAX_FOR_CAST and CAST_CAUSES_OVERFLOW to make them more precise and clear. 2. Improve error messages of the error classes (use quotes for SQL config and function names). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
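A sketch of the two changes SPARK-39214 asks for, in Python for illustration. The "new" class names below are assumptions made up for this example (the ticket does not state them; the actual names are decided in the linked pull request), and the quoting of function names follows the convention from SPARK-39212:

```python
# Hypothetical rename table; the right-hand names are illustrative only.
RENAMES = {
    "INVALID_SYNTAX_FOR_CAST": "CAST_INVALID_INPUT",
    "CAST_CAUSES_OVERFLOW": "CAST_OVERFLOW",
}

def error_class(old_name: str) -> str:
    """Resolve an error-class name through the rename table."""
    return RENAMES.get(old_name, old_name)

def cast_error(func: str, value: str, target: str) -> str:
    # Per the improved-message convention, the function name is quoted.
    return f'The value {value!r} cannot be cast to {target} by "{func}".'

assert error_class("CAST_CAUSES_OVERFLOW") == "CAST_OVERFLOW"
assert '"CAST"' in cast_error("CAST", "1e40", "INT")
```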
[jira] [Created] (SPARK-39213) Create ANY_VALUE aggregate function
Vitalii Li created SPARK-39213: -- Summary: Create ANY_VALUE aggregate function Key: SPARK-39213 URL: https://issues.apache.org/jira/browse/SPARK-39213 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Vitalii Li This is a feature request to add an {{ANY_VALUE}} aggregate function. This would consume input values and quickly return any arbitrary element. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39212) Use double quotes for values of SQL configs/DS options in error messages
[ https://issues.apache.org/jira/browse/SPARK-39212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39212: - Description: All SQL configs/DS option values should be printed in SQL style in error messages, and wrapped by double quotes. For example, the value true of the config spark.sql.ansi.enabled should be highlighted as "true" to make it more visible in error messages. (was: All SQL configs should be printed in SQL style in error messages, and wrapped by double quotes. For example, the config spark.sql.ansi.enabled should be highlighted as "spark.sql.ansi.enabled" to make it more visible in error messages.) > Use double quotes for values of SQL configs/DS options in error messages > > > Key: SPARK-39212 > URL: https://issues.apache.org/jira/browse/SPARK-39212 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > All SQL configs/DS option values should be printed in SQL style in error > messages, and wrapped by double quotes. For example, the value true of the > config spark.sql.ansi.enabled should be highlighted as "true" to make it more > visible in error messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39212) Use double quotes for values of SQL configs/DS options in error messages
[ https://issues.apache.org/jira/browse/SPARK-39212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39212: - Fix Version/s: (was: 3.3.0) (was: 3.4.0) > Use double quotes for values of SQL configs/DS options in error messages > > > Key: SPARK-39212 > URL: https://issues.apache.org/jira/browse/SPARK-39212 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > All SQL configs should be printed in SQL style in error messages, and wrapped > by double quotes. For example, the config spark.sql.ansi.enabled should be > highlighted as "spark.sql.ansi.enabled" to make it more visible in error > messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39212) Use double quotes for values of SQL configs/DS options in error messages
[ https://issues.apache.org/jira/browse/SPARK-39212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39212: - Affects Version/s: (was: 3.3.0) > Use double quotes for values of SQL configs/DS options in error messages > > > Key: SPARK-39212 > URL: https://issues.apache.org/jira/browse/SPARK-39212 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > All SQL configs should be printed in SQL style in error messages, and wrapped > by double quotes. For example, the config spark.sql.ansi.enabled should be > highlighted as "spark.sql.ansi.enabled" to make it more visible in error > messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39212) Use double quotes for values of SQL configs/DS options in error messages
Max Gekk created SPARK-39212: Summary: Use double quotes for values of SQL configs/DS options in error messages Key: SPARK-39212 URL: https://issues.apache.org/jira/browse/SPARK-39212 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0, 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Fix For: 3.3.0, 3.4.0 All SQL configs should be printed in SQL style in error messages, and wrapped by double quotes. For example, the config spark.sql.ansi.enabled should be highlighted as "spark.sql.ansi.enabled" to make it more visible in error messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39211) Support JSON file scans with default values
[ https://issues.apache.org/jira/browse/SPARK-39211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39211: Assignee: Apache Spark > Support JSON file scans with default values > --- > > Key: SPARK-39211 > URL: https://issues.apache.org/jira/browse/SPARK-39211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39211) Support JSON file scans with default values
[ https://issues.apache.org/jira/browse/SPARK-39211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39211: Assignee: (was: Apache Spark) > Support JSON file scans with default values > --- > > Key: SPARK-39211 > URL: https://issues.apache.org/jira/browse/SPARK-39211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39211) Support JSON file scans with default values
[ https://issues.apache.org/jira/browse/SPARK-39211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538338#comment-17538338 ] Apache Spark commented on SPARK-39211: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/36583 > Support JSON file scans with default values > --- > > Key: SPARK-39211 > URL: https://issues.apache.org/jira/browse/SPARK-39211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39211) Support JSON file scans with default values
Daniel created SPARK-39211: -- Summary: Support JSON file scans with default values Key: SPARK-39211 URL: https://issues.apache.org/jira/browse/SPARK-39211 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Daniel -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
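The idea behind SPARK-39211 is that when a JSON record omits a column that has a declared DEFAULT value, the scan substitutes the default rather than null. A minimal Python sketch of that substitution (not Spark's implementation; the schema and default values are made up for illustration):

```python
import json

# Hypothetical column defaults, as if declared via DEFAULT in the table schema.
schema_defaults = {"store_id": 4, "state_province": "NL"}

def read_json_lines(lines, defaults):
    """Parse JSON-lines records, filling absent columns with their defaults."""
    for line in lines:
        record = json.loads(line)
        # A key missing from the record falls back to the column default.
        yield {col: record.get(col, default) for col, default in defaults.items()}

rows = list(read_json_lines(['{"store_id": 1}', '{}'], schema_defaults))
assert rows == [
    {"store_id": 1, "state_province": "NL"},
    {"store_id": 4, "state_province": "NL"},
]
```

Note the sketch distinguishes "key absent" (gets the default) from "key present with null" (a real engine must decide that case separately; `dict.get` here treats an explicit `null` as a value, not as absence).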
[jira] [Commented] (SPARK-39210) Provide query context of Decimal overflow in AVG when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538310#comment-17538310 ] Apache Spark commented on SPARK-39210: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/36582 > Provide query context of Decimal overflow in AVG when WSCG is off > - > > Key: SPARK-39210 > URL: https://issues.apache.org/jira/browse/SPARK-39210 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39210) Provide query context of Decimal overflow in AVG when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39210: Assignee: Gengliang Wang (was: Apache Spark) > Provide query context of Decimal overflow in AVG when WSCG is off > - > > Key: SPARK-39210 > URL: https://issues.apache.org/jira/browse/SPARK-39210 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39210) Provide query context of Decimal overflow in AVG when WSCG is off
[ https://issues.apache.org/jira/browse/SPARK-39210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39210: Assignee: Apache Spark (was: Gengliang Wang) > Provide query context of Decimal overflow in AVG when WSCG is off > - > > Key: SPARK-39210 > URL: https://issues.apache.org/jira/browse/SPARK-39210 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
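To make the failure mode concrete: averaging decimals whose intermediate sum exceeds the type's bounds raises an overflow, and the improvement is to attach the originating query fragment (the "query context") to that error. A stdlib `decimal` sketch of the overflow and a context-carrying message (the message format is invented for illustration, not Spark's):

```python
from decimal import Decimal, Overflow, getcontext

# Roughly mimic a bounded DECIMAL type by capping the exponent.
ctx = getcontext().copy()
ctx.Emax = 38
ctx.traps[Overflow] = True  # raise instead of returning Infinity

err = ""
try:
    # 1E+38 * 10 = 1E+39 exceeds Emax, so the bounded context overflows.
    ctx.multiply(Decimal(10) ** 38, Decimal(10))
except Overflow:
    # A Spark-style message would name the offending expression.
    err = "Decimal overflow (query context: 'AVG(balance)')"

assert "query context" in err
```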
[jira] [Commented] (SPARK-39054) GroupByTest failed due to axis Length mismatch
[ https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538309#comment-17538309 ] Apache Spark commented on SPARK-39054: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/36581 > GroupByTest failed due to axis Length mismatch > -- > > Key: SPARK-39054 > URL: https://issues.apache.org/jira/browse/SPARK-39054 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > {code:java} > An error occurred while calling o27083.getResult. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) > at > org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97) > at > org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93) > at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.lang.Thread.run(Thread.java:750) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 808.0 (TID 650) (localhost executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, > in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, > in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 343, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 84, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 336, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, > in mapper > return f(keys, vals) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, > in > return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, > in wrapped > result = f(pd.concat(value_series, axis=1)) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in > wrapper > return f(*args, **kwargs) > File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in > rename_output > pdf.columns = return_schema.names > File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line > 5588, in __setattr__ > return object.__setattr__(self, name, value) > File "pandas/_libs/properties.pyx", line 70, in > pandas._libs.properties.AxisProperty.__set__ > File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line > 769, in _set_axis > self._mgr.set_axis(axis, labels) > File > "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", > line 214, in set_axis > self._validate_set_axis(axis, new_labels) > File > "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", line > 69, in 
_validate_set_axis > raise ValueError( > ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 > elements {code} > > GroupByTest.test_apply_with_new_dataframe_without_shortcut -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39054) GroupByTest failed due to axis Length mismatch
[ https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39054: Assignee: (was: Apache Spark) > GroupByTest failed due to axis Length mismatch > -- > > Key: SPARK-39054 > URL: https://issues.apache.org/jira/browse/SPARK-39054 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > {code:java} > An error occurred while calling o27083.getResult. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) > at > org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97) > at > org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93) > at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.lang.Thread.run(Thread.java:750) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 808.0 (TID 650) (localhost executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, > in main > process() > File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, > in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 343, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 84, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 336, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, > in mapper > return f(keys, vals) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, > in > return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, > in wrapped > result = f(pd.concat(value_series, axis=1)) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in > wrapper > return f(*args, **kwargs) > File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in > rename_output > pdf.columns = return_schema.names > File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line > 5588, in __setattr__ > return object.__setattr__(self, name, value) > File "pandas/_libs/properties.pyx", line 70, in > pandas._libs.properties.AxisProperty.__set__ > File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line > 769, in _set_axis > self._mgr.set_axis(axis, labels) > File > "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", > line 214, in set_axis > self._validate_set_axis(axis, new_labels) > File > "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", line > 69, in _validate_set_axis > raise ValueError( > ValueError: Length mismatch: Expected axis has 3 elements, new values 
have 2 > elements {code} > > GroupByTest.test_apply_with_new_dataframe_without_shortcut -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39054) GroupByTest failed due to axis Length mismatch
[ https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39054: Assignee: Apache Spark > GroupByTest failed due to axis Length mismatch > -- > > Key: SPARK-39054 > URL: https://issues.apache.org/jira/browse/SPARK-39054 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > {code:java} > An error occurred while calling o27083.getResult. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) > at > org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97) > at > org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93) > at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.lang.Thread.run(Thread.java:750) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 808.0 (TID 650) (localhost executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, > in main > process() > File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, > in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 343, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 84, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 336, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, > in mapper > return f(keys, vals) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, > in > return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))] > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, > in wrapped > result = f(pd.concat(value_series, axis=1)) > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in > wrapper > return f(*args, **kwargs) > File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in > rename_output > pdf.columns = return_schema.names > File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line > 5588, in __setattr__ > return object.__setattr__(self, name, value) > File "pandas/_libs/properties.pyx", line 70, in > pandas._libs.properties.AxisProperty.__set__ > File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line > 769, in _set_axis > self._mgr.set_axis(axis, labels) > File > "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", > line 214, in set_axis > self._validate_set_axis(axis, new_labels) > File > "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", line > 69, in _validate_set_axis > raise ValueError( > ValueError: Length mismatch: Expected axis has 3 elements, new values 
have 2 > elements {code} > > GroupByTest.test_apply_with_new_dataframe_without_shortcut -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
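The pandas error at the bottom of this trace can be reproduced outside Spark: the root cause is assigning a list of column names whose length differs from the DataFrame's column count. A minimal sketch with plain pandas (the column names and shape here are illustrative, not taken from the failing test):

```python
import pandas as pd

# A frame with 3 columns, mirroring the "Expected axis has 3 elements" side.
pdf = pd.DataFrame({"x": [1], "y": [2], "z": [3]})

try:
    # Assigning only 2 names to 3 columns raises the same ValueError
    # ("Length mismatch: ...") seen in the traceback above.
    pdf.columns = ["a", "b"]
    mismatch_raised = False
except ValueError as e:
    mismatch_raised = True
    print(e)
```

In the Spark test, `return_schema.names` plays the role of the too-short name list when the output schema no longer matches the pandas DataFrame produced by the UDF.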
[jira] [Created] (SPARK-39210) Provide query context of Decimal overflow in AVG when WSCG is off
Gengliang Wang created SPARK-39210: -- Summary: Provide query context of Decimal overflow in AVG when WSCG is off Key: SPARK-39210 URL: https://issues.apache.org/jira/browse/SPARK-39210 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.1 Reporter: Gengliang Wang Assignee: Gengliang Wang
[jira] [Resolved] (SPARK-39208) Fix query context bugs in decimal overflow under codegen mode
[ https://issues.apache.org/jira/browse/SPARK-39208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-39208. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36577 [https://github.com/apache/spark/pull/36577] > Fix query context bugs in decimal overflow under codegen mode > - > > Key: SPARK-39208 > URL: https://issues.apache.org/jira/browse/SPARK-39208 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > >
[jira] [Created] (SPARK-39209) Error occurs when cast a big enough long to timestamp in ANSI mode
chong created SPARK-39209: - Summary: Error occurs when cast a big enough long to timestamp in ANSI mode Key: SPARK-39209 URL: https://issues.apache.org/jira/browse/SPARK-39209 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Environment: Spark 3.3.0 Reporter: chong An error occurs when casting a sufficiently large long to a timestamp in ANSI mode; according to the code in Cast.scala, it should instead return the maximum timestamp: {code:java} private[this] def longToTimestamp(t: Long): Long = SECONDS.toMicros(t) // the logic of SECONDS.toMicros is: static long x(long d, long m, long over) { if (d > Long.MAX_VALUE / 100L) return Long.MAX_VALUE; if (d < -(Long.MAX_VALUE / 100L)) return Long.MIN_VALUE; return d * m; }{code} Steps to reproduce: {code:java} $SPARK_HOME/bin/spark-shell import spark.implicits._ val df = Seq((Long.MaxValue / 100) + 1).toDF("a") df.selectExpr("cast(a as timestamp)").collect() // the result is correct Array[org.apache.spark.sql.Row] = Array([294247-01-10 12:00:54.775807]) import org.apache.spark.sql.types._ import org.apache.spark.sql.Row val schema = StructType(Array(StructField("a", LongType))) val data = Seq(Row((Long.MaxValue / 100) + 1)) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.selectExpr("cast(a as timestamp)").collect() // error occurs: java.lang.RuntimeException: Error while decoding: java.lang.ArithmeticException: long overflow createexternalrow(staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, ObjectType(class java.sql.Timestamp), toJavaTimestamp, input[0, timestamp, true], true, false, true), StructField(a,TimestampType,true)) at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1157) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:184) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:172) at
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3864) at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3119) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3855) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3853) at org.apache.spark.sql.Dataset.collect(Dataset.scala:3119) ... 
55 elided Caused by: java.lang.ArithmeticException: long overflow at java.lang.Math.multiplyExact(Math.java:892) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.millisToMicros(DateTimeUtils.scala:240) at org.apache.spark.sql.catalyst.util.RebaseDateTime$.rebaseGregorianToJulianMicros(RebaseDateTime.scala:370) at org.apache.spark.sql.catalyst.util.RebaseDateTime$.rebaseGregorianToJulianMicros(RebaseDateTime.scala:390) at org.apache.spark.sql.catalyst.util.RebaseDateTime$.rebaseGregorianToJulianMicros(RebaseDateTime.scala:411) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaTimestamp(DateTimeUtils.scala:162) at org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaTimestamp(DateTimeUtils.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:181) ... 73 more {code}
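The asymmetry between the two code paths above comes down to saturating versus exact arithmetic: the cast itself clamps on overflow, while the later `Math.multiplyExact` call in `millisToMicros` throws. A hedged Python sketch of the JDK's saturating `TimeUnit` conversion (function and variable names are ours, not Spark's or the JDK's):

```python
# 64-bit signed long bounds, matching Java's Long.MAX_VALUE / Long.MIN_VALUE.
LONG_MAX = 2**63 - 1
LONG_MIN = -(2**63)

def seconds_to_micros_saturating(d, m=1_000_000):
    """Mirror of TimeUnit's x(d, m, over): clamp instead of overflowing."""
    over = LONG_MAX // m
    if d > over:
        return LONG_MAX   # saturate high, as the cast does
    if d < -over:
        return LONG_MIN   # saturate low
    return d * m

# The value from the reproduction: (Long.MaxValue / 100) + 1 seconds.
s = LONG_MAX // 100 + 1
saturated = seconds_to_micros_saturating(s)
```

Here `saturated` is `LONG_MAX`, i.e. the cast yields the maximum timestamp; the exception only appears when a subsequent step redoes the multiplication with overflow checking and no clamp.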
[jira] [Updated] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-39102: - Issue Type: Improvement (was: Bug) > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which have vulnerabilities. I think it's better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite >
[jira] [Assigned] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-39102: Assignee: Yang Jie > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Assignee: Yang Jie >Priority: Minor > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which have vulnerabilities. I think it's better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite >
[jira] [Resolved] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39102. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36529 [https://github.com/apache/spark/pull/36529] > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which have vulnerabilities. I think it's better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite >
[jira] [Assigned] (SPARK-39196) Replace getOrElse(null) with orNull
[ https://issues.apache.org/jira/browse/SPARK-39196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-39196: Assignee: qian > Replace getOrElse(null) with orNull > --- > > Key: SPARK-39196 > URL: https://issues.apache.org/jira/browse/SPARK-39196 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.3.0 >Reporter: qian >Assignee: qian >Priority: Major > > Code Simplification. Replace _getOrElse(null)_ with _orNull_
[jira] [Resolved] (SPARK-39196) Replace getOrElse(null) with orNull
[ https://issues.apache.org/jira/browse/SPARK-39196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39196. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36567 [https://github.com/apache/spark/pull/36567] > Replace getOrElse(null) with orNull > --- > > Key: SPARK-39196 > URL: https://issues.apache.org/jira/browse/SPARK-39196 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.3.0 >Reporter: qian >Assignee: qian >Priority: Major > Fix For: 3.4.0 > > > Code Simplification. Replace _getOrElse(null)_ with _orNull_
[jira] [Updated] (SPARK-39196) Replace getOrElse(null) with orNull
[ https://issues.apache.org/jira/browse/SPARK-39196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-39196: - Priority: Trivial (was: Major) > Replace getOrElse(null) with orNull > --- > > Key: SPARK-39196 > URL: https://issues.apache.org/jira/browse/SPARK-39196 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.3.0 >Reporter: qian >Assignee: qian >Priority: Trivial > Fix For: 3.4.0 > > > Code Simplification. Replace _getOrElse(null)_ with _orNull_
[jira] [Commented] (SPARK-39167) Throw an exception w/ an error class for multiple rows from a subquery used as an expression
[ https://issues.apache.org/jira/browse/SPARK-39167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538196#comment-17538196 ] Apache Spark commented on SPARK-39167: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36580 > Throw an exception w/ an error class for multiple rows from a subquery used > as an expression > > > Key: SPARK-39167 > URL: https://issues.apache.org/jira/browse/SPARK-39167 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Users can trigger an illegal state exception by the SQL statement: > {code:sql} > > select (select a from (select 1 as a union all select 2 as a) t) as b > {code} > {code:java} > Caused by: java.lang.IllegalStateException: more than one row returned by a > subquery used as an expression: > Subquery subquery#242, [id=#100] > +- AdaptiveSparkPlan isFinalPlan=true >+- == Final Plan == > Union > :- *(1) Project [1 AS a#240] > : +- *(1) Scan OneRowRelation[] > +- *(2) Project [2 AS a#241] > +- *(2) Scan OneRowRelation[] >+- == Initial Plan == > Union > :- Project [1 AS a#240] > : +- Scan OneRowRelation[] > +- Project [2 AS a#241] > +- Scan OneRowRelation[] > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:83) > {code} > but such exceptions are not supposed to be visible to users. Need to > introduce an error class (or re-use an existing one), and replace the > IllegalStateException.
[jira] [Commented] (SPARK-39167) Throw an exception w/ an error class for multiple rows from a subquery used as an expression
[ https://issues.apache.org/jira/browse/SPARK-39167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538193#comment-17538193 ] Apache Spark commented on SPARK-39167: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36580 > Throw an exception w/ an error class for multiple rows from a subquery used > as an expression > > > Key: SPARK-39167 > URL: https://issues.apache.org/jira/browse/SPARK-39167 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Users can trigger an illegal state exception by the SQL statement: > {code:sql} > > select (select a from (select 1 as a union all select 2 as a) t) as b > {code} > {code:java} > Caused by: java.lang.IllegalStateException: more than one row returned by a > subquery used as an expression: > Subquery subquery#242, [id=#100] > +- AdaptiveSparkPlan isFinalPlan=true >+- == Final Plan == > Union > :- *(1) Project [1 AS a#240] > : +- *(1) Scan OneRowRelation[] > +- *(2) Project [2 AS a#241] > +- *(2) Scan OneRowRelation[] >+- == Initial Plan == > Union > :- Project [1 AS a#240] > : +- Scan OneRowRelation[] > +- Project [2 AS a#241] > +- Scan OneRowRelation[] > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:83) > {code} > but such exceptions are not supposed to be visible to users. Need to > introduce an error class (or re-use an existing one), and replace the > IllegalStateException.
[jira] [Assigned] (SPARK-39167) Throw an exception w/ an error class for multiple rows from a subquery used as an expression
[ https://issues.apache.org/jira/browse/SPARK-39167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39167: Assignee: Apache Spark > Throw an exception w/ an error class for multiple rows from a subquery used > as an expression > > > Key: SPARK-39167 > URL: https://issues.apache.org/jira/browse/SPARK-39167 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Users can trigger an illegal state exception by the SQL statement: > {code:sql} > > select (select a from (select 1 as a union all select 2 as a) t) as b > {code} > {code:java} > Caused by: java.lang.IllegalStateException: more than one row returned by a > subquery used as an expression: > Subquery subquery#242, [id=#100] > +- AdaptiveSparkPlan isFinalPlan=true >+- == Final Plan == > Union > :- *(1) Project [1 AS a#240] > : +- *(1) Scan OneRowRelation[] > +- *(2) Project [2 AS a#241] > +- *(2) Scan OneRowRelation[] >+- == Initial Plan == > Union > :- Project [1 AS a#240] > : +- Scan OneRowRelation[] > +- Project [2 AS a#241] > +- Scan OneRowRelation[] > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:83) > {code} > but such exceptions are not supposed to be visible to users. Need to > introduce an error class (or re-use an existing one), and replace the > IllegalStateException.
[jira] [Assigned] (SPARK-39167) Throw an exception w/ an error class for multiple rows from a subquery used as an expression
[ https://issues.apache.org/jira/browse/SPARK-39167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39167: Assignee: (was: Apache Spark) > Throw an exception w/ an error class for multiple rows from a subquery used > as an expression > > > Key: SPARK-39167 > URL: https://issues.apache.org/jira/browse/SPARK-39167 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Users can trigger an illegal state exception by the SQL statement: > {code:sql} > > select (select a from (select 1 as a union all select 2 as a) t) as b > {code} > {code:java} > Caused by: java.lang.IllegalStateException: more than one row returned by a > subquery used as an expression: > Subquery subquery#242, [id=#100] > +- AdaptiveSparkPlan isFinalPlan=true >+- == Final Plan == > Union > :- *(1) Project [1 AS a#240] > : +- *(1) Scan OneRowRelation[] > +- *(2) Project [2 AS a#241] > +- *(2) Scan OneRowRelation[] >+- == Initial Plan == > Union > :- Project [1 AS a#240] > : +- Scan OneRowRelation[] > +- Project [2 AS a#241] > +- Scan OneRowRelation[] > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:83) > {code} > but such exceptions are not supposed to be visible to users. Need to > introduce an error class (or re-use an existing one), and replace the > IllegalStateException.
[jira] [Commented] (SPARK-39207) Record SQL text when executing with SparkSession.sql()
[ https://issues.apache.org/jira/browse/SPARK-39207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538096#comment-17538096 ] Apache Spark commented on SPARK-39207: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/36578 > Record SQL text when executing with SparkSession.sql() > -- > > Key: SPARK-39207 > URL: https://issues.apache.org/jira/browse/SPARK-39207 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Priority: Major >
[jira] [Assigned] (SPARK-39207) Record SQL text when executing with SparkSession.sql()
[ https://issues.apache.org/jira/browse/SPARK-39207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39207: Assignee: Apache Spark > Record SQL text when executing with SparkSession.sql() > -- > > Key: SPARK-39207 > URL: https://issues.apache.org/jira/browse/SPARK-39207 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-39207) Record SQL text when executing with SparkSession.sql()
[ https://issues.apache.org/jira/browse/SPARK-39207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39207: Assignee: (was: Apache Spark) > Record SQL text when executing with SparkSession.sql() > -- > > Key: SPARK-39207 > URL: https://issues.apache.org/jira/browse/SPARK-39207 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Priority: Major >
[jira] [Commented] (SPARK-39207) Record SQL text when executing with SparkSession.sql()
[ https://issues.apache.org/jira/browse/SPARK-39207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538094#comment-17538094 ] Apache Spark commented on SPARK-39207: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/36578 > Record SQL text when executing with SparkSession.sql() > -- > > Key: SPARK-39207 > URL: https://issues.apache.org/jira/browse/SPARK-39207 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Priority: Major >
[jira] [Commented] (SPARK-39208) Fix query context bugs in decimal overflow under codegen mode
[ https://issues.apache.org/jira/browse/SPARK-39208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538089#comment-17538089 ] Apache Spark commented on SPARK-39208: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/36577 > Fix query context bugs in decimal overflow under codegen mode > - > > Key: SPARK-39208 > URL: https://issues.apache.org/jira/browse/SPARK-39208 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Assigned] (SPARK-39208) Fix query context bugs in decimal overflow under codegen mode
[ https://issues.apache.org/jira/browse/SPARK-39208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39208: Assignee: Gengliang Wang (was: Apache Spark) > Fix query context bugs in decimal overflow under codegen mode > - > > Key: SPARK-39208 > URL: https://issues.apache.org/jira/browse/SPARK-39208 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Commented] (SPARK-39208) Fix query context bugs in decimal overflow under codegen mode
[ https://issues.apache.org/jira/browse/SPARK-39208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538088#comment-17538088 ] Apache Spark commented on SPARK-39208: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/36577 > Fix query context bugs in decimal overflow under codegen mode > - > > Key: SPARK-39208 > URL: https://issues.apache.org/jira/browse/SPARK-39208 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major >
[jira] [Assigned] (SPARK-39208) Fix query context bugs in decimal overflow under codegen mode
[ https://issues.apache.org/jira/browse/SPARK-39208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39208: Assignee: Apache Spark (was: Gengliang Wang) > Fix query context bugs in decimal overflow under codegen mode > - > > Key: SPARK-39208 > URL: https://issues.apache.org/jira/browse/SPARK-39208 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-32268) Bloom Filter Join
[ https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538087#comment-17538087 ] Apache Spark commented on SPARK-32268: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36576 > Bloom Filter Join > - > > Key: SPARK-32268 > URL: https://issues.apache.org/jira/browse/SPARK-32268 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yingyi Bu >Priority: Major > Fix For: 3.3.0 > > Attachments: q16-bloom-filter.jpg, q16-default.jpg > > > We can improve the performance of some joins by pre-filtering one side of a > join using a Bloom filter and IN predicate generated from the values from the > other side of the join. > For > example:[tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql]. > [Before this > optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg]. > [After this > optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg]. > *Query Performance Benchmarks: TPC-DS Performance Evaluation* > Our setup for running TPC-DS benchmark was as follows: TPC-DS 5T and > Partitioned Parquet table > > |Query|Default(Seconds)|Enable Bloom Filter Join(Seconds)| > |tpcds q16|84|46| > |tpcds q36|29|21| > |tpcds q57|39|28| > |tpcds q94|42|34| > |tpcds q95|306|288|
[jira] [Commented] (SPARK-32268) Bloom Filter Join
[ https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538086#comment-17538086 ] Apache Spark commented on SPARK-32268: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36576 > Bloom Filter Join > - > > Key: SPARK-32268 > URL: https://issues.apache.org/jira/browse/SPARK-32268 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yingyi Bu >Priority: Major > Fix For: 3.3.0 > > Attachments: q16-bloom-filter.jpg, q16-default.jpg > > > We can improve the performance of some joins by pre-filtering one side of a > join using a Bloom filter and IN predicate generated from the values from the > other side of the join. > For > example:[tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql]. > [Before this > optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg]. > [After this > optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg]. > *Query Performance Benchmarks: TPC-DS Performance Evaluation* > Our setup for running TPC-DS benchmark was as follows: TPC-DS 5T and > Partitioned Parquet table > > |Query|Default(Seconds)|Enable Bloom Filter Join(Seconds)| > |tpcds q16|84|46| > |tpcds q36|29|21| > |tpcds q57|39|28| > |tpcds q94|42|34| > |tpcds q95|306|288|
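The pre-filtering idea described in this issue can be illustrated with a toy in-memory model: build a Bloom filter over the build side's join keys, then drop probe-side rows that cannot possibly match before running the real join. This is a hedged sketch only; the class, hash scheme, and sizes below are illustrative and say nothing about Spark's actual implementation:

```python
import hashlib

class ToyBloomFilter:
    """Minimal Bloom filter: no false negatives, small chance of false positives."""
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0  # big int used as a bit set

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

# Build side: the small side of the join contributes its key values.
build_keys = [1, 5, 9]
bf = ToyBloomFilter()
for k in build_keys:
    bf.add(k)

# Probe side: rows whose key fails the Bloom check can never join,
# so they are discarded before the (expensive) join itself.
probe_rows = [(1, "a"), (2, "b"), (5, "c")]
prefiltered = [row for row in probe_rows if bf.might_contain(row[0])]
```

Because a Bloom filter has no false negatives, every matching row survives the pre-filter; the benchmark wins in the table above come from discarding most non-matching rows early.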
[jira] [Created] (SPARK-39208) Fix query context bugs in decimal overflow under codegen mode
Gengliang Wang created SPARK-39208: -- Summary: Fix query context bugs in decimal overflow under codegen mode Key: SPARK-39208 URL: https://issues.apache.org/jira/browse/SPARK-39208 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.1 Reporter: Gengliang Wang Assignee: Gengliang Wang
[jira] [Commented] (SPARK-38255) Enable a callable in pyspark.pandas.DataFrame.loc
[ https://issues.apache.org/jira/browse/SPARK-38255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538067#comment-17538067 ]

chandan singh commented on SPARK-38255:
---------------------------------------

Hi,

The following is the example from the pandas docs
[https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html] of a
callable that returns a boolean Series:
{code:java}
>>> df.loc[lambda df: df['shield'] == 8]
            max_speed  shield
sidewinder          7       8{code}
Below is a toy code example (the original snippet closed over the outer `df`
instead of using the callable's argument; fixed here to use the argument `x`):
{code:java}
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [4, 5, 6, 6, 8]})

def even_index(x):
    # Boolean mask selecting rows whose index label is even.
    return [i % 2 == 0 for i in x.index.values]

df.loc[lambda x: even_index(x)]
{code}

> Enable a callable in pyspark.pandas.DataFrame.loc
> -------------------------------------------------
>
>                 Key: SPARK-38255
>                 URL: https://issues.apache.org/jira/browse/SPARK-38255
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Kyle Gilde
>            Priority: Minor
>
> Hi,
> I was hoping that you would enable a callable to be used in the
> pyspark.pandas.DataFrame.loc method.
> I use a lambda function in loc all the time in my pandas code, and I was
> hoping to be able to use most of my pandas code with your new pandas API.
>
> Thank you!
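For reference, this is the stock-pandas behaviour the ticket asks pyspark.pandas to mirror: `.loc` detects a callable, calls it with the DataFrame itself, and uses the returned boolean mask (or labels) as the indexer. The data below is made up to match the pandas documentation example.

```python
import pandas as pd

df = pd.DataFrame(
    {"max_speed": [1, 4, 7], "shield": [2, 5, 8]},
    index=["cobra", "viper", "sidewinder"],
)

# The callable receives the DataFrame and must return a valid indexer,
# here a boolean Series selecting rows where shield == 8.
fast = df.loc[lambda d: d["shield"] == 8]
```

In stock pandas this selects only the `sidewinder` row; SPARK-38255 asks for the same callable dispatch in `pyspark.pandas.DataFrame.loc`.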
[jira] [Created] (SPARK-39207) Record SQL text when executing with SparkSession.sql()
Linhong Liu created SPARK-39207:
--------------------------------

             Summary: Record SQL text when executing with SparkSession.sql()
                 Key: SPARK-39207
                 URL: https://issues.apache.org/jira/browse/SPARK-39207
             Project: Spark
          Issue Type: Task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Linhong Liu
[jira] [Updated] (SPARK-37197) Behaviour inconsistency between pandas and pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yikun Jiang updated SPARK-37197:
--------------------------------
    Component/s: Pandas API on Spark

> Behaviour inconsistency between pandas and pandas API on Spark
> --------------------------------------------------------------
>
>                 Key: SPARK-37197
>                 URL: https://issues.apache.org/jira/browse/SPARK-37197
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Pandas API on Spark, PySpark
>    Affects Versions: 3.2.0
>            Reporter: Chuck Connell
>            Priority: Major
>
> This JIRA collects tickets about inconsistent behaviour between pandas and
> the pandas API on Spark.
[jira] [Updated] (SPARK-38819) Run Pandas on Spark with Pandas 1.4.x
[ https://issues.apache.org/jira/browse/SPARK-38819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yikun Jiang updated SPARK-38819:
--------------------------------
    Component/s: Pandas API on Spark

> Run Pandas on Spark with Pandas 1.4.x
> -------------------------------------
>
>                 Key: SPARK-38819
>                 URL: https://issues.apache.org/jira/browse/SPARK-38819
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Pandas API on Spark, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Priority: Major
>
> This is an umbrella ticket to track issues found when upgrading pandas to 1.4.x.
>
> With fail-fast disabled in the tests, 19 tests failed:
> [https://github.com/Yikun/spark/pull/88/checks?check_run_id=5873627048]
[jira] [Updated] (SPARK-39199) Implement pandas API missing parameters
[ https://issues.apache.org/jira/browse/SPARK-39199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yikun Jiang updated SPARK-39199:
--------------------------------
    Component/s: Pandas API on Spark

> Implement pandas API missing parameters
> ---------------------------------------
>
>                 Key: SPARK-39199
>                 URL: https://issues.apache.org/jira/browse/SPARK-39199
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Pandas API on Spark, PySpark
>    Affects Versions: 3.3.0, 3.4.0, 3.3.1
>            Reporter: Xinrong Meng
>            Priority: Major
>
> The pandas API on Spark aims for full pandas API coverage. Currently, most
> pandas functions are supported in the pandas API on Spark, but often with
> some parameters missing.
> Commonly missing parameters include:
> - NA handling: `skipna`, `dropna`
> - filtering by data type: `numeric_only`, `bool_only`
> - limiting result length: `keep`
> - reindexing the result: `ignore_index`
> These cover common use cases and should be prioritized.
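For context, this is how two of the parameter families listed in the ticket behave in stock pandas, which is the behaviour pandas-on-Spark would need to match. The sample frame is made up for illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

# `skipna`: whether NA values are ignored in a reduction.
total = df["a"].sum(skipna=True)     # NaN ignored -> 1.0 + 3.0 = 4.0
with_na = df["a"].sum(skipna=False)  # NaN propagates -> NaN

# `numeric_only`: restrict a frame-level reduction to numeric columns,
# so the string column "b" is excluded instead of raising.
means = df.mean(numeric_only=True)
```

`keep` (e.g. in `nlargest`/`drop_duplicates`) and `ignore_index` (e.g. in `sort_values`/`concat`) follow the same pattern: a small switch on an already-supported operation, which is why the ticket flags them as high-value, low-surface additions.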