[jira] [Assigned] (SPARK-39063) Remove `finalize()` from `LevelDB/RocksDBIterator`

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39063:


Assignee: Apache Spark

> Remove `finalize()` from `LevelDB/RocksDBIterator`
> --
>
> Key: SPARK-39063
> URL: https://issues.apache.org/jira/browse/SPARK-39063
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> After SPARK-38896, every `LevelDB/RocksDBIterator` handle opened by the 
> `LevelDB/RocksDB.view` method is already closed by `tryWithResource`, so the 
> `finalize()` safety net is no longer needed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39063) Remove `finalize()` from `LevelDB/RocksDBIterator`

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529779#comment-17529779
 ] 

Apache Spark commented on SPARK-39063:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36403

> Remove `finalize()` from `LevelDB/RocksDBIterator`
> --
>
> Key: SPARK-39063
> URL: https://issues.apache.org/jira/browse/SPARK-39063
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> After SPARK-38896, every `LevelDB/RocksDBIterator` handle opened by the 
> `LevelDB/RocksDB.view` method is already closed by `tryWithResource`, so the 
> `finalize()` safety net is no longer needed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39063) Remove `finalize()` from `LevelDB/RocksDBIterator`

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39063:


Assignee: (was: Apache Spark)

> Remove `finalize()` from `LevelDB/RocksDBIterator`
> --
>
> Key: SPARK-39063
> URL: https://issues.apache.org/jira/browse/SPARK-39063
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> After SPARK-38896, every `LevelDB/RocksDBIterator` handle opened by the 
> `LevelDB/RocksDB.view` method is already closed by `tryWithResource`, so the 
> `finalize()` safety net is no longer needed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39063) Remove `finalize()` from `LevelDB/RocksDBIterator`

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529780#comment-17529780
 ] 

Apache Spark commented on SPARK-39063:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36403

> Remove `finalize()` from `LevelDB/RocksDBIterator`
> --
>
> Key: SPARK-39063
> URL: https://issues.apache.org/jira/browse/SPARK-39063
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> After SPARK-38896, every `LevelDB/RocksDBIterator` handle opened by the 
> `LevelDB/RocksDB.view` method is already closed by `tryWithResource`, so the 
> `finalize()` safety net is no longer needed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39063) Remove `finalize()` from `LevelDB/RocksDBIterator`

2022-04-28 Thread Yang Jie (Jira)
Yang Jie created SPARK-39063:


 Summary: Remove `finalize()` from `LevelDB/RocksDBIterator`
 Key: SPARK-39063
 URL: https://issues.apache.org/jira/browse/SPARK-39063
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Yang Jie


After SPARK-38896, every `LevelDB/RocksDBIterator` handle opened by the 
`LevelDB/RocksDB.view` method is already closed by `tryWithResource`, so the 
`finalize()` safety net is no longer needed (a minimal sketch of the pattern follows below).
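The change boils down to the loan pattern. A minimal sketch, assuming a generic
`AutoCloseable` resource rather than the actual `KVStoreIterator` or
`Utils.tryWithResource` internals:
{code:scala}
// Hedged sketch of the try-with-resource (loan) pattern: the helper guarantees
// close() even when the body throws, which is why a finalize() safety net on the
// iterator becomes unnecessary. Names here are illustrative, not Spark internals.
def tryWithResource[R <: AutoCloseable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}

// Illustrative usage with a hypothetical iterator-like resource:
// tryWithResource(db.view(classOf[Entry]).closeableIterator()) { it =>
//   while (it.hasNext) process(it.next())
// }
{code}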



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38085) DataSource V2: Handle DELETE commands for group-based sources

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529768#comment-17529768
 ] 

Apache Spark commented on SPARK-38085:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/36402

> DataSource V2: Handle DELETE commands for group-based sources
> -
>
> Key: SPARK-38085
> URL: https://issues.apache.org/jira/browse/SPARK-38085
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.3.0
>
>
> As per SPARK-35801, we should handle DELETE statements for sources that can 
> replace groups of data (e.g. partitions, files).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39040) Respect NaNvl in EquivalentExpressions for expression elimination

2022-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39040:
---

Assignee: XiDuo You

> Respect NaNvl in EquivalentExpressions for expression elimination
> -
>
> Key: SPARK-39040
> URL: https://issues.apache.org/jira/browse/SPARK-39040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.3.0
>
>
> For example the query will fail:
> {code:java}
> set spark.sql.ansi.enabled=true;
> set 
> spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding;
> SELECT nanvl(1, 1/0 + 1/0);  {code}
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
> (TID 4) (10.221.98.68 executor driver): 
> org.apache.spark.SparkArithmeticException: divide by zero. To return NULL 
> instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false 
> (except for ANSI interval type) to bypass this error.
> == SQL(line 1, position 17) ==
> select nanvl(1 , 1/0 + 1/0)
>                  ^^^    at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:151)
>  {code}
> We should respect the evaluation order of conditional expressions, which always 
> evaluate the predicate branch first, so the query above should not fail.
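To make the intent concrete, here is a plain-Scala analogue of the problem; this is
not Spark's `EquivalentExpressions`, and integer division is used so the failure is
real outside ANSI mode:
{code:scala}
// A conditional that, like nanvl/if/coalesce, evaluates its fallback lazily.
def conditional(pred: Boolean, onTrue: => Int, onFalse: => Int): Int =
  if (pred) onTrue else onFalse

val denom = 0
// Safe: the failing expression is never evaluated because the predicate branch wins.
conditional(pred = true, onTrue = 1, onFalse = 1 / denom)        // returns 1

// What eager common-subexpression elimination would effectively do:
// val hoisted = 1 / denom      // evaluated unconditionally -> ArithmeticException
// conditional(pred = true, onTrue = 1, onFalse = hoisted)
{code}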



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39040) Respect NaNvl in EquivalentExpressions for expression elimination

2022-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39040.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 36376
[https://github.com/apache/spark/pull/36376]

> Respect NaNvl in EquivalentExpressions for expression elimination
> -
>
> Key: SPARK-39040
> URL: https://issues.apache.org/jira/browse/SPARK-39040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
> Fix For: 3.3.0
>
>
> For example the query will fail:
> {code:java}
> set spark.sql.ansi.enabled=true;
> set 
> spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding;
> SELECT nanvl(1, 1/0 + 1/0);  {code}
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
> (TID 4) (10.221.98.68 executor driver): 
> org.apache.spark.SparkArithmeticException: divide by zero. To return NULL 
> instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false 
> (except for ANSI interval type) to bypass this error.
> == SQL(line 1, position 17) ==
> select nanvl(1 , 1/0 + 1/0)
>                  ^^^    at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:151)
>  {code}
> We should respect the evaluation order of conditional expressions, which always 
> evaluate the predicate branch first, so the query above should not fail.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38993) Impl DataFrame.boxplot and DataFrame.plot.box

2022-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38993.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36317
[https://github.com/apache/spark/pull/36317]

> Impl DataFrame.boxplot and DataFrame.plot.box
> -
>
> Key: SPARK-38993
> URL: https://issues.apache.org/jira/browse/SPARK-38993
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38993) Impl DataFrame.boxplot and DataFrame.plot.box

2022-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38993:


Assignee: zhengruifeng

> Impl DataFrame.boxplot and DataFrame.plot.box
> -
>
> Key: SPARK-38993
> URL: https://issues.apache.org/jira/browse/SPARK-38993
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39054) GroupByTest failed due to axis Length mismatch

2022-04-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529751#comment-17529751
 ] 

Yikun Jiang commented on SPARK-39054:
-

[https://github.com/apache/spark/blob/973283c33ad908d071550e9be92a4fca76a8a9df/python/pyspark/pandas/groupby.py#L1377]

 

The behavior changed here.

> GroupByTest failed due to axis Length mismatch
> --
>
> Key: SPARK-39054
> URL: https://issues.apache.org/jira/browse/SPARK-39054
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> An error occurred while calling o27083.getResult.
> : org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
>   at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
>   at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
>   at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 808.0 (TID 650) (localhost executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 343, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 84, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 336, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, 
> in 
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, 
> in wrapped
> result = f(pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in 
> wrapper
> return f(*args, **kwargs)
>   File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in 
> rename_output
> pdf.columns = return_schema.names
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
> 5588, in __setattr__
> return object.__setattr__(self, name, value)
>   File "pandas/_libs/properties.pyx", line 70, in 
> pandas._libs.properties.AxisProperty.__set__
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
> 769, in _set_axis
> self._mgr.set_axis(axis, labels)
>   File 
> "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", 
> line 214, in set_axis
> self._validate_set_axis(axis, new_labels)
>   File 
> "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", line 
> 69, in _validate_set_axis
> raise ValueError(
> ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 
> elements {code}
>  
> GroupByTest.test_apply_with_new_dataframe_without_shortcut



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39054) GroupByTest failed due to axis Length mismatch

2022-04-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529750#comment-17529750
 ] 

Yikun Jiang commented on SPARK-39054:
-

https://github.com/pandas-dev/pandas/issues/46893

> GroupByTest failed due to axis Length mismatch
> --
>
> Key: SPARK-39054
> URL: https://issues.apache.org/jira/browse/SPARK-39054
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> An error occurred while calling o27083.getResult.
> : org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
>   at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
>   at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
>   at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 808.0 (TID 650) (localhost executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 343, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 84, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 336, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, 
> in 
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, 
> in wrapped
> result = f(pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in 
> wrapper
> return f(*args, **kwargs)
>   File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in 
> rename_output
> pdf.columns = return_schema.names
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
> 5588, in __setattr__
> return object.__setattr__(self, name, value)
>   File "pandas/_libs/properties.pyx", line 70, in 
> pandas._libs.properties.AxisProperty.__set__
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
> 769, in _set_axis
> self._mgr.set_axis(axis, labels)
>   File 
> "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", 
> line 214, in set_axis
> self._validate_set_axis(axis, new_labels)
>   File 
> "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", line 
> 69, in _validate_set_axis
> raise ValueError(
> ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 
> elements {code}
>  
> GroupByTest.test_apply_with_new_dataframe_without_shortcut



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529737#comment-17529737
 ] 

Apache Spark commented on SPARK-39035:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/36401

> Add tests for options from `to_csv` and `from_csv`.
> ---
>
> Key: SPARK-39035
> URL: https://issues.apache.org/jira/browse/SPARK-39035
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are many supported options for `to_csv` and `from_csv` 
> (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options to 
> `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.
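As a rough illustration of the kind of coverage the description above asks for, a
sketch that exercises one CSV option end to end. It assumes a running SparkSession
named `spark`; the `sep` option and the exact assertions are only examples:
{code:scala}
import org.apache.spark.sql.functions.{from_csv, lit, struct, to_csv}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType().add("a", IntegerType).add("b", StringType)

// from_csv with a non-default separator.
val parsed = spark.range(1)
  .select(from_csv(lit("1|x"), schema, Map("sep" -> "|")).as("row"))

// to_csv with the same option (the Scala to_csv overload takes a Java map).
val rendered = spark.range(1)
  .select(to_csv(struct(lit(1).as("a"), lit("x").as("b")),
    java.util.Collections.singletonMap("sep", "|")).as("csv"))

assert(parsed.head.getStruct(0).getInt(0) == 1)
assert(rendered.head.getString(0) == "1|x")
{code}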



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39035:


Assignee: (was: Apache Spark)

> Add tests for options from `to_csv` and `from_csv`.
> ---
>
> Key: SPARK-39035
> URL: https://issues.apache.org/jira/browse/SPARK-39035
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are many supported options for `to_csv` and `from_csv` 
> (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options to 
> `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529736#comment-17529736
 ] 

Apache Spark commented on SPARK-39035:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/36401

> Add tests for options from `to_csv` and `from_csv`.
> ---
>
> Key: SPARK-39035
> URL: https://issues.apache.org/jira/browse/SPARK-39035
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are many supported options for `to_csv` and `from_csv` 
> (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options to 
> `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39035:


Assignee: Apache Spark

> Add tests for options from `to_csv` and `from_csv`.
> ---
>
> Key: SPARK-39035
> URL: https://issues.apache.org/jira/browse/SPARK-39035
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> There are many supported options for `to_csv` and `from_csv` 
> (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options to 
> `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38748) Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38748:


Assignee: Apache Spark

> Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH
> 
>
> Key: SPARK-38748
> URL: https://issues.apache.org/jira/browse/SPARK-38748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *PIVOT_VALUE_DATA_TYPE_MISMATCH* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def pivotValDataTypeMismatchError(pivotVal: Expression, pivotCol: 
> Expression): Throwable = {
> new AnalysisException(
>   errorClass = "PIVOT_VALUE_DATA_TYPE_MISMATCH",
>   messageParameters = Array(
> pivotVal.toString, pivotVal.dataType.simpleString, 
> pivotCol.dataType.catalogString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class
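A rough sketch of the requested test shape. The DataFrame, the mismatched pivot
value, and the commented-out assertions are placeholders; it assumes the usual
QueryCompilationErrorsSuite imports and implicits, and the real test should follow
the linked UNSUPPORTED_FEATURE example and the error-classes.json entry:
{code:scala}
test("PIVOT_VALUE_DATA_TYPE_MISMATCH") {
  val e = intercept[AnalysisException] {
    df.groupBy($"year")
      // Placeholder: a pivot value whose data type does not match the pivot column.
      .pivot($"course", Seq(struct(lit(1))))
      .agg(sum($"earnings"))
      .collect()
  }
  // Checks required by this ticket:
  assert(e.getErrorClass === "PIVOT_VALUE_DATA_TYPE_MISMATCH")   // the error class
  // compare e.getMessage with the entire expected message
  // compare e.getSqlState with the value in error-classes.json, if defined
}
{code}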



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38748) Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38748:


Assignee: (was: Apache Spark)

> Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH
> 
>
> Key: SPARK-38748
> URL: https://issues.apache.org/jira/browse/SPARK-38748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *PIVOT_VALUE_DATA_TYPE_MISMATCH* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def pivotValDataTypeMismatchError(pivotVal: Expression, pivotCol: 
> Expression): Throwable = {
> new AnalysisException(
>   errorClass = "PIVOT_VALUE_DATA_TYPE_MISMATCH",
>   messageParameters = Array(
> pivotVal.toString, pivotVal.dataType.simpleString, 
> pivotCol.dataType.catalogString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38748) Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529726#comment-17529726
 ] 

Apache Spark commented on SPARK-38748:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36400

> Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH
> 
>
> Key: SPARK-38748
> URL: https://issues.apache.org/jira/browse/SPARK-38748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *PIVOT_VALUE_DATA_TYPE_MISMATCH* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def pivotValDataTypeMismatchError(pivotVal: Expression, pivotCol: 
> Expression): Throwable = {
> new AnalysisException(
>   errorClass = "PIVOT_VALUE_DATA_TYPE_MISMATCH",
>   messageParameters = Array(
> pivotVal.toString, pivotVal.dataType.simpleString, 
> pivotCol.dataType.catalogString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38748) Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529725#comment-17529725
 ] 

Apache Spark commented on SPARK-38748:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36400

> Test the error class: PIVOT_VALUE_DATA_TYPE_MISMATCH
> 
>
> Key: SPARK-38748
> URL: https://issues.apache.org/jira/browse/SPARK-38748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *PIVOT_VALUE_DATA_TYPE_MISMATCH* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def pivotValDataTypeMismatchError(pivotVal: Expression, pivotCol: 
> Expression): Throwable = {
> new AnalysisException(
>   errorClass = "PIVOT_VALUE_DATA_TYPE_MISMATCH",
>   messageParameters = Array(
> pivotVal.toString, pivotVal.dataType.simpleString, 
> pivotCol.dataType.catalogString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39062) Add Standalone backend support for Stage Level Scheduling

2022-04-28 Thread huangtengfei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529717#comment-17529717
 ] 

huangtengfei commented on SPARK-39062:
--

I am working on this. Thanks [~jiangxb1987]

> Add Standalone backend support for Stage Level Scheduling
> -
>
> Key: SPARK-39062
> URL: https://issues.apache.org/jira/browse/SPARK-39062
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> We should add Standalone backend support for Stage Level Scheduling:
> * The Master should be able to generate executors for multiple 
> ResourceProfiles; currently it only considers available CPUs;
> * The Worker needs to let the executor know about its ResourceProfile.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39062) Add Standalone backend support for Stage Level Scheduling

2022-04-28 Thread Xingbo Jiang (Jira)
Xingbo Jiang created SPARK-39062:


 Summary: Add Standalone backend support for Stage Level Scheduling
 Key: SPARK-39062
 URL: https://issues.apache.org/jira/browse/SPARK-39062
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0, 3.3.0
Reporter: Xingbo Jiang


We should add Standalone backend support for Stage Level Scheduling:
* The Master should be able to generate executors for multiple ResourceProfiles; 
currently it only considers available CPUs;
* The Worker needs to let the executor know about its ResourceProfile (see the 
sketch below).
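For context, this is the user-facing stage-level scheduling API whose resource
requests the Standalone Master/Worker would have to honor. A minimal sketch,
assuming an existing SparkContext `sc`; the resource amounts are arbitrary:
{code:scala}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Ask for executors with 4 cores / 6g memory, and tasks that each take 2 cores.
val execReqs = new ExecutorResourceRequests().cores(4).memory("6g")
val taskReqs = new TaskResourceRequests().cpus(2)
val profile  = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

// Stages computed from this RDD should run on executors matching `profile`;
// SPARK-39062 is about teaching the Standalone Master/Worker to provision them.
val rdd = sc.parallelize(1 to 100).map(_ * 2).withResources(profile)
rdd.count()
{code}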



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-04-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-39061:
--
Description: 
The following query returns incorrect results:
{noformat}
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
1   2
-1  -1
Time taken: 4.053 seconds, Fetched 2 row(s)
spark-sql>
{noformat}
In Hive, the last row is {{NULL, NULL}}:
{noformat}
Beeline version 2.3.9 by Apache Hive
0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 'b', 
2), null));
+---+---+
|   a   |   b   |
+---+---+
| 1 | 2 |
| NULL  | NULL  |
+---+---+
2 rows selected (1.355 seconds)
0: jdbc:hive2://localhost:1> 
{noformat}
If the struct has string fields, you get a {{NullPointerException}}:
{noformat}
spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NullPointerException: null
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
 ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
{noformat}

You can work around the issue by casting the null entry of the array:
{noformat}
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
struct)));
1   2
NULLNULL
Time taken: 0.068 seconds, Fetched 2 row(s)
spark-sql>
{noformat}

As far as I can tell, this issue only happens with arrays of structs where the 
structs are created in an inline table or in a projection.

The fields of the struct are not getting set to {{nullable = true}} when there 
is no example in the array where the field is set to {{null}}. As a result, 
{{GenerateUnsafeProjection.createCode}} generates bad code: it has no code to 
create a row of null columns, so it just creates a row from variables set with 
default values.

  was:
The following query returns incorrect results:
{noformat}
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
1   2
-1  -1
Time taken: 4.053 seconds, Fetched 2 row(s)
spark-sql>
{noformat}
In Hive, the last row is {{NULL, NULL}}:
{noformat}
Beeline version 2.3.9 by Apache Hive
0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 'b', 
2), null));
+---+---+
|   a   |   b   |
+---+---+
| 1 | 2 |
| NULL  | NULL  |
+---+---+
2 rows selected (1.355 seconds)
0: jdbc:hive2://localhost:1> 
{noformat}
If the struct has string fields, you get a {{NullPointerException}}:
{noformat}
spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NullPointerException: null
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
 ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
{noformat}
(Note: In Spark 3.1.3, both examples result in NPE).

You can work around the issue by casting the null entry of the array:
{noformat}
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
struct)));
1   2
NULLNULL
Time taken: 0.068 seconds, Fetched 2 row(s)
spark-sql>
{noformat}
(Note: In Spark 3.1.3, the above workaround does not work).

As far as I can tell, this issue only happens with arrays of structs where the 
structs are created in an inline table or in a projection.

The fields of the struct are not getting set to {{nullable = true}} when there 
is no example in the array where the field is set to {{null}}. As a result, 
{{GenerateUnsafeProjection.createCode}} generates bad code: it has no code to 
create a row of null columns, so it just creates a row from variables set with 
default values.


> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: 

[jira] [Updated] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-04-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-39061:
--
Affects Version/s: (was: 3.1.3)

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1-1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> (Note: In Spark 3.1.3, both examples result in NPE).
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> (Note: In Spark 3.1.3, the above workaround does not work).
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39058) Add `getInputSignature` and `getOutputSignature` APIs for spark ML models/transformers

2022-04-28 Thread Weichen Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529703#comment-17529703
 ] 

Weichen Xu commented on SPARK-39058:


[~jasbali] has volunteered to contribute this feature, so I will assign this ticket 
to him.
[~ruifengz] Would you help review when you have time? Thank you.

> Add `getInputSignature` and `getOutputSignature` APIs for spark ML 
> models/transformers
> --
>
> Key: SPARK-39058
> URL: https://issues.apache.org/jira/browse/SPARK-39058
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Add `getInputSignature` and `getOutputSignature` APIs for spark ML 
> models/transformers:
> The `getInputSignature` API returns a list of column names and types that 
> represent the ML model / transformer's input columns.
> The `getOutputSignature` API returns a list of column names and types that 
> represent the ML model / transformer's output columns.
> These two APIs are useful in third-party libraries such as MLflow, which 
> require information about the model's input / output signature.
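A rough sketch of what the proposed surface could look like. This paraphrases the
ticket and is not an existing Spark ML interface; the type and method names are
assumptions:
{code:scala}
import org.apache.spark.sql.types.DataType

// Hypothetical signature entry: a column name plus its Spark SQL data type.
case class ColumnSignature(name: String, dataType: DataType)

// Hypothetical mix-in for ML models/transformers.
trait HasSignature {
  /** Columns the model/transformer consumes. */
  def getInputSignature: Seq[ColumnSignature]
  /** Columns the model/transformer produces. */
  def getOutputSignature: Seq[ColumnSignature]
}
{code}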



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-04-28 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529702#comment-17529702
 ] 

Bruce Robbins edited comment on SPARK-39061 at 4/29/22 12:33 AM:
-

Btw, dataframe example:
{noformat}
scala> val df = Seq((1)).toDF.withColumn("c1", array(struct(lit(1).alias("a"), 
lit(2).alias("b")), lit(null)))
df: org.apache.spark.sql.DataFrame = [value: int, c1: 
array>]

scala> df.selectExpr("inline(c1)").collect
res3: Array[org.apache.spark.sql.Row] = Array([1,2], [-1,-1])
{noformat}


was (Author: bersprockets):
Btw, dataframe example:
{noformat}
scala> val df = Seq((1)).toDF.withColumn("b", array(struct(lit(1).alias("a"), 
lit(2).alias("a")), lit(null)))
df: org.apache.spark.sql.DataFrame = [value: int, b: array>]

scala> df.selectExpr("inline(b)").collect
res2: Array[org.apache.spark.sql.Row] = Array([1,2], [-1,-1])
{noformat}

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1-1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> (Note: In Spark 3.1.3, both examples result in NPE).
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> (Note: In Spark 3.1.3, the above workaround does not work).
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-04-28 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529702#comment-17529702
 ] 

Bruce Robbins commented on SPARK-39061:
---

Btw, dataframe example:
{noformat}
scala> val df = Seq((1)).toDF.withColumn("b", array(struct(lit(1).alias("a"), 
lit(2).alias("a")), lit(null)))
df: org.apache.spark.sql.DataFrame = [value: int, b: array>]

scala> df.selectExpr("inline(b)").collect
res2: Array[org.apache.spark.sql.Row] = Array([1,2], [-1,-1])
{noformat}

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1-1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> (Note: In Spark 3.1.3, both examples result in NPE).
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> (Note: In Spark 3.1.3, the above workaround does not work).
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-04-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-39061:
-

 Summary: Incorrect results or NPE when using Inline function 
against an array of dynamically created structs
 Key: SPARK-39061
 URL: https://issues.apache.org/jira/browse/SPARK-39061
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.1.3, 3.3.0, 3.4.0
Reporter: Bruce Robbins


The following query returns incorrect results:
{noformat}
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
1   2
-1  -1
Time taken: 4.053 seconds, Fetched 2 row(s)
spark-sql>
{noformat}
In Hive, the last row is {{NULL, NULL}}:
{noformat}
Beeline version 2.3.9 by Apache Hive
0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 'b', 
2), null));
+---+---+
|   a   |   b   |
+---+---+
| 1 | 2 |
| NULL  | NULL  |
+---+---+
2 rows selected (1.355 seconds)
0: jdbc:hive2://localhost:1> 
{noformat}
If the struct has string fields, you get a {{NullPointerException}}:
{noformat}
spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NullPointerException: null
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
 ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) ~[?:?]
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
{noformat}
(Note: In Spark 3.1.3, both examples result in NPE).

You can work around the issue by casting the null entry of the array:
{noformat}
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
struct)));
1   2
NULLNULL
Time taken: 0.068 seconds, Fetched 2 row(s)
spark-sql>
{noformat}
(Note: In Spark 3.1.3, the above workaround does not work).

As far as I can tell, this issue only happens with arrays of structs where the 
structs are created in an inline table or in a projection.

The fields of the struct are not getting set to {{nullable = true}} when there 
is no example in the array where the field is set to {{null}}. As a result, 
{{GenerateUnsafeProjection.createCode}} generates bad code: it has no code to 
create a row of null columns, so it just creates a row from variables set with 
default values.
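A small sketch that surfaces the root cause described above using only public APIs:
inspect the nullability that Spark infers for the struct fields inside the array
(assumes `spark.implicits._` is in scope; column and field names mirror the example,
and printing `false` is the reported bug, not guaranteed behavior):
{code:scala}
import org.apache.spark.sql.functions.{array, lit, struct}
import org.apache.spark.sql.types.{ArrayType, StructType}

val df = Seq(1).toDF("value")
  .withColumn("c1", array(struct(lit(1).alias("a"), lit(2).alias("b")), lit(null)))

// Inspect the inferred nullability of the struct fields inside the array.
val elemType = df.schema("c1").dataType
  .asInstanceOf[ArrayType].elementType.asInstanceOf[StructType]
elemType.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
// Per this report, both fields print nullable = false, so the generated
// unsafe projection never emits a null row and falls back to default values.
{code}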



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35739) [Spark Sql] Add Java-comptable Dataset.join overloads

2022-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35739.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36343
[https://github.com/apache/spark/pull/36343]

> [Spark Sql] Add Java-comptable Dataset.join overloads
> -
>
> Key: SPARK-35739
> URL: https://issues.apache.org/jira/browse/SPARK-35739
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SQL
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Brandon Dahler
>Assignee: Brandon Dahler
>Priority: Minor
> Fix For: 3.4.0
>
>
> h2. Problem
> When using Spark SQL with Java, the syntax required to use the following 
> two overloads is unnatural and not obvious to developers who haven't had to 
> interoperate with Scala before:
> {code:java}
> def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
> def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> Examples:
> Java 11 
> {code:java}
> Dataset dataset1 = ...;
> Dataset dataset2 = ...;
> // Overload with multiple usingColumns, no join type
> dataset1
>   .join(dataset2, JavaConverters.asScalaBuffer(List.of("column", "column2")))
>   .show();
> // Overload with multiple usingColumns and a join type
> dataset1
>   .join(
> dataset2,
> JavaConverters.asScalaBuffer(List.of("column", "column2")),
> "left")
>   .show();
> {code}
>  
>  Additionally, there is no overload that takes a single usingColumn and a 
> joinType, forcing the developer to use the Seq[String] overload regardless of 
> language.
> Examples:
> Scala
> {code:java}
> val dataset1 :DataFrame = ...;
> val dataset2 :DataFrame = ...;
> dataset1
>   .join(dataset2, Seq("column"), "left")
>   .show();
> {code}
>  
>  Java 11
> {code:java}
> Dataset dataset1 = ...;
> Dataset dataset2 = ...;
> dataset1
>  .join(dataset2, JavaConverters.asScalaBuffer(List.of("column")), "left")
>  .show();
> {code}
> h2. Proposed Improvement
> Add 3 additional overloads to Dataset:
>   
> {code:java}
> def join(right: Dataset[_], usingColumn: List[String]): DataFrame
> def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame
> def join(right: Dataset[_], usingColumn: List[String], joinType: String): 
> DataFrame
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35739) [Spark Sql] Add Java-comptable Dataset.join overloads

2022-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35739:


Assignee: Brandon Dahler

> [Spark Sql] Add Java-comptable Dataset.join overloads
> -
>
> Key: SPARK-35739
> URL: https://issues.apache.org/jira/browse/SPARK-35739
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SQL
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Brandon Dahler
>Assignee: Brandon Dahler
>Priority: Minor
>
> h2. Problem
> When using Spark SQL with Java, the syntax required to use the following 
> two overloads is unnatural and not obvious to developers who have not had to 
> interoperate with Scala before:
> {code:java}
> def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
> def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> Examples:
> Java 11 
> {code:java}
> Dataset dataset1 = ...;
> Dataset dataset2 = ...;
> // Overload with multiple usingColumns, no join type
> dataset1
>   .join(dataset2, JavaConverters.asScalaBuffer(List.of("column", "column2")))
>   .show();
> // Overload with multiple usingColumns and a join type
> dataset1
>   .join(
> dataset2,
> JavaConverters.asScalaBuffer(List.of("column", "column2")),
> "left")
>   .show();
> {code}
>  
>  Additionally there is no overload that takes a single usingColumn and a 
> joinType, forcing the developer to use the Seq[String] overload regardless of 
> language.
> Examples:
> Scala
> {code:java}
> val dataset1 :DataFrame = ...;
> val dataset2 :DataFrame = ...;
> dataset1
>   .join(dataset2, Seq("column"), "left")
>   .show();
> {code}
>  
>  Java 11
> {code:java}
> Dataset dataset1 = ...;
> Dataset dataset2 = ...;
> dataset1
>  .join(dataset2, JavaConverters.asScalaBuffer(List.of("column")), "left")
>  .show();
> {code}
> h2. Proposed Improvement
> Add 3 additional overloads to Dataset:
>   
> {code:java}
> def join(right: Dataset[_], usingColumns: List[String]): DataFrame
> def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame
> def join(right: Dataset[_], usingColumns: List[String], joinType: String): 
> DataFrame
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38896) Use tryWithResource to recycle KVStoreIterator

2022-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38896.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36237
[https://github.com/apache/spark/pull/36237]

> Use tryWithResource to recycle KVStoreIterator
> 
>
> Key: SPARK-38896
> URL: https://issues.apache.org/jira/browse/SPARK-38896
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> Use `Utils.tryWithResource` to recycle (close) all `KVStoreIterator` instances 
> opened by RocksDB/LevelDB.
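> As a rough illustration of the pattern (the stored type below is a placeholder, 
> not a class from the actual change):
> {code:scala}
> import org.apache.spark.util.Utils
> import org.apache.spark.util.kvstore.KVStore
> 
> // Count entries of some stored type; tryWithResource closes the KVStoreIterator
> // even if the body throws.
> def countEntries[T](store: KVStore, klass: Class[T]): Long =
>   Utils.tryWithResource(store.view(klass).closeableIterator()) { iter =>
>     var n = 0L
>     while (iter.hasNext) { iter.next(); n += 1 }
>     n
>   }
> {code}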



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38896) Use tryWithResource to recycle KVStoreIterator

2022-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-38896:


Assignee: Yang Jie

> Use tryWithResource to recycle KVStoreIterator
> 
>
> Key: SPARK-38896
> URL: https://issues.apache.org/jira/browse/SPARK-38896
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Use `Utils.tryWithResource` to recycle (close) all `KVStoreIterator` instances 
> opened by RocksDB/LevelDB.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39042) Use `Map.values()` instead of `Map.entrySet()` in scenarios that do not use `keys`

2022-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39042:


Assignee: Yang Jie

> Use `Map.values()` instead of `Map.entrySet()` in scenarios that do not use 
> `keys`
> --
>
> Key: SPARK-39042
> URL: https://issues.apache.org/jira/browse/SPARK-39042
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Some code in Spark uses `Map.entrySet()` but never reads the keys; such code 
> can use `Map.values()` instead.
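> For illustration (a generic example, not a specific Spark call site):
> {code:scala}
> import java.util.{HashMap => JHashMap}
> 
> val sizes = new JHashMap[String, Long]()
> sizes.put("a", 128L)
> sizes.put("b", 64L)
> 
> // Before: iterating entries although only the values are needed.
> var total = 0L
> sizes.entrySet().forEach(e => total += e.getValue)
> 
> // After: iterate the values directly.
> var totalValuesOnly = 0L
> sizes.values().forEach(v => totalValuesOnly += v)
> {code}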



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39042) Use `Map.values()` instead of `Map.entrySet()` in scenarios that do not use `keys`

2022-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-39042:
-
Priority: Trivial  (was: Minor)

> Use `Map.values()` instead of `Map.entrySet()` in scenarios that do not use 
> `keys`
> --
>
> Key: SPARK-39042
> URL: https://issues.apache.org/jira/browse/SPARK-39042
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
>
> Some code in Spark uses `Map.entrySet()` but never reads the keys; such code 
> can use `Map.values()` instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39042) Use `Map.values()` instead of `Map.entrySet()` in scenarios that do not use `keys`

2022-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39042.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36372
[https://github.com/apache/spark/pull/36372]

> Use `Map.values()` instead of `Map.entrySet()` in scenarios that do not use 
> `keys`
> --
>
> Key: SPARK-39042
> URL: https://issues.apache.org/jira/browse/SPARK-39042
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Some code in Spark uses `Map.entrySet()` but never reads the keys; such code 
> can use `Map.values()` instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39034) Add tests for options from `to_json` and `from_json`.

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39034:


Assignee: (was: Apache Spark)

> Add tests for options from `to_json` and `from_json`.
> -
>
> Key: SPARK-39034
> URL: https://issues.apache.org/jira/browse/SPARK-39034
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are many supported options for to_json and from_json 
> (https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options in 
> `sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala`.
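> As an illustration only (not the test added by the eventual PR), a check for one 
> documented option might look like this, assuming a suite that mixes in 
> {{SharedSparkSession}} and imports {{testImplicits._}}:
> {code:scala}
> import org.apache.spark.sql.functions.from_json
> import org.apache.spark.sql.types.{DateType, StructType}
> 
> test("from_json respects the dateFormat option") {
>   val schema = new StructType().add("d", DateType)
>   val df = Seq("""{"d": "28/04/2022"}""").toDF("value")
>   val row = df
>     .select(from_json($"value", schema, Map("dateFormat" -> "dd/MM/yyyy")).as("parsed"))
>     .selectExpr("parsed.d")
>     .head()
>   assert(row.getDate(0) === java.sql.Date.valueOf("2022-04-28"))
> }
> {code}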



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39034) Add tests for options from `to_json` and `from_json`.

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529690#comment-17529690
 ] 

Apache Spark commented on SPARK-39034:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/36399

> Add tests for options from `to_json` and `from_json`.
> -
>
> Key: SPARK-39034
> URL: https://issues.apache.org/jira/browse/SPARK-39034
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are many supported options for to_json and from_json 
> (https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options in 
> `sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39034) Add tests for options from `to_json` and `from_json`.

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39034:


Assignee: Apache Spark

> Add tests for options from `to_json` and `from_json`.
> -
>
> Key: SPARK-39034
> URL: https://issues.apache.org/jira/browse/SPARK-39034
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> There are many supported options for to_json and from_json 
> (https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options in 
> `sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39034) Add tests for options from `to_json` and `from_json`.

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529689#comment-17529689
 ] 

Apache Spark commented on SPARK-39034:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/36399

> Add tests for options from `to_json` and `from_json`.
> -
>
> Key: SPARK-39034
> URL: https://issues.apache.org/jira/browse/SPARK-39034
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are many supported options for to_json and from_json 
> (https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options in 
> `sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.

2022-04-28 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529688#comment-17529688
 ] 

Haejoon Lee commented on SPARK-39035:
-

I'm working on this

> Add tests for options from `to_csv` and `from_csv`.
> ---
>
> Key: SPARK-39035
> URL: https://issues.apache.org/jira/browse/SPARK-39035
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> There are many supported options for to_csv and from_csv 
> (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option),
>  but they are currently not tested.
> We should add tests for these options in 
> `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.
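> For illustration only (not the test added by the eventual PR), one documented 
> option could be exercised like this, again assuming a suite with 
> {{SharedSparkSession}} and {{testImplicits._}}:
> {code:scala}
> import org.apache.spark.sql.functions.from_csv
> import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
> 
> test("from_csv respects the sep option") {
>   val schema = new StructType().add("name", StringType).add("age", IntegerType)
>   val df = Seq("alice;30").toDF("value")
>   val row = df
>     .select(from_csv($"value", schema, Map("sep" -> ";")).as("parsed"))
>     .selectExpr("parsed.name", "parsed.age")
>     .head()
>   assert(row.getString(0) === "alice" && row.getInt(1) === 30)
> }
> {code}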



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38838) Support ALTER TABLE ALTER COLUMN commands with DEFAULT values

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529661#comment-17529661
 ] 

Apache Spark commented on SPARK-38838:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/36398

> Support ALTER TABLE ALTER COLUMN commands with DEFAULT values
> -
>
> Key: SPARK-38838
> URL: https://issues.apache.org/jira/browse/SPARK-38838
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38718) Test the error class: AMBIGUOUS_FIELD_NAME

2022-04-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38718.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36395
[https://github.com/apache/spark/pull/36395]

> Test the error class: AMBIGUOUS_FIELD_NAME
> --
>
> Key: SPARK-38718
> URL: https://issues.apache.org/jira/browse/SPARK-38718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Minor
>  Labels: starter
> Fix For: 3.4.0
>
>
> Add at least one test for the error class *AMBIGUOUS_FIELD_NAME* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def ambiguousFieldNameError(
>   fieldName: Seq[String], numMatches: Int, context: Origin): Throwable = {
> new AnalysisException(
>   errorClass = "AMBIGUOUS_FIELD_NAME",
>   messageParameters = Array(fieldName.quoted, numMatches.toString),
>   origin = context)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class
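> A rough sketch of the shape such a test usually takes (the triggering statement 
> and the message check below are placeholders, to be filled in from the real error 
> definition in error-classes.json; this is not the merged test):
> {code:scala}
> // Assumes a suite extending QueryTest with SharedSparkSession.
> import org.apache.spark.sql.AnalysisException
> 
> val e = intercept[AnalysisException] {
>   sql("ALTER TABLE t RENAME COLUMN point.X TO y")  // placeholder: any statement hitting the error
> }
> assert(e.getErrorClass === "AMBIGUOUS_FIELD_NAME")
> assert(e.getMessage.contains("AMBIGUOUS_FIELD_NAME"))
> {code}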



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38718) Test the error class: AMBIGUOUS_FIELD_NAME

2022-04-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38718:


Assignee: panbingkun

> Test the error class: AMBIGUOUS_FIELD_NAME
> --
>
> Key: SPARK-38718
> URL: https://issues.apache.org/jira/browse/SPARK-38718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *AMBIGUOUS_FIELD_NAME* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def ambiguousFieldNameError(
>   fieldName: Seq[String], numMatches: Int, context: Origin): Throwable = {
> new AnalysisException(
>   errorClass = "AMBIGUOUS_FIELD_NAME",
>   messageParameters = Array(fieldName.quoted, numMatches.toString),
>   origin = context)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39045) INTERNAL_ERROR for "all" internal errors

2022-04-28 Thread Serge Rielau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serge Rielau updated SPARK-39045:
-
Description: We should be able to inject the  [INTERNAL_ERROR] class for 
most cases without waiting to label the long tail on user facing error classes  
 (was: We should be able to inject the  [SYSTEM_ERROR] class for most cases 
without waiting to label the long tail on user facing error classes )

> INTERNAL_ERROR for "all" internal errors
> 
>
> Key: SPARK-39045
> URL: https://issues.apache.org/jira/browse/SPARK-39045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> We should be able to inject the [INTERNAL_ERROR] error class for most cases 
> without waiting to label the long tail of user-facing error classes.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39060) Typo in error messages of decimal overflow

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529620#comment-17529620
 ] 

Apache Spark commented on SPARK-39060:
--

User 'vli-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36397

> Typo in error messages of decimal overflow
> --
>
> Key: SPARK-39060
> URL: https://issues.apache.org/jira/browse/SPARK-39060
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Vitalii Li
>Priority: Major
>
>    org.apache.spark.SparkArithmeticException 
>    Decimal(expanded,10.1,39,1}) cannot be 
> represented as Decimal(38, 1). If necessary set spark.sql.ansi.enabled to 
> false to bypass this error.
>  
> As shown in {{decimalArithmeticOperations.sql.out}}
> Notice the extra {{}}} before ‘cannot’
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39060) Typo in error messages of decimal overflow

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39060:


Assignee: Apache Spark

> Typo in error messages of decimal overflow
> --
>
> Key: SPARK-39060
> URL: https://issues.apache.org/jira/browse/SPARK-39060
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Vitalii Li
>Assignee: Apache Spark
>Priority: Major
>
>    org.apache.spark.SparkArithmeticException 
>    Decimal(expanded,10.1,39,1}) cannot be 
> represented as Decimal(38, 1). If necessary set spark.sql.ansi.enabled to 
> false to bypass this error.
>  
> As shown in {{decimalArithmeticOperations.sql.out}}
> Notice the extra {{}}} before ‘cannot’
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39060) Typo in error messages of decimal overflow

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529619#comment-17529619
 ] 

Apache Spark commented on SPARK-39060:
--

User 'vli-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36397

> Typo in error messages of decimal overflow
> --
>
> Key: SPARK-39060
> URL: https://issues.apache.org/jira/browse/SPARK-39060
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Vitalii Li
>Priority: Major
>
>    org.apache.spark.SparkArithmeticException 
>    Decimal(expanded,10.1,39,1}) cannot be 
> represented as Decimal(38, 1). If necessary set spark.sql.ansi.enabled to 
> false to bypass this error.
>  
> As shown in {{decimalArithmeticOperations.sql.out}}
> Notice the extra {{}}} before ‘cannot’
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39060) Typo in error messages of decimal overflow

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39060:


Assignee: (was: Apache Spark)

> Typo in error messages of decimal overflow
> --
>
> Key: SPARK-39060
> URL: https://issues.apache.org/jira/browse/SPARK-39060
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Vitalii Li
>Priority: Major
>
>    org.apache.spark.SparkArithmeticException 
>    Decimal(expanded,10.1,39,1}) cannot be 
> represented as Decimal(38, 1). If necessary set spark.sql.ansi.enabled to 
> false to bypass this error.
>  
> As shown in {{decimalArithmeticOperations.sql.out}}
> Notice the extra {{}}} before ‘cannot’
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39060) Typo in error messages of decimal overflow

2022-04-28 Thread Vitalii Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalii Li updated SPARK-39060:
---
Description: 
   org.apache.spark.SparkArithmeticException 

   Decimal(expanded,10.1,39,1}) cannot be 
represented as Decimal(38, 1). If necessary set spark.sql.ansi.enabled to false 
to bypass this error.
 

As shown in {{decimalArithmeticOperations.sql.out}}

Notice the extra {{}}} before ‘cannot’


 
 
 
 

  was:
```
-- !query
select (5e36BD + 0.1) + 5e36BD
-- !query schema
struct<>
-- !query output
org.apache.spark.SparkArithmeticException
[CANNOT_CHANGE_DECIMAL_PRECISION] 
Decimal(expanded,10.1,39,1}) cannot be 
represented as Decimal(38, 1). If necessary set "spark.sql.ansi.enabled" to 
false to bypass this error.
== SQL(line 1, position 7) ==
select (5e36BD + 0.1) + 5e36BD
^^^
```
 
 
 
 
 


> Typo in error messages of decimal overflow
> --
>
> Key: SPARK-39060
> URL: https://issues.apache.org/jira/browse/SPARK-39060
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Vitalii Li
>Priority: Major
>
>    org.apache.spark.SparkArithmeticException 
>    Decimal(expanded,10.1,39,1}) cannot be 
> represented as Decimal(38, 1). If necessary set spark.sql.ansi.enabled to 
> false to bypass this error.
>  
> As shown in {{decimalArithmeticOperations.sql.out}}
> Notice the extra {{}}} before ‘cannot’
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39060) Typo in error messages of decimal overflow

2022-04-28 Thread Vitalii Li (Jira)
Vitalii Li created SPARK-39060:
--

 Summary: Typo in error messages of decimal overflow
 Key: SPARK-39060
 URL: https://issues.apache.org/jira/browse/SPARK-39060
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: Vitalii Li


```
-- !query
select (5e36BD + 0.1) + 5e36BD
-- !query schema
struct<>
-- !query output
org.apache.spark.SparkArithmeticException
[CANNOT_CHANGE_DECIMAL_PRECISION] 
Decimal(expanded,10.1,39,1}) cannot be 
represented as Decimal(38, 1). If necessary set "spark.sql.ansi.enabled" to 
false to bypass this error.
== SQL(line 1, position 7) ==
select (5e36BD + 0.1) + 5e36BD
^^^
```
 
 
 
 
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39059) When using multiple SparkSessions, DataFrame.resolve uses configuration from the wrong session

2022-04-28 Thread Furcy Pin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Furcy Pin updated SPARK-39059:
--
Description: 
We encountered an unexpected error when using SparkSession.newSession and the 
"spark.sql.caseSensitive" option.

I wrote a handful of examples below to illustrate the problem, but from the 
examples below it looks like when you use _SparkSession.newSession()_ and 
change the configuration of that new session, _DataFrame.apply(col_name)_ seems 
to use the configuration from the initial session instead of the new one.

*Example 1.A*
This fails because "spark.sql.caseSensitive" has not been set at all *(OK)* 
```

val s1 = SparkSession.builder.master("local[2]").getOrCreate()
// s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 
> 'a' is ambiguous, could be: a, a.
```

*Example 1.B*
This fails because "spark.sql.caseSensitive" has not been set on s2, even though 
it has been set on s1 *(OK)* 
```

val s1 = SparkSession.builder.master("local[2]").getOrCreate()
s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 
> 'a' is ambiguous, could be: a, a.
```

*Example 1.C*
This works because "spark.sql.caseSensitive" has been set on s2 *[OK]*
```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
// s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
```

*Example 2.A*
This fails because "spark.sql.caseSensitive" has not been set at all *[NORMAL]*

```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
// s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df("a")
// > Exception in thread "main" org.apache.spark.sql.AnalysisException: 
Reference 'a' is ambiguous, could be: a, a.


*Example 2.B*

This should fail because "spark.sql.caseSensitive" has not been set on s2, but 
it works *[NOT NORMAL]*
```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df("a")
```

*Example 2.C*
This should work because "spark.sql.caseSensitive" has been set on s2, but it 
fails instead *[NOT NORMAL]*
```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
// s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df("a")
// > Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a, a.
```

  was:
We encountered unexpected error when using SparkSession.newSession and the 
"spark.sql.caseSensitive" option.

I wrote a handful of examples below to illustrate the problem, but from the 
examples below it looks like when you use _SparkSession.newSession()_ and 
change the configuration of that new session, _DataFrame.apply(col_name)_ seems 
to use the configuration from the initial session instead of the new one.


*Example 1.A*
This fails because "spark.sql.caseSensitive" has not been set at all *(OK)* 
```

val s1 = SparkSession.builder.master("local[2]").getOrCreate()
s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 
> 'a' is ambiguous, could be: a, a.
```



*Example 1.B*
This fails because "spark.sql.caseSensitive" has not been set on s2 *(OK)* 
```

val s1 = SparkSession.builder.master("local[2]").getOrCreate()
s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 
> 'a' is ambiguous, could be: a, a.
```


*Example 1.C*
This works because "spark.sql.caseSensitive" has been set on s2 *[OK]*
```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
// s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
```


*Example 2.A*
This fails because 

[jira] [Created] (SPARK-39059) When using multiple SparkSessions, DataFrame.resolve uses configuration from the wrong session

2022-04-28 Thread Furcy Pin (Jira)
Furcy Pin created SPARK-39059:
-

 Summary: When using multiple SparkSessions, DataFrame.resolve uses 
configuration from the wrong session
 Key: SPARK-39059
 URL: https://issues.apache.org/jira/browse/SPARK-39059
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: Furcy Pin


We encountered an unexpected error when using SparkSession.newSession and the 
"spark.sql.caseSensitive" option.

I wrote a handful of examples below to illustrate the problem, but from the 
examples below it looks like when you use _SparkSession.newSession()_ and 
change the configuration of that new session, _DataFrame.apply(col_name)_ seems 
to use the configuration from the initial session instead of the new one.


*Example 1.A*
This fails because "spark.sql.caseSensitive" has not been set at all *(OK)* 
```

val s1 = SparkSession.builder.master("local[2]").getOrCreate()
s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 
> 'a' is ambiguous, could be: a, a.
```



*Example 1.B*
This fails because "spark.sql.caseSensitive" has not been set on s2 *(OK)* 
```

val s1 = SparkSession.builder.master("local[2]").getOrCreate()
s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 
> 'a' is ambiguous, could be: a, a.
```


*Example 1.C*
This works because "spark.sql.caseSensitive" has been set on s2 *[OK]*
```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
// s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df.select("a").show()
```


*Example 2.A*
This fails because "spark.sql.caseSensitive" has not been set at all *[NORMAL]*

```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
// s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df("a")
// > Exception in thread "main" org.apache.spark.sql.AnalysisException: 
Reference 'a' is ambiguous, could be: a, a.
```

*Example 2.B*
This should fail because "spark.sql.caseSensitive" has not been set on s2, but it works *[NOT NORMAL]*
```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
// s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df("a")
```

*Example 2.C*
This should work because "spark.sql.caseSensitive" has been set on s2, but it 
fails instead *[NOT NORMAL]*
```
val s1 = SparkSession.builder.master("local[2]").getOrCreate()
// s1.conf.set("spark.sql.caseSensitive", "true")
val s2 = s1.newSession()
s2.conf.set("spark.sql.caseSensitive", "true")
val df = s2.sql("select 'a' as A, 'a' as a")
df("a")
// > Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a, a.
```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38741) Test the error class: MAP_KEY_DOES_NOT_EXIST*

2022-04-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38741.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36232
[https://github.com/apache/spark/pull/36232]

> Test the error class: MAP_KEY_DOES_NOT_EXIST*
> -
>
> Key: SPARK-38741
> URL: https://issues.apache.org/jira/browse/SPARK-38741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Minor
>  Labels: starter
> Fix For: 3.4.0
>
>
> Add tests for the error classes *MAP_KEY_DOES_NOT_EXIST** to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def mapKeyNotExistError(key: Any, isElementAtFunction: Boolean): 
> NoSuchElementException = {
> if (isElementAtFunction) {
>   new SparkNoSuchElementException(errorClass = 
> "MAP_KEY_DOES_NOT_EXIST_IN_ELEMENT_AT",
> messageParameters = Array(key.toString, SQLConf.ANSI_ENABLED.key))
> } else {
>   new SparkNoSuchElementException(errorClass = "MAP_KEY_DOES_NOT_EXIST",
> messageParameters = Array(key.toString, 
> SQLConf.ANSI_STRICT_INDEX_OPERATOR.key))
> }
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class
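> A rough sketch of one possible check (configuration, SQL, and message text should 
> be verified against the actual error definition; this is not the merged test):
> {code:scala}
> // Assumes a suite extending QueryTest with SharedSparkSession.
> import org.apache.spark.SparkNoSuchElementException
> import org.apache.spark.sql.internal.SQLConf
> 
> withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") {
>   val e = intercept[SparkNoSuchElementException] {
>     sql("SELECT element_at(map(1, 'a'), 5)").collect()
>   }
>   assert(e.getErrorClass === "MAP_KEY_DOES_NOT_EXIST_IN_ELEMENT_AT")
> }
> {code}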



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38741) Test the error class: MAP_KEY_DOES_NOT_EXIST*

2022-04-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38741:


Assignee: panbingkun

> Test the error class: MAP_KEY_DOES_NOT_EXIST*
> -
>
> Key: SPARK-38741
> URL: https://issues.apache.org/jira/browse/SPARK-38741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Minor
>  Labels: starter
>
> Add tests for the error classes *MAP_KEY_DOES_NOT_EXIST** to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def mapKeyNotExistError(key: Any, isElementAtFunction: Boolean): 
> NoSuchElementException = {
> if (isElementAtFunction) {
>   new SparkNoSuchElementException(errorClass = 
> "MAP_KEY_DOES_NOT_EXIST_IN_ELEMENT_AT",
> messageParameters = Array(key.toString, SQLConf.ANSI_ENABLED.key))
> } else {
>   new SparkNoSuchElementException(errorClass = "MAP_KEY_DOES_NOT_EXIST",
> messageParameters = Array(key.toString, 
> SQLConf.ANSI_STRICT_INDEX_OPERATOR.key))
> }
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39058) Add `getInputSignature` and `getOutputSignature` APIs for spark ML models/transformers

2022-04-28 Thread Weichen Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529462#comment-17529462
 ] 

Weichen Xu commented on SPARK-39058:


CC [~ruifengz] Are you interested in contributing this feature ? :)

> Add `getInputSignature` and `getOutputSignature` APIs for spark ML 
> models/transformers
> --
>
> Key: SPARK-39058
> URL: https://issues.apache.org/jira/browse/SPARK-39058
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Add `getInputSignature` and `getOutputSignature` APIs for spark ML 
> models/transformers:
> The `getInputSignature` API returns a list of column names and types 
> representing the ML model / transformer's input columns.
> The `getOutputSignature` API returns a list of column names and types 
> representing the ML model / transformer's output columns.
> These two APIs are useful in third-party libraries such as MLflow, which 
> require the input / output signature of a model.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39058) Add `getInputSignature` and `getOutputSignature` APIs for spark ML models/transformers

2022-04-28 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-39058:
--

 Summary: Add `getInputSignature` and `getOutputSignature` APIs for 
spark ML models/transformers
 Key: SPARK-39058
 URL: https://issues.apache.org/jira/browse/SPARK-39058
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 3.4.0
Reporter: Weichen Xu


Add `getInputSignature` and `getOutputSignature` APIs for spark ML 
models/transformers:

The `getInputSignature` API returns a list of column names and types 
representing the ML model / transformer's input columns.

The `getOutputSignature` API returns a list of column names and types 
representing the ML model / transformer's output columns.

These two APIs are useful in third-party libraries such as MLflow, which 
require the input / output signature of a model.
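
For illustration, one possible shape of such an API (hypothetical only; the ticket 
does not pin down names or types):

{code:scala}
import org.apache.spark.sql.types.DataType

// A hypothetical mix-in for models/transformers exposing their I/O signature.
trait HasIOSignature {
  /** (column name, data type) for each column the model/transformer consumes. */
  def getInputSignature: Seq[(String, DataType)]

  /** (column name, data type) for each column the model/transformer produces. */
  def getOutputSignature: Seq[(String, DataType)]
}
{code}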





--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39056) Use `Collections.singletonList` instead of `Arrays.asList` when there is only one argument

2022-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39056.
--
Resolution: Won't Fix

>   Use `Collections.singletonList` instead of `Arrays.asList` when there is 
> only one argument
> 
>
> Key: SPARK-39056
> URL: https://issues.apache.org/jira/browse/SPARK-39056
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Use `Collections.singletonList` instead of `Arrays.asList` when there is only 
> one argument.
>  
> before
> {code:java}
> List<String> one = Arrays.asList("one"); {code}
> after
> {code:java}
> List<String> one = Collections.singletonList("one"); {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38718) Test the error class: AMBIGUOUS_FIELD_NAME

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529385#comment-17529385
 ] 

Apache Spark commented on SPARK-38718:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36395

> Test the error class: AMBIGUOUS_FIELD_NAME
> --
>
> Key: SPARK-38718
> URL: https://issues.apache.org/jira/browse/SPARK-38718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *AMBIGUOUS_FIELD_NAME* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def ambiguousFieldNameError(
>   fieldName: Seq[String], numMatches: Int, context: Origin): Throwable = {
> new AnalysisException(
>   errorClass = "AMBIGUOUS_FIELD_NAME",
>   messageParameters = Array(fieldName.quoted, numMatches.toString),
>   origin = context)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38718) Test the error class: AMBIGUOUS_FIELD_NAME

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38718:


Assignee: (was: Apache Spark)

> Test the error class: AMBIGUOUS_FIELD_NAME
> --
>
> Key: SPARK-38718
> URL: https://issues.apache.org/jira/browse/SPARK-38718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *AMBIGUOUS_FIELD_NAME* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def ambiguousFieldNameError(
>   fieldName: Seq[String], numMatches: Int, context: Origin): Throwable = {
> new AnalysisException(
>   errorClass = "AMBIGUOUS_FIELD_NAME",
>   messageParameters = Array(fieldName.quoted, numMatches.toString),
>   origin = context)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38718) Test the error class: AMBIGUOUS_FIELD_NAME

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529384#comment-17529384
 ] 

Apache Spark commented on SPARK-38718:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36395

> Test the error class: AMBIGUOUS_FIELD_NAME
> --
>
> Key: SPARK-38718
> URL: https://issues.apache.org/jira/browse/SPARK-38718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *AMBIGUOUS_FIELD_NAME* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def ambiguousFieldNameError(
>   fieldName: Seq[String], numMatches: Int, context: Origin): Throwable = {
> new AnalysisException(
>   errorClass = "AMBIGUOUS_FIELD_NAME",
>   messageParameters = Array(fieldName.quoted, numMatches.toString),
>   origin = context)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38718) Test the error class: AMBIGUOUS_FIELD_NAME

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38718:


Assignee: Apache Spark

> Test the error class: AMBIGUOUS_FIELD_NAME
> --
>
> Key: SPARK-38718
> URL: https://issues.apache.org/jira/browse/SPARK-38718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *AMBIGUOUS_FIELD_NAME* to 
> QueryCompilationErrorsSuite. The test should cover the exception thrown in 
> QueryCompilationErrors:
> {code:scala}
>   def ambiguousFieldNameError(
>   fieldName: Seq[String], numMatches: Int, context: Origin): Throwable = {
> new AnalysisException(
>   errorClass = "AMBIGUOUS_FIELD_NAME",
>   messageParameters = Array(fieldName.quoted, numMatches.toString),
>   origin = context)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException

2022-04-28 Thread Willi Raschkowski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529381#comment-17529381
 ] 

Willi Raschkowski edited comment on SPARK-39044 at 4/28/22 11:29 AM:
-

[~hyukjin.kwon], yes I know. But I wasn't able to get a self-contained 
reproducer. This reliably fails in prod. But using that same 
TypedImperativeAggregate with {{observe()}} in local tests works fine.

If you have ideas on what to try, I will. (Also happy to share the aggregate, 
but from the stacktrace I understood the implementation isn't relevant - it's 
the {{AggregatingAccumulator}} buffer that is {{null}}. Anyway, I attached 
[^aggregate.scala].)

I understand if you close this ticket because you cannot root-cause without a 
repro.


was (Author: raschkowski):
[~hyukjin.kwon], yes I know. But I wasn't able to get a self-contained 
reproducer. This reliably fails in prod. But using that same 
TypedImperativeAggregate with {{observe()}} in local tests works fine.

If you have ideas on what to try, I will. (Also happy to share the aggregate, 
but from the stacktrace I understood the implementation isn't relevant - it's 
the {{AggregatingAccumulator}} buffer that is {{{}null{}}}.)

I understand if you close this ticket because you cannot root-cause without a 
repro.

> AggregatingAccumulator with TypedImperativeAggregate throwing 
> NullPointerException
> --
>
> Key: SPARK-39044
> URL: https://issues.apache.org/jira/browse/SPARK-39044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Willi Raschkowski
>Priority: Major
> Attachments: aggregate.scala
>
>
> We're using a custom TypedImperativeAggregate inside an 
> AggregatingAccumulator (via {{observe()}}) and get the error below. It looks 
> like we're trying to serialize an aggregation buffer that hasn't been 
> initialized yet.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540)
>   ...
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in 
> stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: 
> java.lang.NullPointerException
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
>   at 
> java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460)
>   at 
> java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621)
>   at 
> org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205)
>   at 
> org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33)
>   at 
> org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186)
>   at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
>   at 
> 

[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException

2022-04-28 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-39044:
--
Attachment: aggregate.scala

> AggregatingAccumulator with TypedImperativeAggregate throwing 
> NullPointerException
> --
>
> Key: SPARK-39044
> URL: https://issues.apache.org/jira/browse/SPARK-39044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Willi Raschkowski
>Priority: Major
> Attachments: aggregate.scala
>
>
> We're using a custom TypedImperativeAggregate inside an 
> AggregatingAccumulator (via {{observe()}}) and get the error below. It looks 
> like we're trying to serialize an aggregation buffer that hasn't been 
> initialized yet.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540)
>   ...
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in 
> stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: 
> java.lang.NullPointerException
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
>   at 
> java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460)
>   at 
> java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621)
>   at 
> org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205)
>   at 
> org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33)
>   at 
> org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186)
>   at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55)
>   at 
> 

[jira] [Commented] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException

2022-04-28 Thread Willi Raschkowski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529381#comment-17529381
 ] 

Willi Raschkowski commented on SPARK-39044:
---

[~hyukjin.kwon], yes, I know. But I wasn't able to get a self-contained 
reproducer. This reliably fails in prod, yet using that same 
TypedImperativeAggregate with {{observe()}} in local tests works fine.

If you have ideas on what to try, I'll try them. (Also happy to share the 
aggregate, but from the stacktrace I understood the implementation isn't 
relevant - it's the {{AggregatingAccumulator}} buffer that is {{null}}.)

I understand if you close this ticket because you cannot root-cause without a 
repro.
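
For reference, a minimal sketch of the {{observe()}} pattern described in this ticket, using the built-in {{percentile_approx}} as a stand-in for the custom aggregate (its implementation, {{ApproximatePercentile}}, is also a {{TypedImperativeAggregate}}, so it goes through the same accumulator serialization path). The data, metric name, and output path are made up for illustration and are not from the report:
{code:java}
import org.apache.spark.sql.functions._

// Runs in spark-shell; `spark` is the active SparkSession.
val df = spark.range(0, 1000).toDF("value")

// observe() wires the aggregate into an AggregatingAccumulator that is
// serialized back to the driver together with the task result - the step
// where the reported stack trace hits the NullPointerException.
val observed = df.observe(
  "metrics",
  percentile_approx(col("value"), lit(0.5), lit(100)).as("p50"))

// Any action triggers the accumulator round-trip; the path is hypothetical.
observed.write.mode("overwrite").parquet("/tmp/spark-39044-repro")
{code}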

> AggregatingAccumulator with TypedImperativeAggregate throwing 
> NullPointerException
> --
>
> Key: SPARK-39044
> URL: https://issues.apache.org/jira/browse/SPARK-39044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Willi Raschkowski
>Priority: Major
>
> We're using a custom TypedImperativeAggregate inside an 
> AggregatingAccumulator (via {{observe()}}) and get the error below. It looks 
> like we're trying to serialize an aggregation buffer that hasn't been 
> initialized yet.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540)
>   ...
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in 
> stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: 
> java.lang.NullPointerException
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51)
>   at 
> java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460)
>   at 
> java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621)
>   at 
> org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205)
>   at 
> org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33)
>   at 
> org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186)
>   at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137)
>   at 
> java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at 

[jira] [Commented] (SPARK-39057) Offset could work without Limit

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529374#comment-17529374
 ] 

Apache Spark commented on SPARK-39057:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/36394

> Offset could work without Limit
> ---
>
> Key: SPARK-39057
> URL: https://issues.apache.org/jira/browse/SPARK-39057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Offset must be used together with Limit. This restriction prevents 
> adding an offset API to DataFrame.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39057) Offset could work without Limit

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529373#comment-17529373
 ] 

Apache Spark commented on SPARK-39057:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/36394

> Offset could work without Limit
> ---
>
> Key: SPARK-39057
> URL: https://issues.apache.org/jira/browse/SPARK-39057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Offset must be used together with Limit. This restriction prevents 
> adding an offset API to DataFrame.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39057) Offset could work without Limit

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39057:


Assignee: Apache Spark

> Offset could work without Limit
> ---
>
> Key: SPARK-39057
> URL: https://issues.apache.org/jira/browse/SPARK-39057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Offset must be used together with Limit. This restriction prevents 
> adding an offset API to DataFrame.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39057) Offset could work without Limit

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39057:


Assignee: (was: Apache Spark)

> Offset could work without Limit
> ---
>
> Key: SPARK-39057
> URL: https://issues.apache.org/jira/browse/SPARK-39057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Offset must be used together with Limit. This restriction prevents 
> adding an offset API to DataFrame.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39055) Fix documentation 404 page

2022-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39055.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 36392
[https://github.com/apache/spark/pull/36392]

> Fix documentation 404 page
> --
>
> Key: SPARK-39055
> URL: https://issues.apache.org/jira/browse/SPARK-39055
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.3.0
>
>
> 404 page is currently not working



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39055) Fix documentation 404 page

2022-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39055:


Assignee: Kent Yao

> Fix documentation 404 page
> --
>
> Key: SPARK-39055
> URL: https://issues.apache.org/jira/browse/SPARK-39055
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> 404 page is currently not working



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

2022-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38870:


Assignee: Furcy Pin

> SparkSession.builder returns a new builder in Scala, but not in Python
> --
>
> Key: SPARK-38870
> URL: https://issues.apache.org/jira/browse/SPARK-38870
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.1
>Reporter: Furcy Pin
>Assignee: Furcy Pin
>Priority: Major
>
> In pyspark, _SparkSession.builder_ always returns the same static builder, 
> while the expected behaviour should be the same as in Scala, where it returns 
> a new builder each time.
> *How to reproduce*
> When we run the following code in Scala :
> {code:java}
> import org.apache.spark.sql.SparkSession
> val s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> println("A : " + s1.conf.get("key")) // value
> s1.conf.set("key", "new_value")
> println("B : " + s1.conf.get("key")) // new_value
> val s2 = SparkSession.builder.getOrCreate()
> println("C : " + s1.conf.get("key")) // new_value{code}
> The output is :
> {code:java}
> A : value
> B : new_value
> C : new_value   <<<{code}
>  
> But when we run the following (supposedly equivalent) code in Python:
> {code:java}
> from pyspark.sql import SparkSession
> s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> print("A : " + s1.conf.get("key"))
> s1.conf.set("key", "new_value")
> print("B : " + s1.conf.get("key"))
> s2 = SparkSession.builder.getOrCreate()
> print("C : " + s1.conf.get("key")){code}
> The output is : 
> {code:java}
> A : value
> B : new_value
> C : value  <<<
> {code}
>  
>  
> *Root cause analysis*
> This comes from the fact that _SparkSession.builder_ behaves differently in 
> Python than in Scala. In Scala, it returns a *new builder* each time, while in 
> Python it returns *the same builder* every time, and the 
> SparkSession.Builder._options are static, too.
> Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the 
> options passed to the very first builder are re-applied every time and 
> override the options that were set afterwards. 
> This leads to very awkward behavior in every Spark version up to and including 
> 3.2.1.
> {*}Example{*}:
> This example used to crash, but was fixed by SPARK-37638
>  
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> from pyspark.sql import functions as f
> from pyspark.sql.types import StringType
> f.col("a").cast(StringType()) 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This fails in all versions until the SPARK-37638 fix
> # because before that fix, Column.cast() called 
> SparkSession.builder.getOrCreate(){code}
>  
> But this example still crashes in the current version on the master branch
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> SparkSession.builder.getOrCreate() 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This assert fails in master{code}
>  
> I made a Pull Request to fix this bug : 
> https://github.com/apache/spark/pull/36161
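
For reference, the Scala-side behavior described above can be checked with a small identity test; this is a sketch, not code from the ticket. Per the description, the equivalent Python check ({{SparkSession.builder is SparkSession.builder}}) would evaluate to {{True}} before the fix:
{code:java}
import org.apache.spark.sql.SparkSession

// In Scala, every access to SparkSession.builder constructs a fresh Builder,
// so options set on one builder cannot leak into another one.
val b1 = SparkSession.builder
val b2 = SparkSession.builder
assert(!(b1 eq b2))  // distinct Builder instances
{code}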



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

2022-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38870.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36161
[https://github.com/apache/spark/pull/36161]

> SparkSession.builder returns a new builder in Scala, but not in Python
> --
>
> Key: SPARK-38870
> URL: https://issues.apache.org/jira/browse/SPARK-38870
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.1
>Reporter: Furcy Pin
>Assignee: Furcy Pin
>Priority: Major
> Fix For: 3.4.0
>
>
> In pyspark, _SparkSession.builder_ always returns the same static builder, 
> while the expected behaviour should be the same as in Scala, where it returns 
> a new builder each time.
> *How to reproduce*
> When we run the following code in Scala :
> {code:java}
> import org.apache.spark.sql.SparkSession
> val s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> println("A : " + s1.conf.get("key")) // value
> s1.conf.set("key", "new_value")
> println("B : " + s1.conf.get("key")) // new_value
> val s2 = SparkSession.builder.getOrCreate()
> println("C : " + s1.conf.get("key")) // new_value{code}
> The output is :
> {code:java}
> A : value
> B : new_value
> C : new_value   <<<{code}
>  
> But when we run the following (supposedly equivalent) code in Python:
> {code:java}
> from pyspark.sql import SparkSession
> s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> print("A : " + s1.conf.get("key"))
> s1.conf.set("key", "new_value")
> print("B : " + s1.conf.get("key"))
> s2 = SparkSession.builder.getOrCreate()
> print("C : " + s1.conf.get("key")){code}
> The output is : 
> {code:java}
> A : value
> B : new_value
> C : value  <<<
> {code}
>  
>  
> *Root cause analysis*
> This comes from the fact that _SparkSession.builder_ behaves differently in 
> Python than in Scala. In Scala, it returns a *new builder* each time, while in 
> Python it returns *the same builder* every time, and the 
> SparkSession.Builder._options are static, too.
> Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the 
> options passed to the very first builder are re-applied every time and 
> override the options that were set afterwards. 
> This leads to very awkward behavior in every Spark version up to and including 
> 3.2.1.
> {*}Example{*}:
> This example used to crash, but was fixed by SPARK-37638
>  
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> from pyspark.sql import functions as f
> from pyspark.sql.types import StringType
> f.col("a").cast(StringType()) 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This fails in all versions until the SPARK-37638 fix
> # because before that fix, Column.cast() called 
> SparkSession.builder.getOrCreate(){code}
>  
> But this example still crashes in the current version on the master branch
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> SparkSession.builder.getOrCreate() 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This assert fails in master{code}
>  
> I made a Pull Request to fix this bug : 
> https://github.com/apache/spark/pull/36161



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39057) Offset could work without Limit

2022-04-28 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-39057:
--

 Summary: Offset could work without Limit
 Key: SPARK-39057
 URL: https://issues.apache.org/jira/browse/SPARK-39057
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Currently, Offset must be used together with Limit. This restriction prevents 
adding an offset API to DataFrame.
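
For illustration, a rough sketch of the coupling this ticket wants to remove; the table and column names are hypothetical, and the standalone {{offset}} DataFrame method mentioned in the comments does not exist yet:
{code:java}
// Typical pagination couples the two clauses, e.g. the SQL pattern
//   SELECT id, name FROM employees ORDER BY id LIMIT 10 OFFSET 20
// Because the Offset operator currently has to be paired with Limit, the
// DataFrame API only exposes limit(); there is no independent way to skip
// leading rows:
val firstPage = spark.table("employees").orderBy("id").limit(10)
// A standalone df.offset(n), usable with or without limit(), is what this
// ticket would enable.
{code}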



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39056) Use `Collections.singletonList` instead of `Arrays.asList` when there is only one argument

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39056:


Assignee: (was: Apache Spark)

>   Use `Collections.singletonList` instead of `Arrays.asList` when there is 
> only one argument
> 
>
> Key: SPARK-39056
> URL: https://issues.apache.org/jira/browse/SPARK-39056
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Use `Collections.singletonList` instead of `Arrays.asList` when there is only 
> one argument.
>  
> before
> {code:java}
> List<String> one = Arrays.asList("one"); {code}
> after
> {code:java}
> List<String> one = Collections.singletonList("one"); {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39056) Use `Collections.singletonList` instead of `Arrays.asList` when there is only one argument

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529365#comment-17529365
 ] 

Apache Spark commented on SPARK-39056:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36393

>   Use `Collections.singletonList` instead of `Arrays.asList` when there is 
> only one argument
> 
>
> Key: SPARK-39056
> URL: https://issues.apache.org/jira/browse/SPARK-39056
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Use `Collections.singletonList` instead of `Arrays.asList` when there is only 
> one argument.
>  
> before
> {code:java}
> List<String> one = Arrays.asList("one"); {code}
> after
> {code:java}
> List<String> one = Collections.singletonList("one"); {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39056) Use `Collections.singletonList` instead of `Arrays.asList` when there is only one argument

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39056:


Assignee: Apache Spark

>   Use `Collections.singletonList` instead of `Arrays.asList` when there is 
> only one argument
> 
>
> Key: SPARK-39056
> URL: https://issues.apache.org/jira/browse/SPARK-39056
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Use `Collections.singletonList` instead of `Arrays.asList` when there is only 
> one argument.
>  
> before
> {code:java}
> List<String> one = Arrays.asList("one"); {code}
> after
> {code:java}
> List<String> one = Collections.singletonList("one"); {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39056) Use `Collections.singletonList` instead of `Arrays.asList` when there is only one argument

2022-04-28 Thread Yang Jie (Jira)
Yang Jie created SPARK-39056:


 Summary:   Use `Collections.singletonList` instead of 
`Arrays.asList` when there is only one argument
 Key: SPARK-39056
 URL: https://issues.apache.org/jira/browse/SPARK-39056
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.4.0
Reporter: Yang Jie


Use `Collections.singletonList` instead of `Arrays.asList` when there is only 
one argument.

 

before
{code:java}
List<String> one = Arrays.asList("one"); {code}
after
{code:java}
List<String> one = Collections.singletonList("one"); {code}
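
For context, {{Collections.singletonList}} returns an immutable single-element list and avoids the varargs array that {{Arrays.asList}} allocates, so besides making the one-argument intent explicit it is also marginally cheaper.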



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39054) GroupByTest failed due to axis Length mismatch

2022-04-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529305#comment-17529305
 ] 

Yikun Jiang commented on SPARK-39054:
-

Related: 
https://github.com/pandas-dev/pandas/commit/d037ff6a4757bf8af2ca2431ba7d4b22b1959075

> GroupByTest failed due to axis Length mismatch
> --
>
> Key: SPARK-39054
> URL: https://issues.apache.org/jira/browse/SPARK-39054
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> An error occurred while calling o27083.getResult.
> : org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
>   at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
>   at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
>   at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at 
> py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 808.0 (TID 650) (localhost executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 343, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 84, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 336, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, 
> in <lambda>
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, 
> in wrapped
> result = f(pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in 
> wrapper
> return f(*args, **kwargs)
>   File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in 
> rename_output
> pdf.columns = return_schema.names
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
> 5588, in __setattr__
> return object.__setattr__(self, name, value)
>   File "pandas/_libs/properties.pyx", line 70, in 
> pandas._libs.properties.AxisProperty.__set__
>   File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
> 769, in _set_axis
> self._mgr.set_axis(axis, labels)
>   File 
> "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", 
> line 214, in set_axis
> self._validate_set_axis(axis, new_labels)
>   File 
> "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", line 
> 69, in _validate_set_axis
> raise ValueError(
> ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 
> elements {code}
>  
> GroupByTest.test_apply_with_new_dataframe_without_shortcut



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39054) GroupByTest failed due to axis Length mismatch

2022-04-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-39054:

Description: 
{code:java}
An error occurred while calling o27083.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in 
stage 808.0 (TID 650) (localhost executor driver): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, 
in main
process()
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, 
in process
serializer.dump_stream(out_iter, outfile)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 343, in dump_stream
return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), 
stream)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 84, in dump_stream
for batch in iterator:
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 336, in init_stream_yield_batches
for series in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, 
in mapper
return f(keys, vals)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, 
in <lambda>
return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, 
in wrapped
result = f(pd.concat(value_series, axis=1))
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in 
wrapper
return f(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in 
rename_output
pdf.columns = return_schema.names
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
5588, in __setattr__
return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 70, in 
pandas._libs.properties.AxisProperty.__set__
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
769, in _set_axis
self._mgr.set_axis(axis, labels)
  File 
"/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", 
line 214, in set_axis
self._validate_set_axis(axis, new_labels)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", 
line 69, in _validate_set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 
elements {code}
 

 

  was:
{code:java}
An error occurred while calling o27083.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at 

[jira] [Updated] (SPARK-39054) GroupByTest failed due to axis Length mismatch

2022-04-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-39054:

Description: 
{code:java}
An error occurred while calling o27083.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in 
stage 808.0 (TID 650) (localhost executor driver): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, 
in main
process()
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, 
in process
serializer.dump_stream(out_iter, outfile)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 343, in dump_stream
return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), 
stream)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 84, in dump_stream
for batch in iterator:
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 336, in init_stream_yield_batches
for series in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, 
in mapper
return f(keys, vals)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, 
in <lambda>
return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, 
in wrapped
result = f(pd.concat(value_series, axis=1))
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in 
wrapper
return f(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in 
rename_output
pdf.columns = return_schema.names
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
5588, in __setattr__
return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 70, in 
pandas._libs.properties.AxisProperty.__set__
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
769, in _set_axis
self._mgr.set_axis(axis, labels)
  File 
"/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", 
line 214, in set_axis
self._validate_set_axis(axis, new_labels)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", 
line 69, in _validate_set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 
elements {code}
 

GroupByTest.test_apply_with_new_dataframe_without_shortcut

  was:
{code:java}
An error occurred while calling o27083.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at 

[jira] [Assigned] (SPARK-39055) Fix documentation 404 page

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39055:


Assignee: (was: Apache Spark)

> Fix documentation 404 page
> --
>
> Key: SPARK-39055
> URL: https://issues.apache.org/jira/browse/SPARK-39055
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>
> 404 page is currently not working



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39055) Fix documentation 404 page

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529291#comment-17529291
 ] 

Apache Spark commented on SPARK-39055:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/36392

> Fix documentation 404 page
> --
>
> Key: SPARK-39055
> URL: https://issues.apache.org/jira/browse/SPARK-39055
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>
> 404 page is currently not working



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39055) Fix documentation 404 page

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39055:


Assignee: Apache Spark

> Fix documentation 404 page
> --
>
> Key: SPARK-39055
> URL: https://issues.apache.org/jira/browse/SPARK-39055
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> 404 page is currently not working



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39045) INTERNAL_ERROR for "all" internal errors

2022-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-39045:

Affects Version/s: 3.4.0
   (was: 3.3.0)

> INTERNAL_ERROR for "all" internal errors
> 
>
> Key: SPARK-39045
> URL: https://issues.apache.org/jira/browse/SPARK-39045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> We should be able to inject the [SYSTEM_ERROR] class for most cases without 
> waiting to label the long tail with user-facing error classes.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38819) Run Pandas on Spark with Pandas 1.4.x

2022-04-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529283#comment-17529283
 ] 

Yikun Jiang commented on SPARK-38819:
-

FYI, all issues with upgrading to pandas 1.4.x have been listed above. SPARK-38946 
may take some more time; the others are okay.

> Run Pandas on Spark with Pandas 1.4.x
> -
>
> Key: SPARK-38819
> URL: https://issues.apache.org/jira/browse/SPARK-38819
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> This is an umbrella to track issues when upgrading pandas to 1.4.x
>  
> I disabled fast-fail in the tests; 19 failed:
> [https://github.com/Yikun/spark/pull/88/checks?check_run_id=5873627048]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39037) DS V2 Top N push-down supports order by expressions

2022-04-28 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-39037:
---
Summary: DS V2 Top N push-down supports order by expressions  (was: DS V2 
aggregate push-down supports order by expressions)

> DS V2 Top N push-down supports order by expressions
> ---
>
> Key: SPARK-39037
> URL: https://issues.apache.org/jira/browse/SPARK-39037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, Spark DS V2 aggregate push-down only supports order by column.
> But the SQL shown below is very useful and common.
> SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 1.00) THEN "SALARY" 
> ELSE 0.00 END AS key, dept, name FROM "test"."employee" ORDER BY key



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39037) DS V2 Top N push-down supports order by expressions

2022-04-28 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-39037:
---
Description: 
Currently, Spark DS V2 Top N push-down only supports order by column.
But the SQL shown below is very useful and common.
SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 1.00) THEN "SALARY" 
ELSE 0.00 END AS key, dept, name FROM "test"."employee" ORDER BY key

  was:
Currently, Spark DS V2 aggregate push-down only supports order by column.
But the SQL shown below is very useful and common.
SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 1.00) THEN "SALARY" 
ELSE 0.00 END AS key, dept, name FROM "test"."employee" ORDER BY key


> DS V2 Top N push-down supports order by expressions
> ---
>
> Key: SPARK-39037
> URL: https://issues.apache.org/jira/browse/SPARK-39037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, Spark DS V2 Top N push-down only supports order by column.
> But the SQL shown below is very useful and common.
> SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 1.00) THEN "SALARY" 
> ELSE 0.00 END AS key, dept, name FROM "test"."employee" ORDER BY key



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39055) Fix documentation 404 page

2022-04-28 Thread Kent Yao (Jira)
Kent Yao created SPARK-39055:


 Summary: Fix documentation 404 page
 Key: SPARK-39055
 URL: https://issues.apache.org/jira/browse/SPARK-39055
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 3.4.0
Reporter: Kent Yao


404 page is currently not working



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39054) GroupByTest failed due to axis Length mismatch

2022-04-28 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-39054:
---

 Summary: GroupByTest failed due to axis Length mismatch
 Key: SPARK-39054
 URL: https://issues.apache.org/jira/browse/SPARK-39054
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Yikun Jiang


{code:java}
An error occurred while calling o27083.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:97)
at 
org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:93)
at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 808.0 failed 1 times, most recent failure: Lost task 0.0 in 
stage 808.0 (TID 650) (localhost executor driver): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 686, 
in main
process()
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 678, 
in process
serializer.dump_stream(out_iter, outfile)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 343, in dump_stream
return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), 
stream)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 84, in dump_stream
for batch in iterator:
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 336, in init_stream_yield_batches
for series in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 487, 
in mapper
return f(keys, vals)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 207, 
in <lambda>
return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 185, 
in wrapped
result = f(pd.concat(value_series, axis=1))
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 81, in 
wrapper
return f(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/pandas/groupby.py", line 1620, in 
rename_output
pdf.columns = return_schema.names
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
5588, in __setattr__
return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 70, in 
pandas._libs.properties.AxisProperty.__set__
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py", line 
769, in _set_axis
self._mgr.set_axis(axis, labels)
  File 
"/usr/local/lib/python3.9/dist-packages/pandas/core/internals/managers.py", 
line 214, in set_axis
self._validate_set_axis(axis, new_labels)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/internals/base.py", 
line 69, in _validate_set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 
elements {code}
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529274#comment-17529274
 ] 

Apache Spark commented on SPARK-39053:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36391

> test_multi_index_dtypes failed due to index mismatch
> 
>
> Key: SPARK-39053
> URL: https://issues.apache.org/jira/browse/SPARK-39053
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> DataFrameTest.test_multi_index_dtypes
> Series.index are different
> Series.index classes are different
> [left]:  MultiIndex([('zero',  'first'),
> ( 'one', 'second')],
>)
> [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')
> Left:
> zero  first  int64
> one   secondobject
> dtype: object
> object
> Right:
> (zero, first) int64
> (one, second)object
> dtype: object
> object {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529273#comment-17529273
 ] 

Apache Spark commented on SPARK-39053:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36391

> test_multi_index_dtypes failed due to index mismatch
> 
>
> Key: SPARK-39053
> URL: https://issues.apache.org/jira/browse/SPARK-39053
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> DataFrameTest.test_multi_index_dtypes
> Series.index are different
> Series.index classes are different
> [left]:  MultiIndex([('zero',  'first'),
> ( 'one', 'second')],
>)
> [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')
> Left:
> zero  first  int64
> one   secondobject
> dtype: object
> object
> Right:
> (zero, first) int64
> (one, second)object
> dtype: object
> object {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39053:


Assignee: (was: Apache Spark)

> test_multi_index_dtypes failed due to index mismatch
> 
>
> Key: SPARK-39053
> URL: https://issues.apache.org/jira/browse/SPARK-39053
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> DataFrameTest.test_multi_index_dtypes
> Series.index are different
> Series.index classes are different
> [left]:  MultiIndex([('zero',  'first'),
> ( 'one', 'second')],
>)
> [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')
> Left:
> zero  first  int64
> one   secondobject
> dtype: object
> object
> Right:
> (zero, first) int64
> (one, second)object
> dtype: object
> object {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-04-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39053:


Assignee: Apache Spark

> test_multi_index_dtypes failed due to index mismatch
> 
>
> Key: SPARK-39053
> URL: https://issues.apache.org/jira/browse/SPARK-39053
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> DataFrameTest.test_multi_index_dtypes
> Series.index are different
> Series.index classes are different
> [left]:  MultiIndex([('zero',  'first'),
> ( 'one', 'second')],
>)
> [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')
> Left:
> zero  first  int64
> one   secondobject
> dtype: object
> object
> Right:
> (zero, first) int64
> (one, second)object
> dtype: object
> object {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39037) DS V2 aggregate push-down supports order by expressions

2022-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39037.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36370
[https://github.com/apache/spark/pull/36370]

> DS V2 aggregate push-down supports order by expressions
> ---
>
> Key: SPARK-39037
> URL: https://issues.apache.org/jira/browse/SPARK-39037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, Spark DS V2 aggregate push-down only supports order by column.
> But the SQL shown below is very useful and common.
> SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 1.00) THEN "SALARY" 
> ELSE 0.00 END AS key, dept, name FROM "test"."employee" ORDER BY key



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39037) DS V2 aggregate push-down supports order by expressions

2022-04-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39037:
---

Assignee: jiaan.geng

> DS V2 aggregate push-down supports order by expressions
> ---
>
> Key: SPARK-39037
> URL: https://issues.apache.org/jira/browse/SPARK-39037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Currently, Spark DS V2 aggregate push-down only supports ORDER BY on a plain column,
> but SQL like the statement shown below is very useful and common:
> SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 1.00) THEN "SALARY" 
> ELSE 0.00 END AS key, dept, name FROM "test"."employee" ORDER BY key



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-04-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529259#comment-17529259
 ] 

Yikun Jiang commented on SPARK-39053:
-

https://github.com/pandas-dev/pandas/commit/d06fb912782834125f1c9b0baaea1d60f2151c69

> test_multi_index_dtypes failed due to index mismatch
> 
>
> Key: SPARK-39053
> URL: https://issues.apache.org/jira/browse/SPARK-39053
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> DataFrameTest.test_multi_index_dtypes
> Series.index are different
> Series.index classes are different
> [left]:  MultiIndex([('zero',  'first'),
> ( 'one', 'second')],
>)
> [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')
> Left:
> zero  first      int64
> one   second    object
> dtype: object
> object
> Right:
> (zero, first)     int64
> (one, second)    object
> dtype: object
> object {code}
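
A minimal reproduction sketch of the comparison behind this failure follows; the exact test construction is an assumption, and only the column tuples mirror the traceback above.

{code:python}
# Hedged sketch, not the actual test: build a MultiIndex-columned frame and
# compare the index class of DataFrame.dtypes between pandas and pandas-on-Spark.
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame(
    [[1, "a"], [2, "b"]],
    columns=pd.MultiIndex.from_tuples([("zero", "first"), ("one", "second")]),
)
psdf = ps.from_pandas(pdf)

# Per the traceback, one side's dtypes Series carries a MultiIndex while the
# other carries a flat Index of tuples, so a strict index comparison in the
# test fails after the pandas change linked above.
print(type(pdf.dtypes.index).__name__)
print(type(psdf.dtypes.index).__name__)
{code}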



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-04-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-39053:

Parent: SPARK-38819
Issue Type: Sub-task  (was: Bug)

> test_multi_index_dtypes failed due to index mismatch
> 
>
> Key: SPARK-39053
> URL: https://issues.apache.org/jira/browse/SPARK-39053
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> DataFrameTest.test_multi_index_dtypes
> Series.index are different
> Series.index classes are different
> [left]:  MultiIndex([('zero',  'first'),
> ( 'one', 'second')],
>)
> [right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')
> Left:
> zero  first      int64
> one   second    object
> dtype: object
> object
> Right:
> (zero, first)     int64
> (one, second)    object
> dtype: object
> object {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39053) test_multi_index_dtypes failed due to index mismatch

2022-04-28 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-39053:
---

 Summary: test_multi_index_dtypes failed due to index mismatch
 Key: SPARK-39053
 URL: https://issues.apache.org/jira/browse/SPARK-39053
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Yikun Jiang


{code:java}
DataFrameTest.test_multi_index_dtypes
Series.index are different

Series.index classes are different
[left]:  MultiIndex([('zero',  'first'),
( 'one', 'second')],
   )
[right]: Index([('zero', 'first'), ('one', 'second')], dtype='object')

Left:
zero  first      int64
one   second    object
dtype: object
object

Right:
(zero, first)     int64
(one, second)    object
dtype: object
object {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529233#comment-17529233
 ] 

Apache Spark commented on SPARK-39047:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36390

> Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
> 
>
> Key: SPARK-39047
> URL: https://issues.apache.org/jira/browse/SPARK-39047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING, and
> remove the latter because it duplicates INVALID_PARAMETER_VALUE.
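
A rough illustration follows. It assumes a local session and that format_string with a "%0$" argument index is one of the calls that used to raise ILLEGAL_SUBSTRING; which error class actually appears depends on the Spark version you run it against.

{code:python}
# Hedged sketch: trigger the invalid-parameter path and inspect the error text.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("error-class-demo").getOrCreate()

try:
    spark.sql("SELECT format_string('%0$s', 'hello')").collect()
except Exception as e:
    # After this change the message should reference INVALID_PARAMETER_VALUE
    # rather than ILLEGAL_SUBSTRING.
    print(type(e).__name__)
    print(str(e)[:200])

spark.stop()
{code}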



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE

2022-04-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529232#comment-17529232
 ] 

Apache Spark commented on SPARK-39047:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/36390

> Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
> 
>
> Key: SPARK-39047
> URL: https://issues.apache.org/jira/browse/SPARK-39047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING, and
> remove the latter because it duplicates INVALID_PARAMETER_VALUE.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


