[jira] [Commented] (SPARK-41875) Throw proper errors in Dataset.to()

2023-01-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654845#comment-17654845
 ] 

jiaan.geng commented on SPARK-41875:


It seems this isn't an issue with Connect.
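
For reference, a minimal standalone repro of the expected behavior (a sketch, assuming a local classic PySpark 3.4 session; the expected NULLABLE_COLUMN_OR_FIELD error class comes from the test above):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import (IntegerType, LongType, StringType,
                               StructField, StructType)
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [("a", 1)],
    StructType([StructField("i", StringType(), True),
                StructField("j", IntegerType(), True)]),
)

# Projecting a nullable column into a non-nullable field should fail analysis.
try:
    df.to(StructType([StructField("j", LongType(), False)]))
except AnalysisException as e:
    print(e)  # expected to carry the NULLABLE_COLUMN_OR_FIELD error class
{code}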

> Throw proper errors in Dataset.to()
> ---
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by <lambda> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41889) Attach root cause to invalidPatternError

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41889:


Assignee: (was: Apache Spark)

> Attach root cause to invalidPatternError
> 
>
> Key: SPARK-41889
> URL: https://issues.apache.org/jira/browse/SPARK-41889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41889) Attach root cause to invalidPatternError

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41889:


Assignee: Apache Spark

> Attach root cause to invalidPatternError
> 
>
> Key: SPARK-41889
> URL: https://issues.apache.org/jira/browse/SPARK-41889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41889) Attach root cause to invalidPatternError

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654826#comment-17654826
 ] 

Apache Spark commented on SPARK-41889:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39402

> Attach root cause to invalidPatternError
> 
>
> Key: SPARK-41889
> URL: https://issues.apache.org/jira/browse/SPARK-41889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41893) Publish SBOM artifacts

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41893:


Assignee: (was: Apache Spark)

> Publish SBOM artifacts
> --
>
> Key: SPARK-41893
> URL: https://issues.apache.org/jira/browse/SPARK-41893
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41893) Publish SBOM artifacts

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654824#comment-17654824
 ] 

Apache Spark commented on SPARK-41893:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39401

> Publish SBOM artifacts
> --
>
> Key: SPARK-41893
> URL: https://issues.apache.org/jira/browse/SPARK-41893
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41893) Publish SBOM artifacts

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41893:


Assignee: Apache Spark

> Publish SBOM artifacts
> --
>
> Key: SPARK-41893
> URL: https://issues.apache.org/jira/browse/SPARK-41893
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41894) sql/core module mvn clean failed

2023-01-04 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654823#comment-17654823
 ] 

Yang Jie commented on SPARK-41894:
--

The running environment is Linux. I haven't yet found the specific case that generated this file; it needs more investigation.

> sql/core module mvn clean failed
> 
>
> Key: SPARK-41894
> URL: https://issues.apache.org/jira/browse/SPARK-41894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> run the following commands:
>  # mvn clean install -pl sql/core -am -DskipTests
>  # mvn test -pl sql/core 
>  # mvn clean
>  
> then the following error occurs:
>  
> {code:java}
> [INFO] Spark Project Parent POM ... SUCCESS [  0.133 
> s]
> [INFO] Spark Project Tags . SUCCESS [  0.008 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [  0.007 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  0.008 
> s]
> [INFO] Spark Project Networking ... SUCCESS [  0.015 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  0.020 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  0.007 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  0.008 
> s]
> [INFO] Spark Project Core . SUCCESS [  0.279 
> s]
> [INFO] Spark Project ML Local Library . SUCCESS [  0.010 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [  0.016 
> s]
> [INFO] Spark Project Streaming  SUCCESS [  0.039 
> s]
> [INFO] Spark Project Catalyst . SUCCESS [  0.262 
> s]
> [INFO] Spark Project SQL .. FAILURE [  1.305 
> s]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project YARN Shuffle Service . SKIPPED
> [INFO] Spark Project YARN . SKIPPED
> [INFO] Spark Project Mesos  SKIPPED
> [INFO] Spark Project Kubernetes ... SKIPPED
> [INFO] Spark Project Hive Thrift Server ... SKIPPED
> [INFO] Spark Ganglia Integration .. SKIPPED
> [INFO] Spark Project Hadoop Cloud Integration . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
> [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> [INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
> [INFO] Spark Kinesis Integration .. SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> [INFO] Spark Avro . SKIPPED
> [INFO] Spark Project Connect Common ... SKIPPED
> [INFO] Spark Project Connect Server ... SKIPPED
> [INFO] Spark Project Connect Client ... SKIPPED
> [INFO] Spark Protobuf . SKIPPED
> [INFO] Spark Project Kinesis Assembly . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time:  2.896 s
> [INFO] Finished at: 2023-01-05T15:15:57+08:00
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-clean-plugin:3.1.0:clean (default-clean) on 
> project spark-sql_2.13: Failed to clean project: Failed to delete 
> /${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc
>  -> [Help 1]
>  {code}
>  
>  
> run:
>  * ll 
> /${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc
>  
>  
> {code:java}
> -rw-r--r-- 1 work work 12 Dec 28 16:06 
> /${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc{code}
>  
>  
> and the current user (work) can't rm this file:
>  * rm  
> /${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc
>  
> {code:java}
> rm: cannot remove 
> 

[jira] [Created] (SPARK-41894) sql/core module mvn clean failed

2023-01-04 Thread Yang Jie (Jira)
Yang Jie created SPARK-41894:


 Summary: sql/core module mvn clean failed
 Key: SPARK-41894
 URL: https://issues.apache.org/jira/browse/SPARK-41894
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming, Tests
Affects Versions: 3.4.0
Reporter: Yang Jie


run the following commands:
 # mvn clean install -pl sql/core -am -DskipTests
 # mvn test -pl sql/core 
 # mvn clean

 

then the following error occurs:

 
{code:java}
[INFO] Spark Project Parent POM ... SUCCESS [  0.133 s]
[INFO] Spark Project Tags . SUCCESS [  0.008 s]
[INFO] Spark Project Sketch ... SUCCESS [  0.007 s]
[INFO] Spark Project Local DB . SUCCESS [  0.008 s]
[INFO] Spark Project Networking ... SUCCESS [  0.015 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  0.020 s]
[INFO] Spark Project Unsafe ... SUCCESS [  0.007 s]
[INFO] Spark Project Launcher . SUCCESS [  0.008 s]
[INFO] Spark Project Core . SUCCESS [  0.279 s]
[INFO] Spark Project ML Local Library . SUCCESS [  0.010 s]
[INFO] Spark Project GraphX ... SUCCESS [  0.016 s]
[INFO] Spark Project Streaming  SUCCESS [  0.039 s]
[INFO] Spark Project Catalyst . SUCCESS [  0.262 s]
[INFO] Spark Project SQL .. FAILURE [  1.305 s]
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project YARN Shuffle Service . SKIPPED
[INFO] Spark Project YARN . SKIPPED
[INFO] Spark Project Mesos  SKIPPED
[INFO] Spark Project Kubernetes ... SKIPPED
[INFO] Spark Project Hive Thrift Server ... SKIPPED
[INFO] Spark Ganglia Integration .. SKIPPED
[INFO] Spark Project Hadoop Cloud Integration . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
[INFO] Spark Integration for Kafka 0.10 ... SKIPPED
[INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
[INFO] Spark Kinesis Integration .. SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
[INFO] Spark Avro . SKIPPED
[INFO] Spark Project Connect Common ... SKIPPED
[INFO] Spark Project Connect Server ... SKIPPED
[INFO] Spark Project Connect Client ... SKIPPED
[INFO] Spark Protobuf . SKIPPED
[INFO] Spark Project Kinesis Assembly . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time:  2.896 s
[INFO] Finished at: 2023-01-05T15:15:57+08:00
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-clean-plugin:3.1.0:clean (default-clean) on 
project spark-sql_2.13: Failed to clean project: Failed to delete 
/${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc
 -> [Help 1]
 {code}
 

 

run:
 * ll 
/${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc

 

 
{code:java}
-rw-r--r-- 1 work work 12 Dec 28 16:06 
/${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc{code}
 

 

and the current user (work) can't rm this file:
 * rm  
/${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc

 
{code:java}
rm: cannot remove 
`/${basedir}/sql/core/target/tmp/streaming.metadata-1b8b16d8-c9ba-4c38-9ac0-94a39f583082/commits/.0.crc':
 Permission denied {code}
The file can only be removed as root. (Deleting a file requires write permission on its containing directory rather than on the file itself, so the permissions of the {{commits}} directory are worth checking.)
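
A small diagnostic sketch (assuming plain CPython on the same Linux box; the {{base}} path is illustrative) to print the owner and mode of every leftover entry, since a directory not writable by `work` along the path would explain the "Permission denied" on rm:

{code:python}
import os
import pwd
import stat

base = "sql/core/target/tmp"  # adjust to the real ${basedir}-relative path

# Walk the leftover checkpoint tree, printing owner and mode for each entry.
for root, dirs, files in os.walk(base):
    for name in dirs + files:
        p = os.path.join(root, name)
        st = os.lstat(p)
        print(pwd.getpwuid(st.st_uid).pw_name,
              oct(stat.S_IMODE(st.st_mode)), p)
{code}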

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41829) Implement Dataframe.sort,sortWithinPartitions Ordering

2023-01-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41829.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39398
[https://github.com/apache/spark/pull/39398]

> Implement Dataframe.sort,sortWithinPartitions Ordering
> --
>
> Key: SPARK-41829
> URL: https://issues.apache.org/jira/browse/SPARK-41829
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 422, in pyspark.sql.connect.dataframe.DataFrame.sort
> Failed example:
>     df.orderBy(["age", "name"], ascending=[False, False]).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sort[...]>", line 1, in <module>
>         df.orderBy(["age", "name"], ascending=[False, False]).show()
>     TypeError: DataFrame.sort() got an unexpected keyword argument 'ascending'
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 379, in pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions
> Failed example:
>     df.sortWithinPartitions("age", ascending=False)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions[1]>", line 1, in <module>
>         df.sortWithinPartitions("age", ascending=False)
>     TypeError: DataFrame.sortWithinPartitions() got an unexpected keyword 
> argument 'ascending'{code}
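
For comparison, the classic (non-Connect) PySpark API does accept the {{ascending}} keyword; a quick sketch, assuming a local session:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])

# Classic API: a list of columns with a parallel list of sort directions.
df.orderBy(["age", "name"], ascending=[False, False]).show()
# Classic API: a single column with a scalar ascending flag.
df.sortWithinPartitions("age", ascending=False).show()
{code}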



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41829) Implement Dataframe.sort,sortWithinPartitions Ordering

2023-01-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41829:
-

Assignee: Ruifeng Zheng

> Implement Dataframe.sort,sortWithinPartitions Ordering
> --
>
> Key: SPARK-41829
> URL: https://issues.apache.org/jira/browse/SPARK-41829
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 422, in pyspark.sql.connect.dataframe.DataFrame.sort
> Failed example:
>     df.orderBy(["age", "name"], ascending=[False, False]).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sort[...]>", line 1, in <module>
>         df.orderBy(["age", "name"], ascending=[False, False]).show()
>     TypeError: DataFrame.sort() got an unexpected keyword argument 'ascending'
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 379, in pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions
> Failed example:
>     df.sortWithinPartitions("age", ascending=False)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions[1]>", line 1, in <module>
>         df.sortWithinPartitions("age", ascending=False)
>     TypeError: DataFrame.sortWithinPartitions() got an unexpected keyword 
> argument 'ascending'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41893) Publish SBOM artifacts

2023-01-04 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-41893:
-

 Summary: Publish SBOM artifacts
 Key: SPARK-41893
 URL: https://issues.apache.org/jira/browse/SPARK-41893
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33772) Build and Run Spark on Java 17

2023-01-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654818#comment-17654818
 ] 

Dongjoon Hyun commented on SPARK-33772:
---

FYI, Apache Spark already has Java 17 SBT test coverage, [~jomach].
 * Java 17 on Linux (GitHub Action) 
[https://github.com/apache/spark/actions/runs/3833322692]
 * Java 17 on Apple Silicon 
[https://apache-spark.s3.fr-par.scw.cloud/index.html]

Please file a Jira with details like your environment information and 
reproducible commands.

> Build and Run Spark on Java 17
> --
>
> Key: SPARK-33772
> URL: https://issues.apache.org/jira/browse/SPARK-33772
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Yang Jie
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is 
> 17.
> ||Version||Release Date||
> |Java 17 (LTS)|September 2021|
> Apache Spark has a release plan and `Spark 3.2 Code freeze` was July along 
> with the release branch cut.
> - https://spark.apache.org/versioning-policy.html
> Supporting a new Java version is considered a new feature, which we cannot
> backport.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates

2023-01-04 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-41162:
--
Affects Version/s: 3.0.3

> Anti-join must not be pushed below aggregation with ambiguous predicates
> 
>
> Key: SPARK-41162
> URL: https://issues.apache.org/jira/browse/SPARK-41162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.1, 3.2.3, 3.4.0
>Reporter: Enrico Minack
>Priority: Major
>  Labels: correctness
>
> The following query should return a single row, as all values for {{id}}
> except the largest will be eliminated by the anti-join:
> {code}
> val ids = Seq(1, 2, 3).toDF("id").distinct()
> val result = ids.withColumn("id", $"id" + 1).join(ids, "id", 
> "left_anti").collect()
> assert(result.length == 1)
> {code}
> Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the 
> assertion should still hold but is false.
> Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left 
> {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never 
> be true.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
> !Join LeftAnti, (id#752 = id#750)                 'Aggregate [id#750], [(id#750 + 1) AS id#752]
> !:- Aggregate [id#750], [(id#750 + 1) AS id#752]  +- 'Join LeftAnti, ((id#750 + 1) = id#750)
> !:  +- LocalRelation [id#750]                        :- LocalRelation [id#750]
> !+- Aggregate [id#750], [id#750]                     +- Aggregate [id#750], [id#750]
> !   +- LocalRelation [id#750]                           +- LocalRelation [id#750]
> {code}
> The optimizer then rightly removes the left-anti join altogether, returning 
> the left child only.
> Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that
> reference both the left *and* the right child.
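
The same repro in PySpark, for convenience (a sketch assuming a local session; on an affected build the assertion fails with zero rows because the anti-join is optimized away):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[1]").getOrCreate()

ids = spark.createDataFrame([(1,), (2,), (3,)], ["id"]).distinct()
# {1,2,3} shifted to {2,3,4}; anti-join against {1,2,3} should leave only 4.
result = ids.withColumn("id", col("id") + 1).join(ids, "id", "left_anti").collect()
assert len(result) == 1, result
{code}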



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41580) Assign name to _LEGACY_ERROR_TEMP_2137

2023-01-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41580.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39305
[https://github.com/apache/spark/pull/39305]

> Assign name to _LEGACY_ERROR_TEMP_2137
> --
>
> Key: SPARK-41580
> URL: https://issues.apache.org/jira/browse/SPARK-41580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two
> people working on the same ticket at the same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41580) Assign name to _LEGACY_ERROR_TEMP_2137

2023-01-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41580:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2137
> --
>
> Key: SPARK-41580
> URL: https://issues.apache.org/jira/browse/SPARK-41580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two
> people working on the same ticket at the same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41576) Assign name to _LEGACY_ERROR_TEMP_2051

2023-01-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41576.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39281
[https://github.com/apache/spark/pull/39281]

> Assign name to _LEGACY_ERROR_TEMP_2051
> --
>
> Key: SPARK-41576
> URL: https://issues.apache.org/jira/browse/SPARK-41576
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two
> people working on the same ticket at the same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41576) Assign name to _LEGACY_ERROR_TEMP_2051

2023-01-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41576:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2051
> --
>
> Key: SPARK-41576
> URL: https://issues.apache.org/jira/browse/SPARK-41576
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two
> people working on the same ticket at the same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41821) Fix DataFrame.describe

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41821:


Assignee: jiaan.geng

> Fix DataFrame.describe
> --
>
> Key: SPARK-41821
> URL: https://issues.apache.org/jira/browse/SPARK-41821
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 898, in pyspark.sql.connect.dataframe.DataFrame.describe
> Failed example:
>     df.describe(['age']).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.describe[...]>", line 1, in <module>
>         df.describe(['age']).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 832, in describe
>         raise TypeError(f"'cols' must be list[str], but got 
> {type(s).__name__}")
>     TypeError: 'cols' must be list[str], but got list {code}
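
The message suggests the check inspects the type of the container rather than its elements. A corrected validation sketch (a hypothetical helper, not the actual Connect code):

{code:python}
from typing import Any, List

def _validate_cols(cols: Any) -> List[str]:
    # Check element-by-element; comparing the container type alone would
    # reject a perfectly valid list[str], matching the error above.
    if not isinstance(cols, list) or not all(isinstance(c, str) for c in cols):
        raise TypeError(f"'cols' must be list[str], but got {type(cols).__name__}")
    return cols

print(_validate_cols(["age"]))  # ['age'] -- so df.describe(['age']) passes
{code}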



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41821) Fix DataFrame.describe

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41821.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39378
[https://github.com/apache/spark/pull/39378]

> Fix DataFrame.describe
> --
>
> Key: SPARK-41821
> URL: https://issues.apache.org/jira/browse/SPARK-41821
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 898, in pyspark.sql.connect.dataframe.DataFrame.describe
> Failed example:
>     df.describe(['age']).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.describe[...]>", line 1, in <module>
>         df.describe(['age']).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 832, in describe
>         raise TypeError(f"'cols' must be list[str], but got 
> {type(s).__name__}")
>     TypeError: 'cols' must be list[str], but got list {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41871) DataFrame hint parameter can be str, float or int

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41871:


Assignee: Sandeep Singh

> DataFrame hint parameter can be str, float or int
> -
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}
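
A sketch of the intended validation (a hypothetical helper; per the ticket title, str, int, and float scalars should be accepted, plus lists of them):

{code:python}
# Allowed scalar types for hint parameters; lists of these are also allowed.
_ALLOWED = (str, int, float)

def _is_valid_hint_param(p) -> bool:
    if isinstance(p, list):
        return all(isinstance(e, _ALLOWED) for e in p)
    return isinstance(p, _ALLOWED)

for p in (1.2345, "what", ["itworks1", "itworks2", "itworks3"]):
    assert _is_valid_hint_param(p), p
{code}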



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41871) DataFrame hint parameter can be str, float or int

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41871.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39393
[https://github.com/apache/spark/pull/39393]

> DataFrame hint parameter can be str, float or int
> -
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41891) Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, test_window_time, test_reciprocal_t

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654734#comment-17654734
 ] 

Apache Spark commented on SPARK-41891:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39400

> Enable test_add_months_function, test_array_repeat, test_dayofweek, 
> test_first_last_ignorenulls, test_function_parity, test_inline, 
> test_window_time, test_reciprocal_trig_functions
> 
>
> Key: SPARK-41891
> URL: https://issues.apache.org/jira/browse/SPARK-41891
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41891) Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, test_window_time, test_reciprocal_tr

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41891:


Assignee: Sandeep Singh  (was: Apache Spark)

> Enable test_add_months_function, test_array_repeat, test_dayofweek, 
> test_first_last_ignorenulls, test_function_parity, test_inline, 
> test_window_time, test_reciprocal_trig_functions
> 
>
> Key: SPARK-41891
> URL: https://issues.apache.org/jira/browse/SPARK-41891
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41891) Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, test_window_time, test_reciprocal_tr

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41891:


Assignee: Apache Spark  (was: Sandeep Singh)

> Enable test_add_months_function, test_array_repeat, test_dayofweek, 
> test_first_last_ignorenulls, test_function_parity, test_inline, 
> test_window_time, test_reciprocal_trig_functions
> 
>
> Key: SPARK-41891
> URL: https://issues.apache.org/jira/browse/SPARK-41891
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39318) Remove tpch-plan-stability WithStats golden files

2023-01-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39318:
---

Assignee: XiDuo You

> Remove tpch-plan-stability WithStats golden files
> -
>
> Key: SPARK-39318
> URL: https://issues.apache.org/jira/browse/SPARK-39318
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> These are dead golden files, since we have no stats with TPCH and no check
> for them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39318) Remove tpch-plan-stability WithStats golden files

2023-01-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39318.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36700
[https://github.com/apache/spark/pull/36700]

> Remove tpch-plan-stability WithStats golden files
> -
>
> Key: SPARK-39318
> URL: https://issues.apache.org/jira/browse/SPARK-39318
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> These are dead golden files, since we have no stats with TPCH and no check
> for them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41891) Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, test_window_time, test_reciprocal_tri

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41891:
--
Summary: Enable test_add_months_function, test_array_repeat, 
test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, 
test_window_time, test_reciprocal_trig_functions  (was: Enable 8 tests)

> Enable test_add_months_function, test_array_repeat, test_dayofweek, 
> test_first_last_ignorenulls, test_function_parity, test_inline, 
> test_window_time, test_reciprocal_trig_functions
> 
>
> Key: SPARK-41891
> URL: https://issues.apache.org/jira/browse/SPARK-41891
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41892) Add JIRAs or messages for skipped messages

2023-01-04 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41892:
-

 Summary: Add JIRAs or messages for skipped messages
 Key: SPARK-41892
 URL: https://issues.apache.org/jira/browse/SPARK-41892
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Sandeep Singh
 Fix For: 3.4.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41878) Add JIRAs or messages for skipped tests

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41878:
--
Summary: Add JIRAs or messages for skipped tests  (was: Add JIRAs or 
messages for skipped messages)

> Add JIRAs or messages for skipped tests
> ---
>
> Key: SPARK-41878
> URL: https://issues.apache.org/jira/browse/SPARK-41878
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> Add JIRAs or messages for all the skipped tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41891) Enable 8 tests

2023-01-04 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41891:
-

 Summary: Enable 8 tests
 Key: SPARK-41891
 URL: https://issues.apache.org/jira/browse/SPARK-41891
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Sandeep Singh
 Fix For: 3.4.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41694) Add new config to clean up `spark.ui.store.path` directory when SparkContext.stop()

2023-01-04 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-41694:
--

Assignee: Yang Jie

> Add new config to clean up `spark.ui.store.path` directory when 
> SparkContext.stop()
> ---
>
> Key: SPARK-41694
> URL: https://issues.apache.org/jira/browse/SPARK-41694
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> The {{spark.ui.store.path}} directory is not cleaned up when
> {{SparkContext.stop()}} is called. As a result:
>  # The disk space occupied by the {{spark.ui.store.path}} directory will
> continue to grow.
>  # When submitting a new app that reuses the {{spark.ui.store.path}}
> directory, we will see content related to the previous app, which is a bit strange.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41694) Add new config to clean up `spark.ui.store.path` directory when SparkContext.stop()

2023-01-04 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41694.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39226
[https://github.com/apache/spark/pull/39226]

> Add new config to clean up `spark.ui.store.path` directory when 
> SparkContext.stop()
> ---
>
> Key: SPARK-41694
> URL: https://issues.apache.org/jira/browse/SPARK-41694
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> The {{spark.ui.store.path}} directory is not cleaned up when
> {{SparkContext.stop()}} is called. As a result:
>  # The disk space occupied by the {{spark.ui.store.path}} directory will
> continue to grow.
>  # When submitting a new app that reuses the {{spark.ui.store.path}}
> directory, we will see content related to the previous app, which is a bit strange.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for Scala 2.13

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654723#comment-17654723
 ] 

Apache Spark commented on SPARK-41890:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39399

> Reduce `toSeq` in 
> `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for 
> Scala 2.13
> --
>
> Key: SPARK-41890
> URL: https://issues.apache.org/jira/browse/SPARK-41890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Similar work as SPARK-41709



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for Scala 2.13

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41890:


Assignee: (was: Apache Spark)

> Reduce `toSeq` in 
> `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for 
> Scala 2.13
> --
>
> Key: SPARK-41890
> URL: https://issues.apache.org/jira/browse/SPARK-41890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Similar work as SPARK-41709



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for Scala 2.13

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41890:


Assignee: Apache Spark

> Reduce `toSeq` in 
> `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for 
> Scala 2.13
> --
>
> Key: SPARK-41890
> URL: https://issues.apache.org/jira/browse/SPARK-41890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Similar work as SPARK-41709



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for Scala 2.13

2023-01-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-41890:
-
Description: Similar work to SPARK-41709  (was: Similar work as SPARK-41709)

> Reduce `toSeq` in 
> `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for 
> Scala 2.13
> --
>
> Key: SPARK-41890
> URL: https://issues.apache.org/jira/browse/SPARK-41890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Similar work to SPARK-41709



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for Scala 2.13

2023-01-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-41890:
-
Summary: Reduce `toSeq` in 
`RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for Scala 
2.13  (was: Reduce `toSeq` in 
`RDDOperationGraphWrapperSerializer`/`sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer`
 for Scala 2.13)

> Reduce `toSeq` in 
> `RDDOperationGraphWrapperSerializer`/SparkPlanGraphWrapperSerializer` for 
> Scala 2.13
> --
>
> Key: SPARK-41890
> URL: https://issues.apache.org/jira/browse/SPARK-41890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> Similar work as SPARK-41709



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41890) Reduce `toSeq` in `RDDOperationGraphWrapperSerializer`/`sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer` for Scala 2.13

2023-01-04 Thread Yang Jie (Jira)
Yang Jie created SPARK-41890:


 Summary: Reduce `toSeq` in 
`RDDOperationGraphWrapperSerializer`/`sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer`
 for Scala 2.13
 Key: SPARK-41890
 URL: https://issues.apache.org/jira/browse/SPARK-41890
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL, Web UI
Affects Versions: 3.4.0
Reporter: Yang Jie


Similar work as SPARK-41709



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41829) Implement Dataframe.sort,sortWithinPartitions Ordering

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41829:


Assignee: (was: Apache Spark)

> Implement Dataframe.sort,sortWithinPartitions Ordering
> --
>
> Key: SPARK-41829
> URL: https://issues.apache.org/jira/browse/SPARK-41829
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 422, in pyspark.sql.connect.dataframe.DataFrame.sort
> Failed example:
>     df.orderBy(["age", "name"], ascending=[False, False]).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sort[...]>", line 1, in <module>
>         df.orderBy(["age", "name"], ascending=[False, False]).show()
>     TypeError: DataFrame.sort() got an unexpected keyword argument 'ascending'
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 379, in pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions
> Failed example:
>     df.sortWithinPartitions("age", ascending=False)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions[1]>", line 1, in <module>
>         df.sortWithinPartitions("age", ascending=False)
>     TypeError: DataFrame.sortWithinPartitions() got an unexpected keyword 
> argument 'ascending'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41829) Implement Dataframe.sort,sortWithinPartitions Ordering

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41829:


Assignee: Apache Spark

> Implement Dataframe.sort,sortWithinPartitions Ordering
> --
>
> Key: SPARK-41829
> URL: https://issues.apache.org/jira/browse/SPARK-41829
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 422, in pyspark.sql.connect.dataframe.DataFrame.sort
> Failed example:
>     df.orderBy(["age", "name"], ascending=[False, False]).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sort[...]>", line 1, in <module>
>         df.orderBy(["age", "name"], ascending=[False, False]).show()
>     TypeError: DataFrame.sort() got an unexpected keyword argument 'ascending'
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 379, in pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions
> Failed example:
>     df.sortWithinPartitions("age", ascending=False)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions[1]>", line 1, in <module>
>         df.sortWithinPartitions("age", ascending=False)
>     TypeError: DataFrame.sortWithinPartitions() got an unexpected keyword 
> argument 'ascending'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41829) Implement Dataframe.sort,sortWithinPartitions Ordering

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654718#comment-17654718
 ] 

Apache Spark commented on SPARK-41829:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39398

> Implement Dataframe.sort,sortWithinPartitions Ordering
> --
>
> Key: SPARK-41829
> URL: https://issues.apache.org/jira/browse/SPARK-41829
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 422, in pyspark.sql.connect.dataframe.DataFrame.sort
> Failed example:
>     df.orderBy(["age", "name"], ascending=[False, False]).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sort[...]>", line 1, in <module>
>         df.orderBy(["age", "name"], ascending=[False, False]).show()
>     TypeError: DataFrame.sort() got an unexpected keyword argument 'ascending'
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 379, in pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions
> Failed example:
>     df.sortWithinPartitions("age", ascending=False)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.sortWithinPartitions[1]>", line 1, in <module>
>         df.sortWithinPartitions("age", ascending=False)
>     TypeError: DataFrame.sortWithinPartitions() got an unexpected keyword 
> argument 'ascending'{code}
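
Until the keyword is supported in Connect, a minimal workaround sketch (classic PySpark API, toy data assumed) is to pass descending column expressions directly:

{code:python}
# Workaround sketch: express descending order with column expressions
# instead of the `ascending` keyword that Connect currently rejects.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])

# Equivalent to df.orderBy(["age", "name"], ascending=[False, False])
df.orderBy(col("age").desc(), col("name").desc()).show()

# Equivalent to df.sortWithinPartitions("age", ascending=False)
df.sortWithinPartitions(col("age").desc()).show()
{code}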



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41889) Attach root cause to invalidPatternError

2023-01-04 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-41889:

Summary: Attach root cause to invalidPatternError  (was: Attach root cause 
to INVALID_PARAMETER_VALUE)

> Attach root cause to invalidPatternError
> 
>
> Key: SPARK-41889
> URL: https://issues.apache.org/jira/browse/SPARK-41889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41889) Attach root cause to INVALID_PARAMETER_VALUE

2023-01-04 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654717#comment-17654717
 ] 

BingKun Pan commented on SPARK-41889:
-

I will work on it.

> Attach root cause to INVALID_PARAMETER_VALUE
> 
>
> Key: SPARK-41889
> URL: https://issues.apache.org/jira/browse/SPARK-41889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41825) DataFrame.show formatting int as double

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41825:


Assignee: Ruifeng Zheng

> DataFrame.show formatting int as double
> ---
>
> Key: SPARK-41825
> URL: https://issues.apache.org/jira/browse/SPARK-41825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 650, in pyspark.sql.connect.dataframe.DataFrame.fillna
> Failed example:
>     df.na.fill(50).show()
> Expected:
>     +---+--+-++
>     |age|height| name|bool|
>     +---+--+-++
>     | 10|  80.5|Alice|null|
>     |  5|  50.0|  Bob|null|
>     | 50|  50.0|  Tom|null|
>     | 50|  50.0| null|true|
>     +---+--+-++
> Got:
>     ++--+-++
>     | age|height| name|bool|
>     ++--+-++
>     |10.0|  80.5|Alice|null|
>     | 5.0|  50.0|  Bob|null|
>     |50.0|  50.0|  Tom|null|
>     |50.0|  50.0| null|true|
>     ++--+-++
>     {code}
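
For reference, a hedged self-contained repro sketch of the expected behavior (toy data mirroring the doctest):

{code:python}
# Sketch of the expected na.fill behavior: filling with an int must not
# widen the integer column to double.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(10, 80.5, "Alice", None), (5, None, "Bob", None),
     (None, None, "Tom", None), (None, None, None, True)],
    ["age", "height", "name", "bool"],
)

df.na.fill(50).show()         # `age` should print as 10, 5, 50, 50, not 10.0, ...
df.na.fill(50).printSchema()  # age stays long, height stays double
{code}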



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41825) DataFrame.show formatting int as double

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41825.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39396
[https://github.com/apache/spark/pull/39396]

> DataFrame.show formatting int as double
> ---
>
> Key: SPARK-41825
> URL: https://issues.apache.org/jira/browse/SPARK-41825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 650, in pyspark.sql.connect.dataframe.DataFrame.fillna
> Failed example:
>     df.na.fill(50).show()
> Expected:
>     +---+--+-++
>     |age|height| name|bool|
>     +---+--+-++
>     | 10|  80.5|Alice|null|
>     |  5|  50.0|  Bob|null|
>     | 50|  50.0|  Tom|null|
>     | 50|  50.0| null|true|
>     +---+--+-++
> Got:
>     ++--+-++
>     | age|height| name|bool|
>     ++--+-++
>     |10.0|  80.5|Alice|null|
>     | 5.0|  50.0|  Bob|null|
>     |50.0|  50.0|  Tom|null|
>     |50.0|  50.0| null|true|
>     ++--+-++
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41889) Attach root cause to INVALID_PARAMETER_VALUE

2023-01-04 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-41889:
---

 Summary: Attach root cause to INVALID_PARAMETER_VALUE
 Key: SPARK-41889
 URL: https://issues.apache.org/jira/browse/SPARK-41889
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41888) Support StreamingQueryListener for DataFrame.observe

2023-01-04 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-41888:
---
Summary: Support StreamingQueryListener for DataFrame.observe  (was: 
Support StreamingQueryListener for connect)

> Support StreamingQueryListener for DataFrame.observe
> 
>
> Key: SPARK-41888
> URL: https://issues.apache.org/jira/browse/SPARK-41888
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> **
> File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 619, in pyspark.sql.connect.dataframe.DataFrame.observe
> Failed example:
>     observation.get
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/lib/python3.9/doctest.py", line 1336, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.observe[...]>", line 1, in <module>
>         observation.get
>       File "/__w/spark/spark/python/pyspark/sql/utils.py", line 378, in wrapped
>         raise NotImplementedError()
>     NotImplementedError
> **
> File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 642, in pyspark.sql.connect.dataframe.DataFrame.observe
> Failed example:
>     spark.streams.addListener(MyErrorListener())
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/lib/python3.9/doctest.py", line 1336, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.observe[...]>", line 1, in <module>
>         spark.streams.addListener(MyErrorListener())
>     AttributeError: 'SparkSession' object has no attribute 'streams'
> **
> {code}
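
For context, a minimal sketch of the batch `DataFrame.observe`/`Observation` pattern the first failing doctest exercises (metric names are illustrative); the second failure additionally needs `spark.streams`, which the Connect session does not yet expose:

{code:python}
# Sketch of the observe pattern with toy data; metrics are populated
# only after an action runs on the observed DataFrame.
from pyspark.sql import Observation, SparkSession
from pyspark.sql.functions import count, lit
from pyspark.sql.functions import max as max_

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])

observation = Observation("my_metrics")
observed = df.observe(
    observation, count(lit(1)).alias("rows"), max_("age").alias("max_age")
)
observed.count()        # an action must run before the metrics are available
print(observation.get)  # e.g. {'rows': 2, 'max_age': 5}
{code}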



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41888) Support StreamingQueryListener for connect

2023-01-04 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-41888:
---
Description: 

{code:java}
**
File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 619, in pyspark.sql.connect.dataframe.DataFrame.observe
Failed example:
    observation.get
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.connect.dataframe.DataFrame.observe[...]>", line 1, in <module>
        observation.get
      File "/__w/spark/spark/python/pyspark/sql/utils.py", line 378, in wrapped
        raise NotImplementedError()
    NotImplementedError
**
File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 642, in pyspark.sql.connect.dataframe.DataFrame.observe
Failed example:
    spark.streams.addListener(MyErrorListener())
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.connect.dataframe.DataFrame.observe[...]>", line 1, in <module>
        spark.streams.addListener(MyErrorListener())
    AttributeError: 'SparkSession' object has no attribute 'streams'
**
{code}


> Support StreamingQueryListener for connect
> --
>
> Key: SPARK-41888
> URL: https://issues.apache.org/jira/browse/SPARK-41888
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> {code:java}
> **
> File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 619, in pyspark.sql.connect.dataframe.DataFrame.observe
> Failed example:
>     observation.get
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/lib/python3.9/doctest.py", line 1336, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.observe[...]>", line 1, in <module>
>         observation.get
>       File "/__w/spark/spark/python/pyspark/sql/utils.py", line 378, in wrapped
>         raise NotImplementedError()
>     NotImplementedError
> **
> File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 642, in pyspark.sql.connect.dataframe.DataFrame.observe
> Failed example:
>     spark.streams.addListener(MyErrorListener())
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/lib/python3.9/doctest.py", line 1336, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.observe[...]>", line 1, in <module>
>         spark.streams.addListener(MyErrorListener())
>     AttributeError: 'SparkSession' object has no attribute 'streams'
> **
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41888) Support StreamingQueryListener for connect

2023-01-04 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-41888:
--

 Summary: Support StreamingQueryListener for connect
 Key: SPARK-41888
 URL: https://issues.apache.org/jira/browse/SPARK-41888
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-04 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41887:
-

 Summary: Support DataFrame hint parameter to be list
 Key: SPARK-41887
 URL: https://issues.apache.org/jira/browse/SPARK-41887
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 556, in test_extended_hint_types
    hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 482, in hint
    raise TypeError(
TypeError: param should be a int or str, but got float 1.2345{code}
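
For reference, a hedged sketch of the call shapes the classic API accepts and Connect is expected to match (the hint name "my awesome hint" is illustrative, not a real optimizer hint):

{code:python}
# Sketch of hint parameter types: str, int, float and list should all
# pass type checking in DataFrame.hint.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).toDF("id")
other = spark.range(10).toDF("id")

# str, int, float and list parameters in one call.
hinted = df.hint("my awesome hint", 1.2345, "what", ["itworks1", "itworks2"])

# A realistic hint: broadcast the right side of a join.
df.join(other.hint("broadcast"), "id").count()
{code}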



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41887:
--
Description: 
{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}

  was:
{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 556, in test_extended_hint_types
    hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 482, in hint
    raise TypeError(
TypeError: param should be a int or str, but got float 1.2345{code}


> Support DataFrame hint parameter to be list
> ---
>
> Key: SPARK-41887
> URL: https://issues.apache.org/jira/browse/SPARK-41887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41871) DataFrame hint parameter can be str, float or int

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41871:
--
Summary: DataFrame hint parameter can be str, float or int  (was: DataFrame 
hint parameter can be str, list, float or int)

> DataFrame hint parameter can be str, float or int
> -
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41825) DataFrame.show formatting int as double

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654697#comment-17654697
 ] 

Apache Spark commented on SPARK-41825:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39396

> DataFrame.show formatting int as double
> ---
>
> Key: SPARK-41825
> URL: https://issues.apache.org/jira/browse/SPARK-41825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 650, in pyspark.sql.connect.dataframe.DataFrame.fillna
> Failed example:
>     df.na.fill(50).show()
> Expected:
>     +---+--+-++
>     |age|height| name|bool|
>     +---+--+-++
>     | 10|  80.5|Alice|null|
>     |  5|  50.0|  Bob|null|
>     | 50|  50.0|  Tom|null|
>     | 50|  50.0| null|true|
>     +---+--+-++
> Got:
>     ++--+-++
>     | age|height| name|bool|
>     ++--+-++
>     |10.0|  80.5|Alice|null|
>     | 5.0|  50.0|  Bob|null|
>     |50.0|  50.0|  Tom|null|
>     |50.0|  50.0| null|true|
>     ++--+-++
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41825) DataFrame.show formatting int as double

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41825:


Assignee: Apache Spark

> DataFrame.show formatting int as double
> ---
>
> Key: SPARK-41825
> URL: https://issues.apache.org/jira/browse/SPARK-41825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 650, in pyspark.sql.connect.dataframe.DataFrame.fillna
> Failed example:
>     df.na.fill(50).show()
> Expected:
>     +---+--+-++
>     |age|height| name|bool|
>     +---+--+-++
>     | 10|  80.5|Alice|null|
>     |  5|  50.0|  Bob|null|
>     | 50|  50.0|  Tom|null|
>     | 50|  50.0| null|true|
>     +---+--+-++
> Got:
>     ++--+-++
>     | age|height| name|bool|
>     ++--+-++
>     |10.0|  80.5|Alice|null|
>     | 5.0|  50.0|  Bob|null|
>     |50.0|  50.0|  Tom|null|
>     |50.0|  50.0| null|true|
>     ++--+-++
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41825) DataFrame.show formatting int as double

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654698#comment-17654698
 ] 

Apache Spark commented on SPARK-41825:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39396

> DataFrame.show formatting int as double
> ---
>
> Key: SPARK-41825
> URL: https://issues.apache.org/jira/browse/SPARK-41825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 650, in pyspark.sql.connect.dataframe.DataFrame.fillna
> Failed example:
>     df.na.fill(50).show()
> Expected:
>     +---+--+-++
>     |age|height| name|bool|
>     +---+--+-++
>     | 10|  80.5|Alice|null|
>     |  5|  50.0|  Bob|null|
>     | 50|  50.0|  Tom|null|
>     | 50|  50.0| null|true|
>     +---+--+-++
> Got:
>     ++--+-++
>     | age|height| name|bool|
>     ++--+-++
>     |10.0|  80.5|Alice|null|
>     | 5.0|  50.0|  Bob|null|
>     |50.0|  50.0|  Tom|null|
>     |50.0|  50.0| null|true|
>     ++--+-++
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41825) DataFrame.show formatting int as double

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41825:


Assignee: (was: Apache Spark)

> DataFrame.show formatting int as double
> ---
>
> Key: SPARK-41825
> URL: https://issues.apache.org/jira/browse/SPARK-41825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 650, in pyspark.sql.connect.dataframe.DataFrame.fillna
> Failed example:
>     df.na.fill(50).show()
> Expected:
>     +---+--+-++
>     |age|height| name|bool|
>     +---+--+-++
>     | 10|  80.5|Alice|null|
>     |  5|  50.0|  Bob|null|
>     | 50|  50.0|  Tom|null|
>     | 50|  50.0| null|true|
>     +---+--+-++
> Got:
>     ++--+-++
>     | age|height| name|bool|
>     ++--+-++
>     |10.0|  80.5|Alice|null|
>     | 5.0|  50.0|  Bob|null|
>     |50.0|  50.0|  Tom|null|
>     |50.0|  50.0| null|true|
>     ++--+-++
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41886) `DataFrame.intersect` doctest output has different order

2023-01-04 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41886:
-

 Summary: `DataFrame.intersect` doctest output has different order
 Key: SPARK-41886
 URL: https://issues.apache.org/jira/browse/SPARK-41886
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng


Not sure whether this needs to be fixed:


{code:java}
File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect
Failed example:
df1.intersect(df2).show()
Expected:
+---+---+
| C1| C2|
+---+---+
|  b|  3|
|  a|  1|
+---+---+
Got:
+---+---+
| C1| C2|
+---+---+
|  a|  1|
|  b|  3|
+---+---+

**
   1 of   3 in pyspark.sql.connect.dataframe.DataFrame.intersect

{code}
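
If the plan is allowed to return rows in any order, one hedged way to keep such a test stable is to assert on sorted collected rows rather than `show()` output:

{code:python}
# Order-insensitive check sketch: compare sorted collected rows instead of
# relying on the row order that show() happens to print.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])

assert sorted(df1.intersect(df2).collect()) == [Row(C1='a', C2=1), Row(C1='b', C2=3)]
{code}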




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2023-01-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41053:
--
Labels: releasenotes  (was: release-notes)

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: releasenotes
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server(SHS) becomes more scalable for 
> processing large applications by supporting a persistent 
> KV-store(LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> bring memory pressures to the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * *Support storing all the UI data in a persistent KV store.* 
> RocksDB/LevelDB provide low memory overhead, and their write/read performance 
> is fast enough to serve the live UI workload. SHS can also leverage 
> the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of live UI. As for event logs, 
> it is optional. The current serializer for UI data is JSON. When writing 
> persistent KV-store, there is GZip compression. Since there is compression 
> support in RocksDB/LevelDB, the new serializer won’t compress the output 
> before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, rather than both LevelDB & 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41286) Build, package and infrastructure for Spark Connect

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41286.
--
Resolution: Done

I am going to mark it as done for now.

> Build, package and infrastructure for Spark Connect
> ---
>
> Key: SPARK-41286
> URL: https://issues.apache.org/jira/browse/SPARK-41286
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41841) Support PyPI packaging without JVM

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41841.
--
Resolution: Later

> Support PyPI packaging without JVM
> --
>
> Key: SPARK-41841
> URL: https://issues.apache.org/jira/browse/SPARK-41841
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> We should support pip install pyspark without the JVM so Spark Connect can be 
> a truly lightweight library.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41878) Add JIRAs or messages for skipped messages

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41878.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39382
[https://github.com/apache/spark/pull/39382]

> Add JIRAs or messages for skipped messages
> --
>
> Key: SPARK-41878
> URL: https://issues.apache.org/jira/browse/SPARK-41878
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> Add JIRAs or messages for all the skipped messages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41881) `DataFrame.collect` should handle None/NaN properly

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41881:


Assignee: Ruifeng Zheng

> `DataFrame.collect` should handle None/NaN properly
> ---
>
> Key: SPARK-41881
> URL: https://issues.apache.org/jira/browse/SPARK-41881
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41815) Column.isNull returns nan instead of None

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41815:


Assignee: Ruifeng Zheng

> Column.isNull returns nan instead of None
> -
>
> Key: SPARK-41815
> URL: https://issues.apache.org/jira/browse/SPARK-41815
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/column.py", line 99, in 
> pyspark.sql.connect.column.Column.isNull
> Failed example:
> df.filter(df.height.isNull()).collect()
> Expected:
> [Row(name='Alice', height=None)]
> Got:
> [Row(name='Alice', height=nan)]
> {code}
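
The underlying distinction, in a hedged self-contained sketch (toy data assumed): SQL NULL and float NaN are different values, and `collect()` should preserve that difference:

{code:python}
# Sketch distinguishing SQL NULL from float NaN: the two values match
# different predicates and should round-trip differently through collect().
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", None), ("Bob", float("nan")), ("Tom", 80.5)], ["name", "height"]
)

df.filter(df.height.isNull()).show()  # should match only Alice (the true NULL)
df.filter(isnan(df.height)).show()    # should match only Bob (the NaN)
{code}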



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41815) Column.isNull returns nan instead of None

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41815.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39386
[https://github.com/apache/spark/pull/39386]

> Column.isNull returns nan instead of None
> -
>
> Key: SPARK-41815
> URL: https://issues.apache.org/jira/browse/SPARK-41815
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/column.py", line 99, in 
> pyspark.sql.connect.column.Column.isNull
> Failed example:
> df.filter(df.height.isNull()).collect()
> Expected:
> [Row(name='Alice', height=None)]
> Got:
> [Row(name='Alice', height=nan)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41833) DataFrame.collect() output parity with pyspark

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41833.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39386
[https://github.com/apache/spark/pull/39386]

> DataFrame.collect() output parity with pyspark
> --
>
> Key: SPARK-41833
> URL: https://issues.apache.org/jira/browse/SPARK-41833
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> **        
>   
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1117, in pyspark.sql.connect.functions.array
> Failed example:
>     df.select(array('age', 'age').alias("arr")).collect()
> Expected:
>     [Row(arr=[2, 2]), Row(arr=[5, 5])]
> Got:
>     [Row(arr=array([2, 2])), Row(arr=array([5, 5]))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1119, in pyspark.sql.connect.functions.array
> Failed example:
>     df.select(array([df.age, df.age]).alias("arr")).collect()
> Expected:
>     [Row(arr=[2, 2]), Row(arr=[5, 5])]
> Got:
>     [Row(arr=array([2, 2])), Row(arr=array([5, 5]))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1124, in pyspark.sql.connect.functions.array_distinct
> Failed example:
>     df.select(array_distinct(df.data)).collect()
> Expected:
>     [Row(array_distinct(data)=[1, 2, 3]), Row(array_distinct(data)=[4, 5])]
> Got:
>     [Row(array_distinct(data)=array([1, 2, 3])), 
> Row(array_distinct(data)=array([4, 5]))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1135, in pyspark.sql.connect.functions.array_except
> Failed example:
>     df.select(array_except(df.c1, df.c2)).collect()
> Expected:
>     [Row(array_except(c1, c2)=['b'])]
> Got:
>     [Row(array_except(c1, c2)=array(['b'], dtype=object))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1142, in pyspark.sql.connect.functions.array_intersect
> Failed example:
>     df.select(array_intersect(df.c1, df.c2)).collect()
> Expected:
>     [Row(array_intersect(c1, c2)=['a', 'c'])]
> Got:
>     [Row(array_intersect(c1, c2)=array(['a', 'c'], dtype=object))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1180, in pyspark.sql.connect.functions.array_remove
> Failed example:
>     df.select(array_remove(df.data, 1)).collect()
> Expected:
>     [Row(array_remove(data, 1)=[2, 3]), Row(array_remove(data, 1)=[])]
> Got:
>     [Row(array_remove(data, 1)=array([2, 3])), Row(array_remove(data, 
> 1)=array([], dtype=int64))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1187, in pyspark.sql.connect.functions.array_repeat
> Failed example:
>     df.select(array_repeat(df.data, 3).alias('r')).collect()
> Expected:
>     [Row(r=['ab', 'ab', 'ab'])]
> Got:
>     [Row(r=array(['ab', 'ab', 'ab'], dtype=object))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1204, in pyspark.sql.connect.functions.array_sort
> Failed example:
>     df.select(array_sort(df.data).alias('r')).collect()
> Expected:
>     [Row(r=[1, 2, 3, None]), Row(r=[1]), Row(r=[])]
> Got:
>     [Row(r=array([ 1.,  2.,  3., nan])), Row(r=array([1])), Row(r=array([], 
> dtype=int64))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1207, in pyspark.sql.connect.functions.array_sort
> Failed example:
>     df.select(array_sort(
>         "data",
>         lambda x, y: when(x.isNull() | y.isNull(), 
> lit(0)).otherwise(length(y) - length(x))
>     ).alias("r")).collect()
> Expected:
>     [Row(r=['foobar', 'foo', None, 'bar']), Row(r=['foo']), Row(r=[])]
> Got:
>     [Row(r=array(['foobar', 'foo', None, 'bar'], dtype=object)), 
> Row(r=array(['foo'], dtype=object)), Row(r=array([], dtype=object))]
> 
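
The failures above show Connect returning numpy arrays where classic PySpark returns plain lists. Until parity is reached, a hypothetical test-side normalization sketch (the helper name and the numpy dependency are assumptions, not part of the proposed fix):

{code:python}
# Hypothetical normalization sketch: convert ndarray values in collected
# rows to plain Python lists so classic and Connect results compare equal.
import numpy as np
from pyspark.sql import Row

def normalize(row: Row) -> Row:
    return Row(**{
        k: (v.tolist() if isinstance(v, np.ndarray) else v)
        for k, v in row.asDict().items()
    })

assert normalize(Row(arr=np.array([2, 2]))) == Row(arr=[2, 2])
{code}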

[jira] [Assigned] (SPARK-41833) DataFrame.collect() output parity with pyspark

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41833:


Assignee: Ruifeng Zheng

> DataFrame.collect() output parity with pyspark
> --
>
> Key: SPARK-41833
> URL: https://issues.apache.org/jira/browse/SPARK-41833
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code:java}
> **        
>   
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1117, in pyspark.sql.connect.functions.array
> Failed example:
>     df.select(array('age', 'age').alias("arr")).collect()
> Expected:
>     [Row(arr=[2, 2]), Row(arr=[5, 5])]
> Got:
>     [Row(arr=array([2, 2])), Row(arr=array([5, 5]))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1119, in pyspark.sql.connect.functions.array
> Failed example:
>     df.select(array([df.age, df.age]).alias("arr")).collect()
> Expected:
>     [Row(arr=[2, 2]), Row(arr=[5, 5])]
> Got:
>     [Row(arr=array([2, 2])), Row(arr=array([5, 5]))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1124, in pyspark.sql.connect.functions.array_distinct
> Failed example:
>     df.select(array_distinct(df.data)).collect()
> Expected:
>     [Row(array_distinct(data)=[1, 2, 3]), Row(array_distinct(data)=[4, 5])]
> Got:
>     [Row(array_distinct(data)=array([1, 2, 3])), 
> Row(array_distinct(data)=array([4, 5]))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1135, in pyspark.sql.connect.functions.array_except
> Failed example:
>     df.select(array_except(df.c1, df.c2)).collect()
> Expected:
>     [Row(array_except(c1, c2)=['b'])]
> Got:
>     [Row(array_except(c1, c2)=array(['b'], dtype=object))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1142, in pyspark.sql.connect.functions.array_intersect
> Failed example:
>     df.select(array_intersect(df.c1, df.c2)).collect()
> Expected:
>     [Row(array_intersect(c1, c2)=['a', 'c'])]
> Got:
>     [Row(array_intersect(c1, c2)=array(['a', 'c'], dtype=object))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1180, in pyspark.sql.connect.functions.array_remove
> Failed example:
>     df.select(array_remove(df.data, 1)).collect()
> Expected:
>     [Row(array_remove(data, 1)=[2, 3]), Row(array_remove(data, 1)=[])]
> Got:
>     [Row(array_remove(data, 1)=array([2, 3])), Row(array_remove(data, 
> 1)=array([], dtype=int64))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1187, in pyspark.sql.connect.functions.array_repeat
> Failed example:
>     df.select(array_repeat(df.data, 3).alias('r')).collect()
> Expected:
>     [Row(r=['ab', 'ab', 'ab'])]
> Got:
>     [Row(r=array(['ab', 'ab', 'ab'], dtype=object))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1204, in pyspark.sql.connect.functions.array_sort
> Failed example:
>     df.select(array_sort(df.data).alias('r')).collect()
> Expected:
>     [Row(r=[1, 2, 3, None]), Row(r=[1]), Row(r=[])]
> Got:
>     [Row(r=array([ 1.,  2.,  3., nan])), Row(r=array([1])), Row(r=array([], 
> dtype=int64))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1207, in pyspark.sql.connect.functions.array_sort
> Failed example:
>     df.select(array_sort(
>         "data",
>         lambda x, y: when(x.isNull() | y.isNull(), 
> lit(0)).otherwise(length(y) - length(x))
>     ).alias("r")).collect()
> Expected:
>     [Row(r=['foobar', 'foo', None, 'bar']), Row(r=['foo']), Row(r=[])]
> Got:
>     [Row(r=array(['foobar', 'foo', None, 'bar'], dtype=object)), 
> Row(r=array(['foo'], dtype=object)), Row(r=array([], dtype=object))]
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 

[jira] [Resolved] (SPARK-41881) `DataFrame.collect` should handle None/NaN properly

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41881.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39386
[https://github.com/apache/spark/pull/39386]

> `DataFrame.collect` should handle None/NaN properly
> ---
>
> Key: SPARK-41881
> URL: https://issues.apache.org/jira/browse/SPARK-41881
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41846) DataFrame windowspec functions : unresolved columns

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41846.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39392
[https://github.com/apache/spark/pull/39392]

> DataFrame windowspec functions : unresolved columns
> ---
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L] {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L] {code}
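
For reference, a minimal runnable sketch of the intended window pattern, with toy data where the ordering column actually resolves (the errors above indicate the plan saw `_1` where `value` was expected):

{code:python}
# Window pattern sketch: the window must order by a column that resolves
# against the DataFrame it is applied to.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import cume_dist, rank

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["name", "value"])

w = Window.partitionBy("name").orderBy("value")
df.withColumn("drank", rank().over(w)).show()
df.withColumn("cd", cume_dist().over(w)).show()
{code}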



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (SPARK-41846) DataFrame windowspec functions : unresolved columns

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41846:


Assignee: Ruifeng Zheng  (was: Sandeep Singh)

> DataFrame windowspec functions : unresolved columns
> ---
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L] {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: 

[jira] [Assigned] (SPARK-41846) DataFrame windowspec functions : unresolved columns

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41846:


Assignee: Sandeep Singh

> DataFrame windowspec functions : unresolved columns
> ---
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L] {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Assigned] (SPARK-41840) DataFrame.show(): 'Column' object is not callable

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41840:


Assignee: Ruifeng Zheng

> DataFrame.show(): 'Column' object is not callable
> -
>
> Key: SPARK-41840
> URL: https://issues.apache.org/jira/browse/SPARK-41840
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 855, in pyspark.sql.connect.functions.first
> Failed example:
>     df.groupby("name").agg(first("age", 
> ignorenulls=True)).orderBy("name").show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.groupby("name").agg(first("age", 
> ignorenulls=True)).orderBy("name").show()
>     TypeError: 'Column' object is not callable{code}
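
For context, the classic API exposes `first(col, ignorenulls=False)` as a plain function; a minimal sketch of the failing aggregation with toy data:

{code:python}
# Sketch of the aggregation the doctest runs; with ignorenulls=True the
# leading NULL for Alice is skipped.
from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", None), ("Alice", 2), ("Bob", 5)], ["name", "age"]
)

df.groupby("name").agg(first("age", ignorenulls=True)).orderBy("name").show()
{code}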



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41840) DataFrame.show(): 'Column' object is not callable

2023-01-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41840.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39390
[https://github.com/apache/spark/pull/39390]

> DataFrame.show(): 'Column' object is not callable
> -
>
> Key: SPARK-41840
> URL: https://issues.apache.org/jira/browse/SPARK-41840
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 855, in pyspark.sql.connect.functions.first
> Failed example:
>     df.groupby("name").agg(first("age", 
> ignorenulls=True)).orderBy("name").show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.groupby("name").agg(first("age", 
> ignorenulls=True)).orderBy("name").show()
>     TypeError: 'Column' object is not callable{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41677) Protobuf serializer for StreamingQueryProgressWrapper

2023-01-04 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-41677:
--

Assignee: Yang Jie

> Protobuf serializer for StreamingQueryProgressWrapper
> -
>
> Key: SPARK-41677
> URL: https://issues.apache.org/jira/browse/SPARK-41677
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41677) Protobuf serializer for StreamingQueryProgressWrapper

2023-01-04 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41677.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39357
[https://github.com/apache/spark/pull/39357]

> Protobuf serializer for StreamingQueryProgressWrapper
> -
>
> Key: SPARK-41677
> URL: https://issues.apache.org/jira/browse/SPARK-41677
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41768) Refactor the definition of enum - `JobExecutionStatus` to follow the code style

2023-01-04 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41768.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39286
[https://github.com/apache/spark/pull/39286]

> Refactor the definition of enum - `JobExecutionStatus` to follow the 
> code style 
> -
>
> Key: SPARK-41768
> URL: https://issues.apache.org/jira/browse/SPARK-41768
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41768) Refactor the definition of enum - `JobExecutionStatus` to follow the code style

2023-01-04 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-41768:
--

Assignee: BingKun Pan

> Refactor the definition of enum - `JobExecutionStatus` to follow the 
> code style 
> -
>
> Key: SPARK-41768
> URL: https://issues.apache.org/jira/browse/SPARK-41768
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41573) Assign name to _LEGACY_ERROR_TEMP_2136

2023-01-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41573:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2136
> --
>
> Key: SPARK-41573
> URL: https://issues.apache.org/jira/browse/SPARK-41573
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two 
> people working on the same ticket at the same time



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41573) Assign name to _LEGACY_ERROR_TEMP_2136

2023-01-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41573.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39284
[https://github.com/apache/spark/pull/39284]

> Assign name to _LEGACY_ERROR_TEMP_2136
> --
>
> Key: SPARK-41573
> URL: https://issues.apache.org/jira/browse/SPARK-41573
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two 
> people working on the same ticket at the same time



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache

2023-01-04 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654590#comment-17654590
 ] 

Mridul Muralidharan commented on SPARK-41497:
-

Sounds good [~Ngone51], thanks !

> Accumulator undercounting in the case of retry task with rdd cache
> --
>
> Key: SPARK-41497
> URL: https://issues.apache.org/jira/browse/SPARK-41497
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
>Reporter: wuyi
>Priority: Major
>
> An accumulator could be undercounted when a retried task has an rdd cache. See 
> the example below; you can also find a complete, reproducible 
> example at 
> [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc]
>   
> {code:scala}
> test("SPARK-XXX") {
>   // Set up a cluster with 2 executors
>   val conf = new SparkConf()
> .setMaster("local-cluster[2, 1, 
> 1024]").setAppName("TaskSchedulerImplSuite")
>   sc = new SparkContext(conf)
>   // Set up a custom task scheduler. The scheduler will fail the first task 
> attempt of the job
>   // submitted below. In particular, the failed first attempt would succeed 
> at computation
>   // (accumulator accounting, result caching) but fail to report its 
> success status due
>   // to a concurrent executor loss. The second task attempt would succeed.
>   taskScheduler = setupSchedulerWithCustomStatusUpdate(sc)
>   val myAcc = sc.longAccumulator("myAcc")
>   // Initiate an rdd with only one partition so there's only one task, and 
> specify the storage level
>   // MEMORY_ONLY_2 so that the rdd result will be cached on both 
> executors.
>   val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter =>
> myAcc.add(100)
> iter.map(x => x + 1)
>   }.persist(StorageLevel.MEMORY_ONLY_2)
>   // This will pass since the second task attempt will succeed
>   assert(rdd.count() === 10)
>   // This will fail because `myAcc.add(100)` won't be executed during the 
> second task attempt's
>   // execution: the second task attempt will load the rdd cache 
> directly instead of
>   // executing the task function, so `myAcc.add(100)` is skipped.
>   assert(myAcc.value === 100)
> } {code}
>  
> We could also hit this issue with decommissioning even if the rdd has only one 
> copy. For example, decommissioning could migrate the rdd cache block to another 
> executor (the result is effectively the same as having 2 copies), and the 
> decommissioned executor could be lost before the task reports its success 
> status to the driver. 
>  
> The issue is a bit more complicated to fix than expected. I have tried 
> several fixes, but none of them is ideal:
> Option 1: Clean up any rdd cache related to the failed task: in practice, 
> this option can already fix the issue in most cases. However, theoretically, 
> the rdd cache could be reported to the driver right after the driver cleans up 
> the failed task's caches, due to asynchronous communication. So this option 
> can't resolve the issue thoroughly;
> Option 2: Disallow rdd cache reuse across the task attempts for the same 
> task: this option can 100% fix the issue. The problem is that it also affects 
> cases where the rdd cache can safely be reused across attempts (e.g., when 
> there is no accumulator operation in the task), which can cause a perf 
> regression;
> Option 3: Introduce an accumulator cache: first, this requires a new framework 
> for supporting an accumulator cache; second, the driver would need better 
> logic to distinguish whether the cached accumulator value should be reported 
> to the user, to avoid overcounting. For example, in the case of a task retry, 
> the value should be reported; however, in the case of rdd cache reuse, the 
> value shouldn't be reported (should it?);
> Option 4: Do task success validation when a task tries to load the rdd 
> cache: this defines an rdd cache as valid/accessible only if the task has 
> succeeded. This could be either overkill or a bit complex (currently Spark 
> cleans up the task state once it's finished, so we would need 
> to maintain a structure recording whether a task ever succeeded).
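
As a concrete illustration of the mechanism behind this report, here is a minimal, runnable PySpark sketch (plain local mode; it does not reproduce the executor-lost race above, only the cache-skips-accumulator behavior that makes the race matter):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("acc-cache-demo").getOrCreate()
sc = spark.sparkContext

acc = sc.accumulator(0)

def bump(x):
    acc.add(1)  # accumulator update happens only when the task function runs
    return x + 1

rdd = sc.parallelize(range(10), 1).map(bump).cache()

rdd.count()        # computes the single partition: acc.value becomes 10
rdd.count()        # served from the rdd cache: bump() never runs again
print(acc.value)   # still 10: a cached read skips the accumulator update

spark.stop()
{code}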



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40497) Upgrade Scala to 2.13.11

2023-01-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40497:
--
Description: 
We tested and decided to skip the following releases. This issue aims to use 
2.13.11.
- 2022-09-21: v2.13.9 released 
[https://github.com/scala/scala/releases/tag/v2.13.9]
- 2022-10-13: 2.13.10 released 
[https://github.com/scala/scala/releases/tag/v2.13.10]
 

Scala 2.13.11 Milestone
- https://github.com/scala/scala/milestone/100

  was:
We tested and decided to skip the following releases. This issue aims to use 
2.13.11.
- 2022-09-21: v2.13.9 released 
[https://github.com/scala/scala/releases/tag/v2.13.9]
- 2022-10-13: 2.13.10 released 
[https://github.com/scala/scala/releases/tag/v2.13.10]
 


> Upgrade Scala to 2.13.11
> 
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> We tested and decided to skip the following releases. This issue aims to use 
> 2.13.11.
> - 2022-09-21: v2.13.9 released 
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> - 2022-10-13: 2.13.10 released 
> [https://github.com/scala/scala/releases/tag/v2.13.10]
>  
> Scala 2.13.11 Milestone
> - https://github.com/scala/scala/milestone/100



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40497) Upgrade Scala to 2.13.11

2023-01-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40497:
--
Description: 
We tested and decided to skip the following releases. This issue aims to use 
2.13.11.
- 2022-09-21: v2.13.9 released 
[https://github.com/scala/scala/releases/tag/v2.13.9]
- 2022-10-13: 2.13.10 released 
[https://github.com/scala/scala/releases/tag/v2.13.10]
 

  was:
2.13.9 released [https://github.com/scala/scala/releases/tag/v2.13.9]

 


> Upgrade Scala to 2.13.11
> 
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> We tested and decided to skip the following releases. This issue aims to use 
> 2.13.11.
> - 2022-09-21: v2.13.9 released 
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> - 2022-10-13: 2.13.10 released 
> [https://github.com/scala/scala/releases/tag/v2.13.10]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41885) --packages may not work on Windows 11

2023-01-04 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-41885:


 Summary: --packages may not work on Windows 11
 Key: SPARK-41885
 URL: https://issues.apache.org/jira/browse/SPARK-41885
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.1
 Environment: Hadoop 2.7 on Windows 11
Reporter: Shixiong Zhu


Gastón Ortiz reported an issue when using Spark 3.2.1 and Hadoop 2.7 on Windows 
11. See [https://github.com/delta-io/delta/issues/1059]

It looks like the executor cannot fetch the jar files. See the critical stack 
trace below (the full stack trace is in 
[https://github.com/delta-io/delta/issues/1059] ):
{code:java}
org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:366) at 
org.apache.spark.util.Utils$.doFetchFile(Utils.scala:762) at 
org.apache.spark.util.Utils$.fetchFile(Utils.scala:549) at 
org.apache.spark.executor.Executor.$anonfun$updateDependencies$13(Executor.scala:962)
 at 
org.apache.spark.executor.Executor.$anonfun$updateDependencies$13$adapted(Executor.scala:954)
 at 
scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
 at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984) 
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:954)
 at org.apache.spark.executor.Executor.(Executor.scala:247) at  {code}
This is not a Delta Lake issue, as this can be reproduced by running `pyspark 
--packages org.apache.kafka:kafka-clients:2.8.1` as well.

I don't have a Windows 11 environment to debug with, so I helped Gastón Ortiz 
create this ticket; it would be great if anyone who has a Windows 11 
environment could help with this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41575) Assign name to _LEGACY_ERROR_TEMP_2054

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41575:


Assignee: Apache Spark

> Assign name to _LEGACY_ERROR_TEMP_2054
> --
>
> Key: SPARK-41575
> URL: https://issues.apache.org/jira/browse/SPARK-41575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two 
> people working on the same ticket at the same time



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41575) Assign name to _LEGACY_ERROR_TEMP_2054

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654532#comment-17654532
 ] 

Apache Spark commented on SPARK-41575:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39394

> Assign name to _LEGACY_ERROR_TEMP_2054
> --
>
> Key: SPARK-41575
> URL: https://issues.apache.org/jira/browse/SPARK-41575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two 
> people working on the same ticket at the same time



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41575) Assign name to _LEGACY_ERROR_TEMP_2054

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654533#comment-17654533
 ] 

Apache Spark commented on SPARK-41575:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39394

> Assign name to _LEGACY_ERROR_TEMP_2054
> --
>
> Key: SPARK-41575
> URL: https://issues.apache.org/jira/browse/SPARK-41575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two 
> people working on the same ticket at the same time



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41575) Assign name to _LEGACY_ERROR_TEMP_2054

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41575:


Assignee: (was: Apache Spark)

> Assign name to _LEGACY_ERROR_TEMP_2054
> --
>
> Key: SPARK-41575
> URL: https://issues.apache.org/jira/browse/SPARK-41575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two 
> people working on the same ticket at the same time



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41871) DataFrame hint parameter can be str, list, float or int

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654516#comment-17654516
 ] 

Apache Spark commented on SPARK-41871:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39393

> DataFrame hint parameter can be str, list, float or int
> ---
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}
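
A hypothetical sketch of the relaxed validation this ticket asks for, accepting str, list, float and int instead of only int and str. The helper name `_check_hint_parameter` is illustrative only, not Spark's actual internal API:

{code:python}
def _check_hint_parameter(param):
    # Accept every type the classic DataFrame.hint API allows.
    allowed = (str, list, float, int)
    if not isinstance(param, allowed):
        raise TypeError(
            "param should be a str, list, float or int, "
            f"but got {type(param).__name__} {param!r}"
        )

# None of the values from the failing test should raise under the relaxed check.
for p in ("what", 1.2345, ["itworks1", "itworks2", "itworks3"], 7):
    _check_hint_parameter(p)
{code}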



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41871) DataFrame hint parameter can be str, list, float or int

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41871:


Assignee: Apache Spark

> DataFrame hint parameter can be str, list, float or int
> ---
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41871) DataFrame hint parameter can be str, list, float or int

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41871:


Assignee: (was: Apache Spark)

> DataFrame hint parameter can be str, list, float or int
> ---
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41871) DataFrame hint parameter can be str, list, float or int

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654515#comment-17654515
 ] 

Apache Spark commented on SPARK-41871:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39393

> DataFrame hint parameter can be str, list, float or int
> ---
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41884) DataFrame `toPandas` parity in return types

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41884:
--
Description: 
{code:java}
import numpy as np
import pandas as pd

df = self.spark.createDataFrame(
[[[("a", 2, 3.0), ("a", 2, 3.0)]], [[("b", 5, 6.0), ("b", 5, 6.0)]]],
"array_struct_col Array>",
)
for is_arrow_enabled in [True, False]:
with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": 
is_arrow_enabled}):
pdf = df.toPandas()
self.assertEqual(type(pdf), pd.DataFrame)
self.assertEqual(type(pdf["array_struct_col"]), pd.Series)
if is_arrow_enabled:
self.assertEqual(type(pdf["array_struct_col"][0]), np.ndarray)
else:
self.assertEqual(type(pdf["array_struct_col"][0]), list){code}
{code:java}
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/test_dataframe.py", line 1202, in test_to_pandas_for_array_of_struct
    df = self.spark.createDataFrame(
  File "/__w/spark/spark/python/pyspark/sql/connect/session.py", line 264, in createDataFrame
    table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in _data])
  File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist
  File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object{code}
 
{code:java}
import numpy as np

pdf = self._to_pandas()
types = pdf.dtypes
self.assertEqual(types[0], np.int32)
self.assertEqual(types[1], np.object)
self.assertEqual(types[2], np.bool)
self.assertEqual(types[3], np.float32)
self.assertEqual(types[4], np.object)  # datetime.date
self.assertEqual(types[5], "datetime64[ns]")
self.assertEqual(types[6], "datetime64[ns]")
self.assertEqual(types[7], "timedelta64[ns]") {code}
{code:java}
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/test_dataframe.py", line 1039, in test_to_pandas
    self.assertEqual(types[5], "datetime64[ns]")
AssertionError: datetime64[ns, Etc/UTC] != 'datetime64[ns]'
{code}

  was:
{code:java}
schema = StructType(
[StructField("i", StringType(), True), StructField("j", IntegerType(), 
True)]
)
df = self.spark.createDataFrame([("a", 1)], schema)

schema1 = StructType([StructField("j", StringType()), StructField("i", 
StringType())])
df1 = df.to(schema1)
self.assertEqual(schema1, df1.schema)
self.assertEqual(df.count(), df1.count())

schema2 = StructType([StructField("j", LongType())])
df2 = df.to(schema2)
self.assertEqual(schema2, df2.schema)
self.assertEqual(df.count(), df2.count())

schema3 = StructType([StructField("struct", schema1, False)])
df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
self.assertEqual(schema3, df3.schema)
self.assertEqual(df.count(), df3.count())

# incompatible field nullability
schema4 = StructType([StructField("j", LongType(), False)])
self.assertRaisesRegex(
AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1486, in test_to
    self.assertRaisesRegex(
AssertionError: AnalysisException not raised by  {code}
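
An illustrative, pandas-only sketch of the datetime assertion in the updated description above (no Spark involved): a timezone-aware series reports dtype datetime64[ns, Etc/UTC], and dropping the timezone with tz_localize(None) restores the tz-naive dtype the test expects:

{code:python}
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-01-04 12:00:00"])).dt.tz_localize("Etc/UTC")
print(s.dtype)                       # datetime64[ns, Etc/UTC], what Connect returned
print(s.dt.tz_localize(None).dtype)  # datetime64[ns], what the test asserts
{code}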


> DataFrame `toPandas` parity in return types
> ---
>
> Key: SPARK-41884
> URL: https://issues.apache.org/jira/browse/SPARK-41884
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> import numpy as np
> import pandas as pd
> df = self.spark.createDataFrame(
> [[[("a", 2, 3.0), ("a", 2, 3.0)]], [[("b", 5, 6.0), ("b", 5, 6.0)]]],
> "array_struct_col Array>",
> )
> for is_arrow_enabled in [True, False]:
> with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": 
> is_arrow_enabled}):
> pdf = df.toPandas()
> self.assertEqual(type(pdf), pd.DataFrame)
> self.assertEqual(type(pdf["array_struct_col"]), pd.Series)
> if is_arrow_enabled:
> self.assertEqual(type(pdf["array_struct_col"][0]), np.ndarray)
> else:
> self.assertEqual(type(pdf["array_struct_col"][0]), list){code}
> {code:java}
> 

[jira] [Created] (SPARK-41884) DataFrame `toPandas` parity in return types

2023-01-04 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41884:
-

 Summary: DataFrame `toPandas` parity in return types
 Key: SPARK-41884
 URL: https://issues.apache.org/jira/browse/SPARK-41884
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
schema = StructType(
[StructField("i", StringType(), True), StructField("j", IntegerType(), 
True)]
)
df = self.spark.createDataFrame([("a", 1)], schema)

schema1 = StructType([StructField("j", StringType()), StructField("i", 
StringType())])
df1 = df.to(schema1)
self.assertEqual(schema1, df1.schema)
self.assertEqual(df.count(), df1.count())

schema2 = StructType([StructField("j", LongType())])
df2 = df.to(schema2)
self.assertEqual(schema2, df2.schema)
self.assertEqual(df.count(), df2.count())

schema3 = StructType([StructField("struct", schema1, False)])
df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
self.assertEqual(schema3, df3.schema)
self.assertEqual(df.count(), df3.count())

# incompatible field nullability
schema4 = StructType([StructField("j", LongType(), False)])
self.assertRaisesRegex(
AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1486, in test_to
    self.assertRaisesRegex(
AssertionError: AnalysisException not raised by  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39304) ps.read_csv ignore double quotes.

2023-01-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-39304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen resolved SPARK-39304.
-
Resolution: Won't Fix

> ps.read_csv ignore double quotes.
> -
>
> Key: SPARK-39304
> URL: https://issues.apache.org/jira/browse/SPARK-39304
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: Untitled (4).ipynb, csvfile.csv
>
>
> This one comes from the u...@spark.org mailing list, with the subject 
> "Complexity with the data", and also appears on 
> [SO|https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark]
>  
> A notebook and the sample data where this error is tested are attached. 
> Test data:
> Some years,"If your job title needs additional context, please clarify 
> here:","If ""Other,"" please indicate the currency here: "
> 5-7 years,"I started as the Marketing Coordinator, and was given the 
> ""Associate Product Manager"" title as a promotion. My duties remained mostly 
> the same and include graphic design work, marketing, and product management.",
> 8 - 10 years,equivalent to Assistant Registrar,
> 2 - 4 years,"I manage our fundraising department, primarily overseeing our 
> direct mail, planned giving, and grant writing programs. ",
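
For readers hitting the same parse, one common workaround (hedged; this is not a fix confirmed by this ticket): the sample escapes a double quote by doubling it, RFC 4180-style, while Spark's CSV reader defaults to a backslash escape, so point the escape character at '"' as well. "csvfile.csv" below stands in for the attachment:

{code:python}
import pyspark.pandas as ps

# quotechar is already '"' by default; escapechar='"' tells the underlying
# Spark CSV reader that a doubled quote is an escaped quote, not a field end.
df = ps.read_csv("csvfile.csv", quotechar='"', escapechar='"')
print(df.head())
{code}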



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41846) DataFrame windowspec functions: unresolved columns

2023-01-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654446#comment-17654446
 ] 

Apache Spark commented on SPARK-41846:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39392

> DataFrame windowspec functions: unresolved columns
> ---
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L] {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L] {code}
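
For contrast, a short sketch of what these doctests intend (local session assumed): the window orders by a column literally named `value`, so the DataFrame has to expose that name. In the plans above the column surfaced as `_1` instead, which is exactly why `value` could not be resolved:

{code:python}
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import cume_dist

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.range(3).toDF("value")   # the column must really be named `value`
w = Window.orderBy("value")
df.withColumn("cd", cume_dist().over(w)).show()
{code}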



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-41846) DataFrame windowspec functions: unresolved columns

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41846:


Assignee: (was: Apache Spark)

> DataFrame windowspec functions: unresolved columns
> ---
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L] {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41846) DataFrame windowspec functions: unresolved columns

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41846:


Assignee: Apache Spark

> DataFrame windowspec functions: unresolved columns
> ---
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L] {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-41825) DataFrame.show formatting int as double

2023-01-04 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654440#comment-17654440
 ] 

Ruifeng Zheng commented on SPARK-41825:
---

I'll take this one

> DataFrame.show formatting int as double
> ---
>
> Key: SPARK-41825
> URL: https://issues.apache.org/jira/browse/SPARK-41825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 650, in pyspark.sql.connect.dataframe.DataFrame.fillna
> Failed example:
>     df.na.fill(50).show()
> Expected:
>     +---+--+-++
>     |age|height| name|bool|
>     +---+--+-++
>     | 10|  80.5|Alice|null|
>     |  5|  50.0|  Bob|null|
>     | 50|  50.0|  Tom|null|
>     | 50|  50.0| null|true|
>     +---+--+-++
> Got:
>     ++--+-++
>     | age|height| name|bool|
>     ++--+-++
>     |10.0|  80.5|Alice|null|
>     | 5.0|  50.0|  Bob|null|
>     |50.0|  50.0|  Tom|null|
>     |50.0|  50.0| null|true|
>     ++--+-++
>     {code}
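
A minimal repro sketch (local session assumed; the `bool` column from the doctest is omitted for brevity): filling nulls with the int 50 should leave the integer column integral, matching the "Expected" output, whereas the Connect client was rendering it as a double:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [(10, 80.5, "Alice"), (5, None, "Bob"), (None, None, "Tom")],
    ["age", "height", "name"],
)
df.na.fill(50).show()   # `age` should print 10/5/50, not 10.0/5.0/50.0
{code}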



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41883) Upgrade dropwizard metrics 4.2.15

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41883:


Assignee: (was: Apache Spark)

>  Upgrade dropwizard metrics 4.2.15
> --
>
> Key: SPARK-41883
> URL: https://issues.apache.org/jira/browse/SPARK-41883
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41883) Upgrade dropwizard metrics 4.2.15

2023-01-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41883:


Assignee: Apache Spark

>  Upgrade dropwizard metrics 4.2.15
> --
>
> Key: SPARK-41883
> URL: https://issues.apache.org/jira/browse/SPARK-41883
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


