[jira] [Commented] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699298#comment-17699298
 ] 

Apache Spark commented on SPARK-42763:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40384

> Upgrade ZooKeeper from 3.6.3 to 3.6.4
> -
>
> Key: SPARK-42763
> URL: https://issues.apache.org/jira/browse/SPARK-42763
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4

2023-03-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42763.
---
Resolution: Duplicate

> Upgrade ZooKeeper from 3.6.3 to 3.6.4
> -
>
> Key: SPARK-42763
> URL: https://issues.apache.org/jira/browse/SPARK-42763
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Closed] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4

2023-03-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-42763.
-

> Upgrade ZooKeeper from 3.6.3 to 3.6.4
> -
>
> Key: SPARK-42763
> URL: https://issues.apache.org/jira/browse/SPARK-42763
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Resolved] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42762.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40383
[https://github.com/apache/spark/pull/40383]

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
> Fix For: 3.5.0
>
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.






[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42762:
-

Assignee: Holden Karau

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.






[jira] [Created] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4

2023-03-11 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-42763:
---

 Summary: Upgrade ZooKeeper from 3.6.3 to 3.6.4
 Key: SPARK-42763
 URL: https://issues.apache.org/jira/browse/SPARK-42763
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Resolved] (SPARK-42758) Remove dependency on breeze

2023-03-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42758.
--
Fix Version/s: 3.5.0
 Assignee: BingKun Pan
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40378

> Remove dependency on breeze
> ---
>
> Key: SPARK-42758
> URL: https://issues.apache.org/jira/browse/SPARK-42758
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 3.5.0
>
>







[jira] [Updated] (SPARK-42758) Remove dependency on breeze

2023-03-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-42758:
-
Priority: Trivial  (was: Minor)

> Remove dependency on breeze
> ---
>
> Key: SPARK-42758
> URL: https://issues.apache.org/jira/browse/SPARK-42758
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Trivial
>







[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-11 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699291#comment-17699291
 ] 

Yuming Wang commented on SPARK-42760:
-

Could you try disabling AQE (set spark.sql.adaptive.enabled=false)?
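
For readers following along, here is a minimal sketch of that suggestion in PySpark. The helper name `set_aqe` is hypothetical and not from this thread; it only assumes `spark` is an existing SparkSession (or any object exposing the same `conf.get` / `conf.set` calls). AQE is enabled by default from Spark 3.2 onward and can coalesce small shuffle partitions, which may explain the different counts on 3.0.3 and 3.3.2, but that is unconfirmed here.

```python
# Hypothetical helper (not part of Spark or of this thread): toggles
# Adaptive Query Execution on an existing session-like object.
def set_aqe(spark, enabled):
    """Set spark.sql.adaptive.enabled and return the value it had before."""
    key = "spark.sql.adaptive.enabled"
    # conf.get accepts a default; "true" mirrors the Spark 3.2+ default.
    previous = spark.conf.get(key, "true")
    # Runtime SQL configs are stored as strings.
    spark.conf.set(key, "true" if enabled else "false")
    return previous
```

Calling `set_aqe(spark, False)` before re-running `example_shuffle_partitions()` from the report would show whether the partition count after the join goes back to 4.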

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, 
> local mode
>Reporter: binyang
>Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1






[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42762:


Assignee: (was: Apache Spark)

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.






[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42762:


Assignee: Apache Spark

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.






[jira] [Commented] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699280#comment-17699280
 ] 

Apache Spark commented on SPARK-42762:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/40383

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.






[jira] [Commented] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699279#comment-17699279
 ] 

Apache Spark commented on SPARK-42762:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/40383

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.






[jira] [Created] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Holden Karau (Jira)
Holden Karau created SPARK-42762:


 Summary: Improve Logging for disconnects during exec id request
 Key: SPARK-42762
 URL: https://issues.apache.org/jira/browse/SPARK-42762
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.5.0
Reporter: Holden Karau


Improve Logging for disconnects during exec id request to simplify our network 
logs.






[jira] [Commented] (SPARK-42679) createDataFrame doesn't work with non-nullable schema.

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699271#comment-17699271
 ] 

Apache Spark commented on SPARK-42679:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40382

> createDataFrame doesn't work with non-nullable schema.
> --
>
> Key: SPARK-42679
> URL: https://issues.apache.org/jira/browse/SPARK-42679
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> spark.createDataFrame won't work with non-nullable schema as below:
> {code:java}
> from pyspark.sql.types import *
> schema_false = StructType([StructField("id", IntegerType(), False)])
> spark.createDataFrame([[1]], schema=schema_false)
> Traceback (most recent call last):
> ...
> pyspark.errors.exceptions.connect.AnalysisException: 
> [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's 
> required to be non-nullable.{code}
> whereas it works fine with nullable schema:
> {code:java}
> schema_true = StructType([StructField("id", IntegerType(), True)])
> spark.createDataFrame([[1]], schema=schema_true)
> DataFrame[id: int]{code}
>  






[jira] [Assigned] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42761:


Assignee: (was: Apache Spark)

> kubernetes-client from 6.4.1 to 6.5.0
> -
>
> Key: SPARK-42761
> URL: https://issues.apache.org/jira/browse/SPARK-42761
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0
> [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471]






[jira] [Assigned] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42761:


Assignee: Apache Spark

> kubernetes-client from 6.4.1 to 6.5.0
> -
>
> Key: SPARK-42761
> URL: https://issues.apache.org/jira/browse/SPARK-42761
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Major
>
> Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0
> [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471]






[jira] [Commented] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699269#comment-17699269
 ] 

Apache Spark commented on SPARK-42761:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/40381

> kubernetes-client from 6.4.1 to 6.5.0
> -
>
> Key: SPARK-42761
> URL: https://issues.apache.org/jira/browse/SPARK-42761
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0
> [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471]






[jira] [Created] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0

2023-03-11 Thread Bjørn Jørgensen (Jira)
Bjørn Jørgensen created SPARK-42761:
---

 Summary: kubernetes-client from 6.4.1 to 6.5.0
 Key: SPARK-42761
 URL: https://issues.apache.org/jira/browse/SPARK-42761
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.5.0
Reporter: Bjørn Jørgensen


Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0

[CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471]






[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699264#comment-17699264
 ] 

Apache Spark commented on SPARK-42760:
--

User '1511351836' has created a pull request for this issue:
https://github.com/apache/spark/pull/40380

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, 
> local mode
>Reporter: binyang
>Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1






[jira] [Assigned] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42760:


Assignee: Apache Spark

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, 
> local mode
>Reporter: binyang
>Assignee: Apache Spark
>Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1






[jira] [Assigned] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42760:


Assignee: (was: Apache Spark)

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, 
> local mode
>Reporter: binyang
>Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1






[jira] [Commented] (SPARK-42425) spark-hadoop-cloud is not provided in the default Spark distribution

2023-03-11 Thread Arseniy Tashoyan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699248#comment-17699248
 ] 

Arseniy Tashoyan commented on SPARK-42425:
--

The doc says to declare this dependency as provided, hence assumes this jar is 
bundled in the Spark distro. Either the doc is wrong or the distro is missing 
the lib.

> spark-hadoop-cloud is not provided in the default Spark distribution
> 
>
> Key: SPARK-42425
> URL: https://issues.apache.org/jira/browse/SPARK-42425
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.3.1
>Reporter: Arseniy Tashoyan
>Priority: Major
>
> The spark-hadoop-cloud library is absent from the default Spark distribution 
> (as are its dependencies, such as hadoop-aws). Therefore the dependency 
> management section described in [Integration with Cloud 
> Infrastructures|https://spark.apache.org/docs/3.3.1/cloud-integration.html#installation]
>  is invalid: the libraries for cloud integration are not actually provided.
> A naive workaround would be to add the spark-hadoop-cloud library as a 
> compile-scope dependency. However, this does not work because of Spark's 
> classpath hierarchy: the Spark system classloader does not see classes loaded 
> by the application classloader.
> Therefore a proper fix would be to enable the hadoop-cloud build profile by 
> default: -Phadoop-cloud






[jira] [Resolved] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists

2023-03-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-42759.
--
Resolution: Not A Problem

> Avoid duplicated `build/apache-maven` install when target already exists
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Assigned] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42759:


Assignee: (was: Apache Spark)

> Avoid duplicated `build/apache-maven` install when target already exists
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Assigned] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42759:


Assignee: Apache Spark

> Avoid duplicated `build/apache-maven` install when target already exists
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Updated] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists

2023-03-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42759:
-
Summary: Avoid duplicated `build/apache-maven` install when target already 
exists  (was: Avoid repeated install of `build/apache-maven`)

> Avoid duplicated `build/apache-maven` install when target already exists
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Reopened] (SPARK-42759) Avoid repeated install of `build/apache-maven`

2023-03-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reopened SPARK-42759:
--

> Avoid repeated install of `build/apache-maven`
> --
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Resolved] (SPARK-42759) Avoid repeated install of `build/apache-maven`

2023-03-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-42759.
--
Resolution: Not A Problem

> Avoid repeated install of `build/apache-maven`
> --
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Updated] (SPARK-42759) Avoid repeated install of `build/apache-maven`

2023-03-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42759:
-
Summary: Avoid repeated install of `build/apache-maven`  (was: Avoid 
repeated downloads of maven.tar.gz)

> Avoid repeated install of `build/apache-maven`
> --
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Created] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-11 Thread binyang (Jira)
binyang created SPARK-42760:
---

 Summary: The partition of result data frame of join is always 1
 Key: SPARK-42760
 URL: https://issues.apache.org/jira/browse/SPARK-42760
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.3.2
 Environment: standard Spark 3.0.3/3.3.2, used in a Jupyter notebook, 
local mode
Reporter: binyang


I am using PySpark. The DataFrame produced by a join always has 1 partition.

Here is my code, adapted from 
https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join

 

print(spark.version)

def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
    spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
    df1 = spark.range(1, 1000).repartition(data_partitions)
    df2 = spark.range(1, 2000).repartition(data_partitions)
    df3 = spark.range(1, 3000).repartition(data_partitions)

    print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
    print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))

    df = (df1.join(df2, df1.id == df2.id)
          .join(df3, df1.id == df3.id))

    print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))

example_shuffle_partitions()

 


In Spark 3.0.3, it prints out:
3.0.3
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 4


However, in the latest version, 3.3.2, it prints out:
3.3.2
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 1
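
One possible explanation, not confirmed anywhere in this report, is adaptive query execution (AQE): it is enabled by default from Spark 3.2.0 and can coalesce small shuffle partitions after a join. The sketch below reruns the same join with AQE switched on or off so the two counts can be compared. The function name `join_partitions` and its `aqe_enabled` parameter are made up for illustration; the configuration keys are the standard Spark SQL ones.

```python
# Sketch only: checks whether adaptive query execution (AQE) could explain
# the single post-join partition. That AQE is the cause is an assumption;
# the report itself does not say so.

def join_partitions(spark, aqe_enabled, data_partitions=10, shuffle_partitions=4):
    """Return the partition count of the three-way join from the report.

    `spark` is an existing SparkSession (for example the one a notebook
    provides). `join_partitions` is a hypothetical helper, not a Spark API.
    """
    # Config values are passed as strings, which every Spark version accepts.
    spark.conf.set("spark.sql.adaptive.enabled", str(aqe_enabled).lower())
    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    # Same inputs as the reporter's example.
    df1 = spark.range(1, 1000).repartition(data_partitions)
    df2 = spark.range(1, 2000).repartition(data_partitions)
    df3 = spark.range(1, 3000).repartition(data_partitions)

    joined = df1.join(df2, df1.id == df2.id).join(df3, df1.id == df3.id)
    return joined.rdd.getNumPartitions()
```

In the notebook, comparing `join_partitions(spark, aqe_enabled=False)` with `join_partitions(spark, aqe_enabled=True)` on 3.3.2 would show whether the count depends on AQE; if it does, that would suggest partition coalescing rather than a change in join behavior.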






[jira] [Resolved] (SPARK-42747) Fix incorrect internal status of LoR and AFT

2023-03-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42747.
--
Fix Version/s: 3.2.4, 3.4.1, 3.5.0, 3.3.2
     Assignee: Ruifeng Zheng
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40367

> Fix incorrect internal status of LoR and AFT
> 
>
> Key: SPARK-42747
> URL: https://issues.apache.org/jira/browse/SPARK-42747
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.2.4, 3.4.1, 3.5.0, 3.3.2
>
>
> LoR and AFT use internal status to optimize prediction/transform, but the 
> status is not correctly updated in some cases:
> {code:java}
> from pyspark.sql import Row
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.linalg import Vectors
> # (label, weight, features) rows used to fit the model
> df = spark.createDataFrame(
>     [
>         (1.0, 1.0, Vectors.dense(0.0, 5.0)),
>         (0.0, 2.0, Vectors.dense(1.0, 2.0)),
>         (1.0, 3.0, Vectors.dense(2.0, 1.0)),
>         (0.0, 4.0, Vectors.dense(3.0, 3.0)),
>     ],
>     ["label", "weight", "features"],
> )
> lor = LogisticRegression(weightCol="weight")
> model = lor.fit(df)
> # status changes 1
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
>     model.setThreshold(t).transform(df)
> # status changes 2
> [model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, 0.2, 0.5, 1.0]]
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
>     print(t)
>     model.setThreshold(t).transform(df).show()
> #  <- error results
> {code}
> results:
> {code:java}
> 0.0
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
> +-----+------+---------+--------------------+--------------------+----------+
> 0.1
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
> +-----+------+---------+--------------------+--------------------+----------+
> 0.2
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
> +-----+------+---------+--------------------+--------------------+----------+
> 0.5
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
> +-----+------+---------+--------------------+--------------------+----------+
> 1.0
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...| 
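
For reference, Spark documents the binary threshold rule as: predict 1 when the estimated probability of class 1 is greater than the threshold, otherwise 0. A minimal pure-Python sketch of that rule, applied to the probabilities printed above (truncated to four decimals, and assuming the first vector element is P(class 0)), shows what the prediction column would be expected to contain. The names `P_CLASS0`, `expected_prediction`, and `expected_column` are illustrative and not part of Spark.

```python
# Sketch only: the documented binary decision rule, not Spark's implementation.
# P_CLASS0 holds the first element of each probability vector in the quoted
# tables, truncated to four decimals (assumed to be P(class 0)).
P_CLASS0 = [0.5273, 0.2969, 0.4101, 0.9121]


def expected_prediction(p_class1, threshold):
    """Predict 1.0 when P(class 1) is strictly greater than the threshold."""
    return 1.0 if p_class1 > threshold else 0.0


def expected_column(threshold):
    """Expected prediction column for the four rows at a given threshold."""
    return [expected_prediction(1.0 - p0, threshold) for p0 in P_CLASS0]


# Print the expected column for each threshold used in the report.
for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
    print(t, expected_column(t))
```

Under this rule the prediction column should change with the threshold (all 1.0 at threshold 0.0, all 0.0 at threshold 1.0), whereas every visible row in the quoted tables shows 0.0.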

[jira] [Commented] (SPARK-42759) Avoid repeated downloads of maven.tar.gz

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699229#comment-17699229
 ] 

Apache Spark commented on SPARK-42759:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40379

> Avoid repeated downloads of maven.tar.gz
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Assigned] (SPARK-42759) Avoid repeated downloads of maven.tar.gz

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42759:


Assignee: (was: Apache Spark)

> Avoid repeated downloads of maven.tar.gz
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Assigned] (SPARK-42759) Avoid repeated downloads of maven.tar.gz

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42759:


Assignee: Apache Spark

> Avoid repeated downloads of maven.tar.gz
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-42759) Avoid repeated downloads of maven.tar.gz

2023-03-11 Thread Yang Jie (Jira)
Yang Jie created SPARK-42759:


 Summary: Avoid repeated downloads of maven.tar.gz
 Key: SPARK-42759
 URL: https://issues.apache.org/jira/browse/SPARK-42759
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0, 3.5.0
Reporter: Yang Jie









[jira] [Resolved] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42670.
--
Fix Version/s: 3.5.0
 Assignee: BingKun Pan
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40278

> Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
> 
>
> Key: SPARK-42670
> URL: https://issues.apache.org/jira/browse/SPARK-42670
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 3.5.0
>
>







[jira] [Updated] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings

2023-03-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-42670:
-
Priority: Trivial  (was: Minor)

> Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
> 
>
> Key: SPARK-42670
> URL: https://issues.apache.org/jira/browse/SPARK-42670
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Trivial
>







[jira] [Assigned] (SPARK-42758) Remove dependency on breeze

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42758:


Assignee: Apache Spark

> Remove dependency on breeze
> ---
>
> Key: SPARK-42758
> URL: https://issues.apache.org/jira/browse/SPARK-42758
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Assigned] (SPARK-42758) Remove dependency on breeze

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42758:


Assignee: (was: Apache Spark)

> Remove dependency on breeze
> ---
>
> Key: SPARK-42758
> URL: https://issues.apache.org/jira/browse/SPARK-42758
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Commented] (SPARK-42758) Remove dependency on breeze

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699214#comment-17699214
 ] 

Apache Spark commented on SPARK-42758:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40378

> Remove dependency on breeze
> ---
>
> Key: SPARK-42758
> URL: https://issues.apache.org/jira/browse/SPARK-42758
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-42758) Remove dependency on breeze

2023-03-11 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-42758:
---

 Summary: Remove dependency on breeze
 Key: SPARK-42758
 URL: https://issues.apache.org/jira/browse/SPARK-42758
 Project: Spark
  Issue Type: Improvement
  Components: Build, MLlib
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Commented] (SPARK-42669) Short circuit local relation rpcs

2023-03-11 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699213#comment-17699213
 ] 

jiaan.geng commented on SPARK-42669:


I will take a look!

> Short circuit local relation rpcs
> -
>
> Key: SPARK-42669
> URL: https://issues.apache.org/jira/browse/SPARK-42669
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Operations on LocalRelation can mostly be done locally (without sending 
> rpcs). We should leverage this.






[jira] [Commented] (SPARK-42757) Implement textFile for DataFrameReader

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699204#comment-17699204
 ] 

Apache Spark commented on SPARK-42757:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40377

> Implement textFile for DataFrameReader
> --
>
> Key: SPARK-42757
> URL: https://issues.apache.org/jira/browse/SPARK-42757
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-42757) Implement textFile for DataFrameReader

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42757:


Assignee: Apache Spark

> Implement textFile for DataFrameReader
> --
>
> Key: SPARK-42757
> URL: https://issues.apache.org/jira/browse/SPARK-42757
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Assigned] (SPARK-42757) Implement textFile for DataFrameReader

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42757:


Assignee: (was: Apache Spark)

> Implement textFile for DataFrameReader
> --
>
> Key: SPARK-42757
> URL: https://issues.apache.org/jira/browse/SPARK-42757
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42757) Implement textFile for DataFrameReader

2023-03-11 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-42757:
---

 Summary: Implement textFile for DataFrameReader
 Key: SPARK-42757
 URL: https://issues.apache.org/jira/browse/SPARK-42757
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.4.1
Reporter: BingKun Pan









[jira] [Resolved] (SPARK-42691) Implement Dataset.semanticHash

2023-03-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42691.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40366
[https://github.com/apache/spark/pull/40366]

> Implement Dataset.semanticHash
> --
>
> Key: SPARK-42691
> URL: https://issues.apache.org/jira/browse/SPARK-42691
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
> Fix For: 3.4.0
>
>
> Implement Dataset.semanticHash:
> {code:java}
> /**
>  * Returns a `hashCode` of the logical query plan against this [[Dataset]].
>  *
>  * @note Unlike the standard `hashCode`, the hash is calculated against the query plan
>  * simplified by tolerating the cosmetic differences such as attribute names.
>  * @since 3.4.0
>  */
> @DeveloperApi
> def semanticHash(): Int
> {code}
> This has to be computed on the Spark Connect server. Please extend the 
> AnalyzePlanRequest and AnalyzePlanResponse messages for this.
> Also make sure this works in PySpark.






[jira] [Assigned] (SPARK-42756) Helper function to convert proto literal to value in Python Client

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42756:


Assignee: Apache Spark

> Helper function to convert proto literal to value in Python Client
> --
>
> Key: SPARK-42756
> URL: https://issues.apache.org/jira/browse/SPARK-42756
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-42756) Helper function to convert proto literal to value in Python Client

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42756:


Assignee: (was: Apache Spark)

> Helper function to convert proto literal to value in Python Client
> --
>
> Key: SPARK-42756
> URL: https://issues.apache.org/jira/browse/SPARK-42756
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42756) Helper function to convert proto literal to value in Python Client

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699196#comment-17699196
 ] 

Apache Spark commented on SPARK-42756:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40376

> Helper function to convert proto literal to value in Python Client
> --
>
> Key: SPARK-42756
> URL: https://issues.apache.org/jira/browse/SPARK-42756
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42756) Helper function to convert proto literal to value in Python Client

2023-03-11 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42756:
-

 Summary: Helper function to convert proto literal to value in 
Python Client
 Key: SPARK-42756
 URL: https://issues.apache.org/jira/browse/SPARK-42756
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng





