[jira] [Commented] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4
[ https://issues.apache.org/jira/browse/SPARK-42763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699298#comment-17699298 ] Apache Spark commented on SPARK-42763: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40384 > Upgrade ZooKeeper from 3.6.3 to 3.6.4 > - > > Key: SPARK-42763 > URL: https://issues.apache.org/jira/browse/SPARK-42763 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4
[ https://issues.apache.org/jira/browse/SPARK-42763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42763. --- Resolution: Duplicate > Upgrade ZooKeeper from 3.6.3 to 3.6.4 > - > > Key: SPARK-42763 > URL: https://issues.apache.org/jira/browse/SPARK-42763 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4
[ https://issues.apache.org/jira/browse/SPARK-42763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-42763. - > Upgrade ZooKeeper from 3.6.3 to 3.6.4 > - > > Key: SPARK-42763 > URL: https://issues.apache.org/jira/browse/SPARK-42763 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42762. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40383 [https://github.com/apache/spark/pull/40383] > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.5.0 > > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42762: - Assignee: Holden Karau > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4
BingKun Pan created SPARK-42763: --- Summary: Upgrade ZooKeeper from 3.6.3 to 3.6.4 Key: SPARK-42763 URL: https://issues.apache.org/jira/browse/SPARK-42763 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42758) Remove dependency on breeze
[ https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42758. -- Fix Version/s: 3.5.0 Assignee: BingKun Pan Resolution: Fixed Resolved by https://github.com/apache/spark/pull/40378 > Remove dependency on breeze > --- > > Key: SPARK-42758 > URL: https://issues.apache.org/jira/browse/SPARK-42758 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42758) Remove dependency on breeze
[ https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42758: - Priority: Trivial (was: Minor) > Remove dependency on breeze > --- > > Key: SPARK-42758 > URL: https://issues.apache.org/jira/browse/SPARK-42758 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699291#comment-17699291 ]

Yuming Wang commented on SPARK-42760:
-------------------------------------

Could you try to disable AQE(set spark.sql.adaptive.enabled = false)?

> The partition of result data frame of join is always 1
> ------------------------------------------------------
>
>                 Key: SPARK-42760
>                 URL: https://issues.apache.org/jira/browse/SPARK-42760
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.3.2
>         Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, local mode
>            Reporter: binyang
>            Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1
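The suggestion above (disable AQE and re-check the partition count) could be tried with a small helper like the following. This is only a sketch built from the reporter's snippet: the function name `join_partition_count` and its parameters are hypothetical, and `spark` is assumed to be an existing SparkSession supplied by the caller. The thread does not confirm what the result will be with AQE disabled.

```python
def join_partition_count(spark, adaptive_enabled,
                         data_partitions=10, shuffle_partitions=4):
    """Return the partition count of the three-way join from the report.

    `spark` is assumed to be an existing SparkSession; this helper
    (hypothetical, not part of Spark) only toggles settings and counts.
    """
    # Toggle Adaptive Query Execution, as suggested in the comment.
    spark.conf.set("spark.sql.adaptive.enabled",
                   "true" if adaptive_enabled else "false")
    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))
    # Disable broadcast joins so both joins shuffle, as in the report.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    df1 = spark.range(1, 1000).repartition(data_partitions)
    df2 = spark.range(1, 2000).repartition(data_partitions)
    df3 = spark.range(1, 3000).repartition(data_partitions)

    joined = (df1.join(df2, df1.id == df2.id)
                 .join(df3, df1.id == df3.id))
    return joined.rdd.getNumPartitions()


# Possible usage in a notebook where `spark` already exists:
#   for enabled in (True, False):
#       print(enabled, join_partition_count(spark, enabled))
```

Comparing the two counts on the same Spark version would show whether AQE accounts for the difference the reporter sees between 3.0.3 and 3.3.2.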
[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42762: Assignee: (was: Apache Spark) > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Priority: Minor > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42762: Assignee: Apache Spark > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699280#comment-17699280 ] Apache Spark commented on SPARK-42762: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/40383 > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Priority: Minor > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42762) Improve Logging for disconnects during exec id request
Holden Karau created SPARK-42762: Summary: Improve Logging for disconnects during exec id request Key: SPARK-42762 URL: https://issues.apache.org/jira/browse/SPARK-42762 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.5.0 Reporter: Holden Karau Improve Logging for disconnects during exec id request to simplify our network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42679) createDataFrame doesn't work with non-nullable schema.
[ https://issues.apache.org/jira/browse/SPARK-42679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699271#comment-17699271 ]

Apache Spark commented on SPARK-42679:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40382

> createDataFrame doesn't work with non-nullable schema.
> ------------------------------------------------------
>
>                 Key: SPARK-42679
>                 URL: https://issues.apache.org/jira/browse/SPARK-42679
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> spark.createDataFrame won't work with non-nullable schema as below:
> {code:java}
> from pyspark.sql.types import *
> schema_false = StructType([StructField("id", IntegerType(), False)])
> spark.createDataFrame([[1]], schema=schema_false)
> Traceback (most recent call last):
> ...
> pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable.{code}
> whereas it works fine with nullable schema:
> {code:java}
> schema_true = StructType([StructField("id", IntegerType(), True)])
> spark.createDataFrame([[1]], schema=schema_true)
> DataFrame[id: int]{code}
[jira] [Assigned] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0
[ https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42761: Assignee: (was: Apache Spark) > kubernetes-client from 6.4.1 to 6.5.0 > - > > Key: SPARK-42761 > URL: https://issues.apache.org/jira/browse/SPARK-42761 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0 > [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0
[ https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42761: Assignee: Apache Spark > kubernetes-client from 6.4.1 to 6.5.0 > - > > Key: SPARK-42761 > URL: https://issues.apache.org/jira/browse/SPARK-42761 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0 > [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0
[ https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699269#comment-17699269 ] Apache Spark commented on SPARK-42761: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/40381 > kubernetes-client from 6.4.1 to 6.5.0 > - > > Key: SPARK-42761 > URL: https://issues.apache.org/jira/browse/SPARK-42761 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0 > [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0
Bjørn Jørgensen created SPARK-42761: --- Summary: kubernetes-client from 6.4.1 to 6.5.0 Key: SPARK-42761 URL: https://issues.apache.org/jira/browse/SPARK-42761 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 3.5.0 Reporter: Bjørn Jørgensen Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0 [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699264#comment-17699264 ]

Apache Spark commented on SPARK-42760:
--------------------------------------

User '1511351836' has created a pull request for this issue:
https://github.com/apache/spark/pull/40380

> The partition of result data frame of join is always 1
> ------------------------------------------------------
>
>                 Key: SPARK-42760
>                 URL: https://issues.apache.org/jira/browse/SPARK-42760
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.3.2
>         Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, local mode
>            Reporter: binyang
>            Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1
[jira] [Assigned] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42760:
------------------------------------

    Assignee: Apache Spark

> The partition of result data frame of join is always 1
> ------------------------------------------------------
>
>                 Key: SPARK-42760
>                 URL: https://issues.apache.org/jira/browse/SPARK-42760
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.3.2
>         Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, local mode
>            Reporter: binyang
>            Assignee: Apache Spark
>            Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1
[jira] [Assigned] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42760:
------------------------------------

    Assignee:     (was: Apache Spark)

> The partition of result data frame of join is always 1
> ------------------------------------------------------
>
>                 Key: SPARK-42760
>                 URL: https://issues.apache.org/jira/browse/SPARK-42760
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.3.2
>         Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, local mode
>            Reporter: binyang
>            Priority: Major
>
> I am using pyspark. The partition of result data frame of join is always 1.
> Here is my code from
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>
> print(spark.version)
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> example_shuffle_partitions()
>
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1
[jira] [Commented] (SPARK-42425) spark-hadoop-cloud is not provided in the default Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-42425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699248#comment-17699248 ]

Arseniy Tashoyan commented on SPARK-42425:
------------------------------------------

The doc says to declare this dependency as provided, hence assumes this jar is bundled in the Spark distro. Either the doc is wrong or the distro is missing the lib.

> spark-hadoop-cloud is not provided in the default Spark distribution
> --------------------------------------------------------------------
>
>                 Key: SPARK-42425
>                 URL: https://issues.apache.org/jira/browse/SPARK-42425
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.3.1
>            Reporter: Arseniy Tashoyan
>            Priority: Major
>
> The library spark-hadoop-cloud is absent in the default Spark distribution
> (as well as its dependencies like hadoop-aws). Therefore the dependency
> management section described in [Integration with Cloud
> Infrastructures|https://spark.apache.org/docs/3.3.1/cloud-integration.html#installation]
> is invalid. Actually the libraries for cloud integration are not provided.
> A naive workaround would be to add the spark-hadoop-cloud library as a
> compile-scope dependency. However, this does not work due to Spark classpath
> hierarchy. Spark system classloader does not see classes loaded by the
> application classloader.
> Therefore a proper fix would be to enable the hadoop-cloud build profile by
> default: -Phadoop-cloud
[jira] [Resolved] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-42759. -- Resolution: Not A Problem > Avoid duplicated `build/apache-maven` install when target already exists > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42759: Assignee: (was: Apache Spark) > Avoid duplicated `build/apache-maven` install when target already exists > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42759: Assignee: Apache Spark > Avoid duplicated `build/apache-maven` install when target already exists > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42759: - Summary: Avoid duplicated `build/apache-maven` install when target already exists (was: Avoid repeated install of `build/apache-maven`) > Avoid duplicated `build/apache-maven` install when target already exists > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-42759) Avoid repeated install of `build/apache-maven`
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reopened SPARK-42759: -- > Avoid repeated install of `build/apache-maven` > -- > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42759) Avoid repeated install of `build/apache-maven`
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-42759. -- Resolution: Not A Problem > Avoid repeated install of `build/apache-maven` > -- > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42759) Avoid repeated install of `build/apache-maven`
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42759: - Summary: Avoid repeated install of `build/apache-maven` (was: Avoid repeated downloads of maven.tar.gz) > Avoid repeated install of `build/apache-maven` > -- > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42760) The partition of result data frame of join is always 1
binyang created SPARK-42760:
-------------------------------

             Summary: The partition of result data frame of join is always 1
                 Key: SPARK-42760
                 URL: https://issues.apache.org/jira/browse/SPARK-42760
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 3.3.2
         Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, local mode
            Reporter: binyang


I am using pyspark. The partition of result data frame of join is always 1.

Here is my code from
https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join

print(spark.version)
def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
    spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
    df1 = spark.range(1, 1000).repartition(data_partitions)
    df2 = spark.range(1, 2000).repartition(data_partitions)
    df3 = spark.range(1, 3000).repartition(data_partitions)
    print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
    print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
    df = (df1.join(df2, df1.id == df2.id)
          .join(df3, df1.id == df3.id))
    print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
example_shuffle_partitions()

In Spark 3.0.3, it prints out:

3.0.3
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 4

However, it prints out the following in the latest 3.3.2

3.3.2
Data partitions is: 10. Shuffle partitions is 4
Data partitions before join: 10
Data partitions after join : 1
[jira] [Resolved] (SPARK-42747) Fix incorrect internal status of LoR and AFT
[ https://issues.apache.org/jira/browse/SPARK-42747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-42747.
----------------------------------
Fix Version/s: 3.2.4, 3.4.1, 3.5.0, 3.3.2
Assignee: Ruifeng Zheng
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40367

> Fix incorrect internal status of LoR and AFT
> --------------------------------------------
>
> Key: SPARK-42747
> URL: https://issues.apache.org/jira/browse/SPARK-42747
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.2.4, 3.4.1, 3.5.0, 3.3.2
>
>
> LoR and AFT apply an internal status to optimize prediction/transform, but the
> status is not correctly updated in some cases:
> {code:java}
> from pyspark.sql import Row
> from pyspark.ml.classification import *
> from pyspark.ml.linalg import Vectors
> df = spark.createDataFrame(
>     [
>         (1.0, 1.0, Vectors.dense(0.0, 5.0)),
>         (0.0, 2.0, Vectors.dense(1.0, 2.0)),
>         (1.0, 3.0, Vectors.dense(2.0, 1.0)),
>         (0.0, 4.0, Vectors.dense(3.0, 3.0)),
>     ],
>     ["label", "weight", "features"],
> )
> lor = LogisticRegression(weightCol="weight")
> model = lor.fit(df)
> # status changes 1
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
>     model.setThreshold(t).transform(df)
> # status changes 2
> [model.setThreshold(t).predict(Vectors.dense(0.0, 5.0)) for t in [0.0, 0.1, 0.2, 0.5, 1.0]]
> for t in [0.0, 0.1, 0.2, 0.5, 1.0]:
>     print(t)
>     model.setThreshold(t).transform(df).show()
>     # <- error results
> {code}
> results:
> {code:java}
> 0.0
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
> +-----+------+---------+--------------------+--------------------+----------+
> 0.1
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
> +-----+------+---------+--------------------+--------------------+----------+
> 0.2
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
> +-----+------+---------+--------------------+--------------------+----------+
> 0.5
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|       0.0|
> |  0.0|   4.0|[3.0,3.0]|[2.33975176373760...|[0.91211618852612...|       0.0|
> +-----+------+---------+--------------------+--------------------+----------+
> 1.0
> +-----+------+---------+--------------------+--------------------+----------+
> |label|weight| features|       rawPrediction|         probability|prediction|
> +-----+------+---------+--------------------+--------------------+----------+
> |  1.0|   1.0|[0.0,5.0]|[0.10932013376341...|[0.52730284774069...|       0.0|
> |  0.0|   2.0|[1.0,2.0]|[-0.8619624039359...|[0.29692950635762...|       0.0|
> |  1.0|   3.0|[2.0,1.0]|[-0.3634508721860...|[0.41012446452385...|
[jira] [Commented] (SPARK-42759) Avoid repeated downloads of maven.tar.gz
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699229#comment-17699229 ] Apache Spark commented on SPARK-42759: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40379 > Avoid repeated downloads of maven.tar.gz > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42759) Avoid repeated downloads of maven.tar.gz
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699228#comment-17699228 ] Apache Spark commented on SPARK-42759: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40379 > Avoid repeated downloads of maven.tar.gz > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42759) Avoid repeated downloads of maven.tar.gz
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42759: Assignee: (was: Apache Spark) > Avoid repeated downloads of maven.tar.gz > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42759) Avoid repeated downloads of maven.tar.gz
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42759: Assignee: Apache Spark > Avoid repeated downloads of maven.tar.gz > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42759) Avoid repeated downloads of maven.tar.gz
Yang Jie created SPARK-42759: Summary: Avoid repeated downloads of maven.tar.gz Key: SPARK-42759 URL: https://issues.apache.org/jira/browse/SPARK-42759 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0, 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
[ https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42670. -- Fix Version/s: 3.5.0 Assignee: BingKun Pan Resolution: Fixed Resolved by https://github.com/apache/spark/pull/40278 > Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings > > > Key: SPARK-42670 > URL: https://issues.apache.org/jira/browse/SPARK-42670 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42670) Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings
[ https://issues.apache.org/jira/browse/SPARK-42670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42670: - Priority: Trivial (was: Minor) > Upgrade maven-surefire-plugin to 3.0.0-M9 & eliminate build warnings > > > Key: SPARK-42670 > URL: https://issues.apache.org/jira/browse/SPARK-42670 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42758) Remove dependency on breeze
[ https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42758: Assignee: Apache Spark > Remove dependency on breeze > --- > > Key: SPARK-42758 > URL: https://issues.apache.org/jira/browse/SPARK-42758 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42758) Remove dependency on breeze
[ https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42758: Assignee: (was: Apache Spark) > Remove dependency on breeze > --- > > Key: SPARK-42758 > URL: https://issues.apache.org/jira/browse/SPARK-42758 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42758) Remove dependency on breeze
[ https://issues.apache.org/jira/browse/SPARK-42758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699214#comment-17699214 ] Apache Spark commented on SPARK-42758: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40378 > Remove dependency on breeze > --- > > Key: SPARK-42758 > URL: https://issues.apache.org/jira/browse/SPARK-42758 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42758) Remove dependency on breeze
BingKun Pan created SPARK-42758: --- Summary: Remove dependency on breeze Key: SPARK-42758 URL: https://issues.apache.org/jira/browse/SPARK-42758 Project: Spark Issue Type: Improvement Components: Build, MLlib Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42669) Short circuit local relation rpcs
[ https://issues.apache.org/jira/browse/SPARK-42669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699213#comment-17699213 ] jiaan.geng commented on SPARK-42669: I will take a look! > Short circuit local relation rpcs > - > > Key: SPARK-42669 > URL: https://issues.apache.org/jira/browse/SPARK-42669 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Operations on LocalRelation can mostly be done locally (without sending > rpcs). We should leverage this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42757) Implement textFile for DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699204#comment-17699204 ] Apache Spark commented on SPARK-42757: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40377 > Implement textFile for DataFrameReader > -- > > Key: SPARK-42757 > URL: https://issues.apache.org/jira/browse/SPARK-42757 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42757) Implement textFile for DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699203#comment-17699203 ] Apache Spark commented on SPARK-42757: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40377 > Implement textFile for DataFrameReader > -- > > Key: SPARK-42757 > URL: https://issues.apache.org/jira/browse/SPARK-42757 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42757) Implement textFile for DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42757: Assignee: Apache Spark > Implement textFile for DataFrameReader > -- > > Key: SPARK-42757 > URL: https://issues.apache.org/jira/browse/SPARK-42757 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42757) Implement textFile for DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42757: Assignee: (was: Apache Spark) > Implement textFile for DataFrameReader > -- > > Key: SPARK-42757 > URL: https://issues.apache.org/jira/browse/SPARK-42757 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42757) Implement textFile for DataFrameReader
BingKun Pan created SPARK-42757: --- Summary: Implement textFile for DataFrameReader Key: SPARK-42757 URL: https://issues.apache.org/jira/browse/SPARK-42757 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.1 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42691) Implement Dataset.semanticHash
[ https://issues.apache.org/jira/browse/SPARK-42691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-42691.
-----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 40366
[https://github.com/apache/spark/pull/40366]

> Implement Dataset.semanticHash
> ------------------------------
>
> Key: SPARK-42691
> URL: https://issues.apache.org/jira/browse/SPARK-42691
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Herman van Hövell
> Priority: Major
> Fix For: 3.4.0
>
>
> Implement Dataset.semanticHash:
> {code:java}
> /**
>  * Returns a `hashCode` of the logical query plan against this [[Dataset]].
>  *
>  * @note Unlike the standard `hashCode`, the hash is calculated against the query plan
>  * simplified by tolerating the cosmetic differences such as attribute names.
>  * @since 3.4.0
>  */
> @DeveloperApi
> def semanticHash(): Int
> {code}
> This has to be computed on the Spark Connect server. Please extend the
> AnalyzePlanRequest and AnalyzePlanResponse messages for this.
> Also make sure this works in PySpark.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
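For the PySpark side mentioned in SPARK-42691, a minimal usage sketch may help show what "tolerating cosmetic differences" means in practice. It relies on the existing `DataFrame.semanticHash()` method in PySpark; the helper names `same_semantic_hash` and `compare_plans` are hypothetical, and the sketch says nothing about how the Connect server computes the value.

```python
def same_semantic_hash(df_a, df_b):
    """Return True when both DataFrames report the same semantic hash."""
    return df_a.semanticHash() == df_b.semanticHash()


def compare_plans(spark):
    """Compare plans that differ only cosmetically with plans that differ in logic."""
    df1 = spark.range(10)
    df2 = spark.range(10)

    doubled = df1.withColumn("col1", df1.id * 2)
    renamed = df2.withColumn("col0", df2.id * 2)  # same expression, different column name
    shifted = df2.withColumn("col1", df2.id + 2)  # different expression

    return {
        "cosmetic_difference_only": same_semantic_hash(doubled, renamed),
        "different_expression": same_semantic_hash(doubled, shifted),
    }
```

Calling `compare_plans(spark)` against a live session returns the two comparisons side by side. Equal hashes do not prove that two plans are equivalent, since unrelated plans can collide, so this is better treated as a quick filter than as a proof.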
[jira] [Assigned] (SPARK-42756) Helper function to convert proto literal to value in Python Client
[ https://issues.apache.org/jira/browse/SPARK-42756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42756: Assignee: Apache Spark > Helper function to convert proto literal to value in Python Client > -- > > Key: SPARK-42756 > URL: https://issues.apache.org/jira/browse/SPARK-42756 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42756) Helper function to convert proto literal to value in Python Client
[ https://issues.apache.org/jira/browse/SPARK-42756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42756: Assignee: (was: Apache Spark) > Helper function to convert proto literal to value in Python Client > -- > > Key: SPARK-42756 > URL: https://issues.apache.org/jira/browse/SPARK-42756 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42756) Helper function to convert proto literal to value in Python Client
[ https://issues.apache.org/jira/browse/SPARK-42756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699196#comment-17699196 ] Apache Spark commented on SPARK-42756: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40376 > Helper function to convert proto literal to value in Python Client > -- > > Key: SPARK-42756 > URL: https://issues.apache.org/jira/browse/SPARK-42756 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42756) Helper function to convert proto literal to value in Python Client
Ruifeng Zheng created SPARK-42756: - Summary: Helper function to convert proto literal to value in Python Client Key: SPARK-42756 URL: https://issues.apache.org/jira/browse/SPARK-42756 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org