[jira] [Commented] (SPARK-41961) Support table-valued functions with LATERAL
[ https://issues.apache.org/jira/browse/SPARK-41961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656493#comment-17656493 ] Apache Spark commented on SPARK-41961: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39479 > Support table-valued functions with LATERAL > --- > > Key: SPARK-41961 > URL: https://issues.apache.org/jira/browse/SPARK-41961 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Support table-valued functions with the LATERAL subquery. For example: > {{select * from t, lateral explode(array(t.c1, t.c2))}} > Currently, this query throws a parse exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
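For context on the ticket above: Spark already parses generator functions via LATERAL VIEW; what SPARK-41961 adds is the comma-join LATERAL form quoted in the description. A minimal PySpark sketch, with an illustrative two-column table {{t}} that is not from the ticket itself:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, 2)], ["c1", "c2"]).createOrReplaceTempView("t")

# Already supported: LATERAL VIEW with a generator function.
spark.sql("SELECT * FROM t LATERAL VIEW explode(array(c1, c2)) tmp AS col").show()

try:
    # The form this ticket adds; at the time of the report it fails to parse.
    spark.sql("SELECT * FROM t, LATERAL explode(array(t.c1, t.c2))").show()
except Exception as e:
    print(e)
{code}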
[jira] [Assigned] (SPARK-41961) Support table-valued functions with LATERAL
[ https://issues.apache.org/jira/browse/SPARK-41961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41961: Assignee: Apache Spark > Support table-valued functions with LATERAL > --- > > Key: SPARK-41961 > URL: https://issues.apache.org/jira/browse/SPARK-41961 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > Support table-valued functions with the LATERAL subquery. For example: > {{select * from t, lateral explode(array(t.c1, t.c2))}} > Currently, this query throws a parse exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41961) Support table-valued functions with LATERAL
[ https://issues.apache.org/jira/browse/SPARK-41961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41961: Assignee: (was: Apache Spark) > Support table-valued functions with LATERAL > --- > > Key: SPARK-41961 > URL: https://issues.apache.org/jira/browse/SPARK-41961 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Support table-valued functions with the LATERAL subquery. For example: > {{select * from t, lateral explode(array(t.c1, t.c2))}} > Currently, this query throws a parse exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41961) Support table-valued functions with LATERAL
[ https://issues.apache.org/jira/browse/SPARK-41961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656494#comment-17656494 ] Apache Spark commented on SPARK-41961: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39479 > Support table-valued functions with LATERAL > --- > > Key: SPARK-41961 > URL: https://issues.apache.org/jira/browse/SPARK-41961 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Support table-valued functions with the LATERAL subquery. For example: > {{select * from t, lateral explode(array(t.c1, t.c2))}} > Currently, this query throws a parse exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41962) Update the import order of class SpecificParquetRecordReaderBase
shuyouZZ created SPARK-41962: Summary: Update the import order of class SpecificParquetRecordReaderBase Key: SPARK-41962 URL: https://issues.apache.org/jira/browse/SPARK-41962 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: shuyouZZ Fix For: 3.4.0 There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: the import order of the scala package is not correct. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41962) Update the import order of scala package in class SpecificParquetRecordReaderBase
[ https://issues.apache.org/jira/browse/SPARK-41962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shuyouZZ updated SPARK-41962: - Summary: Update the import order of scala package in class SpecificParquetRecordReaderBase (was: Update the import order of class SpecificParquetRecordReaderBase) > Update the import order of scala package in class > SpecificParquetRecordReaderBase > - > > Key: SPARK-41962 > URL: https://issues.apache.org/jira/browse/SPARK-41962 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: shuyouZZ >Priority: Major > Fix For: 3.4.0 > > > There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: > the import order of the scala package is not correct. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41960) Assign name to _LEGACY_ERROR_TEMP_1056
[ https://issues.apache.org/jira/browse/SPARK-41960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41960: Assignee: Apache Spark > Assign name to _LEGACY_ERROR_TEMP_1056 > -- > > Key: SPARK-41960 > URL: https://issues.apache.org/jira/browse/SPARK-41960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Assign name to _LEGACY_ERROR_TEMP_1056 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41960) Assign name to _LEGACY_ERROR_TEMP_1056
[ https://issues.apache.org/jira/browse/SPARK-41960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41960: Assignee: (was: Apache Spark) > Assign name to _LEGACY_ERROR_TEMP_1056 > -- > > Key: SPARK-41960 > URL: https://issues.apache.org/jira/browse/SPARK-41960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Assign name to _LEGACY_ERROR_TEMP_1056 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41960) Assign name to _LEGACY_ERROR_TEMP_1056
[ https://issues.apache.org/jira/browse/SPARK-41960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656498#comment-17656498 ] Apache Spark commented on SPARK-41960: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/39480 > Assign name to _LEGACY_ERROR_TEMP_1056 > -- > > Key: SPARK-41960 > URL: https://issues.apache.org/jira/browse/SPARK-41960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Assign name to _LEGACY_ERROR_TEMP_1056 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41872) Fix DataFrame createDataframe handling of None
[ https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41872: Assignee: Ruifeng Zheng > Fix DataFrame createDataframe handling of None > -- > > Key: SPARK-41872 > URL: https://issues.apache.org/jira/browse/SPARK-41872 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > row = self.spark.createDataFrame([("Alice", None, None, None)], > schema).fillna(True).first() > self.assertEqual(row.age, None){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 231, in test_fillna > self.assertEqual(row.age, None) > AssertionError: nan != None{code} > > {code:java} > row = ( > self.spark.createDataFrame([("Alice", 10, None)], schema) > .replace(10, 20, subset=["name", "height"]) > .first() > ) > self.assertEqual(row.name, "Alice") > self.assertEqual(row.age, 10) > self.assertEqual(row.height, None) {code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 372, in test_replace self.assertEqual(row.height, None) > AssertionError: nan != None > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41872) Fix DataFrame createDataframe handling of None
[ https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41872. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39477 [https://github.com/apache/spark/pull/39477] > Fix DataFrame createDataframe handling of None > -- > > Key: SPARK-41872 > URL: https://issues.apache.org/jira/browse/SPARK-41872 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code:java} > row = self.spark.createDataFrame([("Alice", None, None, None)], > schema).fillna(True).first() > self.assertEqual(row.age, None){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 231, in test_fillna > self.assertEqual(row.age, None) > AssertionError: nan != None{code} > > {code:java} > row = ( > self.spark.createDataFrame([("Alice", 10, None)], schema) > .replace(10, 20, subset=["name", "height"]) > .first() > ) > self.assertEqual(row.name, "Alice") > self.assertEqual(row.age, 10) > self.assertEqual(row.height, None) {code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 372, in test_replace self.assertEqual(row.height, None) > AssertionError: nan != None > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
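The parity expectation behind the fix above, restated as a standalone sketch (the schema is assumed for illustration): a null numeric field should round-trip as Python {{None}}, since {{float("nan")}} is a real, non-null value in Spark:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

row = spark.createDataFrame([("Alice", None)], "name string, height double").first()

# NaN and None are distinct: a missing double must come back as None,
# which is the behaviour the failing parity tests above check for.
assert row.height is None
{code}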
[jira] [Assigned] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41958: Assignee: wuyi > Disallow arbitrary custom classpath with proxy user in cluster mode > --- > > Key: SPARK-41958 > URL: https://issues.apache.org/jira/browse/SPARK-41958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > To avoid arbitrary classpaths in the Spark cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41958. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39474 [https://github.com/apache/spark/pull/39474] > Disallow arbitrary custom classpath with proxy user in cluster mode > --- > > Key: SPARK-41958 > URL: https://issues.apache.org/jira/browse/SPARK-41958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.4.0 > > > To avoid arbitrary classpaths in the Spark cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41886) `DataFrame.intersect` doctest output has different order
[ https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656529#comment-17656529 ] jiaan.geng commented on SPARK-41886: I want to take a look! > `DataFrame.intersect` doctest output has different order > > > Key: SPARK-41886 > URL: https://issues.apache.org/jira/browse/SPARK-41886 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > not sure whether this needs to be fixed: > {code:java} > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect > Failed example: > df1.intersect(df2).show() > Expected: > +---+---+ > | C1| C2| > +---+---+ > | b| 3| > | a| 1| > +---+---+ > Got: > +---+---+ > | C1| C2| > +---+---+ > | a| 1| > | b| 3| > +---+---+ > > ** >1 of 3 in pyspark.sql.connect.dataframe.DataFrame.intersect > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
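Since {{intersect}} makes no ordering guarantee, the conventional way to make such a doctest stable is to sort before {{show()}}. A sketch of that convention with assumed example data (whether the eventual fix does exactly this is an assumption):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
df2 = spark.createDataFrame([("a", 1), ("b", 3)], ["C1", "C2"])

# An explicit sort makes the displayed row order deterministic.
df1.intersect(df2).sort("C1", "C2").show()
{code}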
[jira] [Created] (SPARK-41963) Different exception in DataFrame.unpivot
Hyukjin Kwon created SPARK-41963: Summary: Different exception in DataFrame.unpivot Key: SPARK-41963 URL: https://issues.apache.org/jira/browse/SPARK-41963 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} fails as below: {code} with self.subTest(desc="with no value columns"): for values in [[], ()]: with self.subTest(values=values): with self.assertRaisesRegex( Exception, # (AnalysisException, SparkConnectException) r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column " r"needs to be specified for UNPIVOT, all columns specified as ids.*", ): > df.unpivot("id", values, "var", "val").collect() E AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column needs to be specified for UNPIVOT, all columns specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] Unpivot value columns must share a least common type, some types do not: ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)] E Plan: 'Unpivot ArraySeq(id#2947L), List(List(int#2948L), List(double#2949), List(str#2950)), var, [val] E +- Project [id#2939L AS id#2947L, int#2940L AS int#2948L, double#2941 AS double#2949, str#2942 AS str#2950] E +- LocalRelation [id#2939L, int#2940L, double#2941, str#2942] E " {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41963) Different exception message in DataFrame.unpivot
[ https://issues.apache.org/jira/browse/SPARK-41963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41963: - Summary: Different exception message in DataFrame.unpivot (was: Different exception in DataFrame.unpivot) > Different exception message in DataFrame.unpivot > > > Key: SPARK-41963 > URL: https://issues.apache.org/jira/browse/SPARK-41963 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} > fails as below: > {code} > with self.subTest(desc="with no value columns"): > for values in [[], ()]: > with self.subTest(values=values): > with self.assertRaisesRegex( > Exception, # (AnalysisException, > SparkConnectException) > r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one > value column " > r"needs to be specified for UNPIVOT, all columns > specified as ids.*", > ): > > df.unpivot("id", values, "var", "val").collect() > E AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] > At least one value column needs to be specified for UNPIVOT, all columns > specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] > Unpivot value columns must share a least common type, some types do not: > ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)] > E Plan: 'Unpivot ArraySeq(id#2947L), > List(List(int#2948L), List(double#2949), List(str#2950)), var, [val] > E +- Project [id#2939L AS id#2947L, int#2940L AS > int#2948L, double#2941 AS double#2949, str#2942 AS str#2950] > E +- LocalRelation [id#2939L, int#2940L, > double#2941, str#2942] > E " > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
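A standalone repro sketch of the divergence described above (schema taken from the quoted plan): with an empty {{values}} list, classic PySpark reports that value columns are required, while Spark Connect first expands the empty list to all non-id columns and then trips over their mixed types:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10, 2.0, "x")], ["id", "int", "double", "str"])

try:
    # Classic PySpark: UNPIVOT_REQUIRES_VALUE_COLUMNS.
    # Spark Connect (before this fix): UNPIVOT_VALUE_DATA_TYPE_MISMATCH,
    # because [] gets expanded to the mixed-type columns int/double/str.
    df.unpivot("id", [], "var", "val").collect()
except Exception as e:
    print(type(e).__name__, e)
{code}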
[jira] [Assigned] (SPARK-41877) SparkSession.createDataFrame error parity
[ https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41877: Assignee: Apache Spark > SparkSession.createDataFrame error parity > - > > Key: SPARK-41877 > URL: https://issues.apache.org/jira/browse/SPARK-41877 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > (1, 10, 1.0, "one"), > (2, 20, 2.0, "two"), > (3, 30, 3.0, "three"), > ], > ["id", "int", "double", "str"], > ) > with self.subTest(desc="with none identifier"): > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > df.unpivot(None, ["int", "double"], "var", "val"){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 575, in test_unpivot > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > AssertionError: AssertionError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41877) SparkSession.createDataFrame error parity
[ https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41877: Assignee: (was: Apache Spark) > SparkSession.createDataFrame error parity > - > > Key: SPARK-41877 > URL: https://issues.apache.org/jira/browse/SPARK-41877 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > (1, 10, 1.0, "one"), > (2, 20, 2.0, "two"), > (3, 30, 3.0, "three"), > ], > ["id", "int", "double", "str"], > ) > with self.subTest(desc="with none identifier"): > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > df.unpivot(None, ["int", "double"], "var", "val"){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 575, in test_unpivot > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > AssertionError: AssertionError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41877) SparkSession.createDataFrame error parity
[ https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656541#comment-17656541 ] Apache Spark commented on SPARK-41877: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39482 > SparkSession.createDataFrame error parity > - > > Key: SPARK-41877 > URL: https://issues.apache.org/jira/browse/SPARK-41877 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > (1, 10, 1.0, "one"), > (2, 20, 2.0, "two"), > (3, 30, 3.0, "three"), > ], > ["id", "int", "double", "str"], > ) > with self.subTest(desc="with none identifier"): > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > df.unpivot(None, ["int", "double"], "var", "val"){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 575, in test_unpivot > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > AssertionError: AssertionError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41877) SparkSession.createDataFrame error parity
[ https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656543#comment-17656543 ] Apache Spark commented on SPARK-41877: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39482 > SparkSession.createDataFrame error parity > - > > Key: SPARK-41877 > URL: https://issues.apache.org/jira/browse/SPARK-41877 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > (1, 10, 1.0, "one"), > (2, 20, 2.0, "two"), > (3, 30, 3.0, "three"), > ], > ["id", "int", "double", "str"], > ) > with self.subTest(desc="with none identifier"): > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > df.unpivot(None, ["int", "double"], "var", "val"){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 575, in test_unpivot > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > AssertionError: AssertionError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
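Per the failing test above, the behaviour missing on the Connect side is the client-side guard for {{ids=None}}. A purely hypothetical sketch of such a check; the actual change in the linked pull request may look different:
{code:python}
# Hypothetical guard mirroring classic PySpark's assertion; not the
# verbatim patch from https://github.com/apache/spark/pull/39482.
def unpivot(self, ids, values, variableColumnName, valueColumnName):
    assert ids is not None, "ids must not be None"
    ...  # build the Unpivot plan as before
{code}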
[jira] [Created] (SPARK-41964) Add the unsupported function list
Ruifeng Zheng created SPARK-41964: - Summary: Add the unsupported function list Key: SPARK-41964 URL: https://issues.apache.org/jira/browse/SPARK-41964 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41886) `DataFrame.intersect` doctest output has different order
[ https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41886: Assignee: Apache Spark > `DataFrame.intersect` doctest output has different order > > > Key: SPARK-41886 > URL: https://issues.apache.org/jira/browse/SPARK-41886 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > > not sure whether this needs to be fixed: > {code:java} > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect > Failed example: > df1.intersect(df2).show() > Expected: > +---+---+ > | C1| C2| > +---+---+ > | b| 3| > | a| 1| > +---+---+ > Got: > +---+---+ > | C1| C2| > +---+---+ > | a| 1| > | b| 3| > +---+---+ > > ** >1 of 3 in pyspark.sql.connect.dataframe.DataFrame.intersect > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41886) `DataFrame.intersect` doctest output has different order
[ https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656581#comment-17656581 ] Apache Spark commented on SPARK-41886: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39483 > `DataFrame.intersect` doctest output has different order > > > Key: SPARK-41886 > URL: https://issues.apache.org/jira/browse/SPARK-41886 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > not sure whether this needs to be fixed: > {code:java} > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect > Failed example: > df1.intersect(df2).show() > Expected: > +---+---+ > | C1| C2| > +---+---+ > | b| 3| > | a| 1| > +---+---+ > Got: > +---+---+ > | C1| C2| > +---+---+ > | a| 1| > | b| 3| > +---+---+ > > ** >1 of 3 in pyspark.sql.connect.dataframe.DataFrame.intersect > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41886) `DataFrame.intersect` doctest output has different order
[ https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41886: Assignee: (was: Apache Spark) > `DataFrame.intersect` doctest output has different order > > > Key: SPARK-41886 > URL: https://issues.apache.org/jira/browse/SPARK-41886 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > not sure whether this needs to be fixed: > {code:java} > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect > Failed example: > df1.intersect(df2).show() > Expected: > +---+---+ > | C1| C2| > +---+---+ > | b| 3| > | a| 1| > +---+---+ > Got: > +---+---+ > | C1| C2| > +---+---+ > | a| 1| > | b| 3| > +---+---+ > > ** >1 of 3 in pyspark.sql.connect.dataframe.DataFrame.intersect > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-41886) `DataFrame.intersect` doctest output has different order
[ https://issues.apache.org/jira/browse/SPARK-41886 ] jiaan.geng deleted comment on SPARK-41886: was (Author: beliefer): I want to take a look! > `DataFrame.intersect` doctest output has different order > > > Key: SPARK-41886 > URL: https://issues.apache.org/jira/browse/SPARK-41886 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > not sure whether this needs to be fixed: > {code:java} > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect > Failed example: > df1.intersect(df2).show() > Expected: > +---+---+ > | C1| C2| > +---+---+ > | b| 3| > | a| 1| > +---+---+ > Got: > +---+---+ > | C1| C2| > +---+---+ > | a| 1| > | b| 3| > +---+---+ > > ** >1 of 3 in pyspark.sql.connect.dataframe.DataFrame.intersect > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41964) Add the unsupported function list
[ https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41964: Assignee: (was: Apache Spark) > Add the unsupported function list > - > > Key: SPARK-41964 > URL: https://issues.apache.org/jira/browse/SPARK-41964 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41964) Add the unsupported function list
[ https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656592#comment-17656592 ] Apache Spark commented on SPARK-41964: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39484 > Add the unsupported function list > - > > Key: SPARK-41964 > URL: https://issues.apache.org/jira/browse/SPARK-41964 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41964) Add the unsupported function list
[ https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41964: Assignee: Apache Spark > Add the unsupported function list > - > > Key: SPARK-41964 > URL: https://issues.apache.org/jira/browse/SPARK-41964 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
Ruifeng Zheng created SPARK-41965: - Summary: Add DataFrameWriterV2 to PySpark API references Key: SPARK-41965 URL: https://issues.apache.org/jira/browse/SPARK-41965 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
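For reference, the API this ticket documents: {{DataFrame.writeTo}} returns a {{DataFrameWriterV2}} bound to a target table. A short usage sketch in which the table name and format are assumptions:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# writeTo(...) yields a DataFrameWriterV2 for catalog-based (v2) writes.
df.writeTo("catalog.db.events").using("parquet").create()  # create the table
df.writeTo("catalog.db.events").append()                   # append more rows
{code}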
[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41965: Assignee: Apache Spark > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41965: Assignee: (was: Apache Spark) > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656599#comment-17656599 ] Apache Spark commented on SPARK-41965: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39485 > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656598#comment-17656598 ] Apache Spark commented on SPARK-41965: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39485 > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references
Ruifeng Zheng created SPARK-41966: - Summary: Add `CharType` and `TimestampNTZType` to PySpark API references Key: SPARK-41966 URL: https://issues.apache.org/jira/browse/SPARK-41966 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
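Again for reference, a minimal sketch of the two types this ticket adds to the references, used in a schema:
{code:python}
from pyspark.sql.types import CharType, StructField, StructType, TimestampNTZType

schema = StructType([
    StructField("code", CharType(2)),        # fixed-length string
    StructField("ts", TimestampNTZType()),   # timestamp without time zone
])
{code}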
[jira] [Assigned] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41966: Assignee: Apache Spark > Add `CharType` and `TimestampNTZType` to PySpark API references > --- > > Key: SPARK-41966 > URL: https://issues.apache.org/jira/browse/SPARK-41966 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656601#comment-17656601 ] Apache Spark commented on SPARK-41966: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39486 > Add `CharType` and `TimestampNTZType` to PySpark API references > --- > > Key: SPARK-41966 > URL: https://issues.apache.org/jira/browse/SPARK-41966 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41966: Assignee: (was: Apache Spark) > Add `CharType` and `TimestampNTZType` to PySpark API references > --- > > Key: SPARK-41966 > URL: https://issues.apache.org/jira/browse/SPARK-41966 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41966: Assignee: Apache Spark > Add `CharType` and `TimestampNTZType` to PySpark API references > --- > > Key: SPARK-41966 > URL: https://issues.apache.org/jira/browse/SPARK-41966 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41907. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39476 [https://github.com/apache/spark/pull/39476] > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
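The resolved parity test above depends on {{sampleBy}} being deterministic for a fixed seed; restated as a standalone sketch of the quoted test:
{code:python}
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])

# With a fixed seed the sampled subset is fixed, so classic PySpark and
# Spark Connect must agree on the count for the test to pass.
sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
print(sampled.count())
{code}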
[jira] [Assigned] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41907: - Assignee: jiaan.geng > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: jiaan.geng >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41752) UI improvement for nested SQL executions
[ https://issues.apache.org/jira/browse/SPARK-41752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-41752: --- Assignee: Linhong Liu > UI improvement for nested SQL executions > > > Key: SPARK-41752 > URL: https://issues.apache.org/jira/browse/SPARK-41752 > Project: Spark > Issue Type: Task > Components: SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > > In SPARK-41713, the CTAS will trigger a sub-execution to perform the data > insertion, but the UI will display two independent queries, which will confuse > users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41752) UI improvement for nested SQL executions
[ https://issues.apache.org/jira/browse/SPARK-41752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41752. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39268 [https://github.com/apache/spark/pull/39268] > UI improvement for nested SQL executions > > > Key: SPARK-41752 > URL: https://issues.apache.org/jira/browse/SPARK-41752 > Project: Spark > Issue Type: Task > Components: SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > Fix For: 3.4.0 > > > In SPARK-41713, the CTAS will trigger a sub-execution to perform the data > insertion, but the UI will display two independent queries, which will confuse > users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41949) Make stage scheduling support local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-41949: -- Assignee: Weichen Xu > Make stage scheduling support local-cluster mode > > > Key: SPARK-41949 > URL: https://issues.apache.org/jira/browse/SPARK-41949 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Make stage scheduling support local-cluster mode. > This is useful in testing, especially for test code of third-party Python > libraries that depend on PySpark: many tests are written with pytest, but > pytest is hard to integrate with a standalone Spark cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41949) Make stage scheduling support local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu resolved SPARK-41949. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39424 [https://github.com/apache/spark/pull/39424] > Make stage scheduling support local-cluster mode > > > Key: SPARK-41949 > URL: https://issues.apache.org/jira/browse/SPARK-41949 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.4.0 > > > Make stage scheduling support local-cluster mode. > This is useful in testing, especially for test code of third-party Python > libraries that depend on PySpark: many tests are written with pytest, but > pytest is hard to integrate with a standalone Spark cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
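For context on the ticket above, the {{local-cluster}} master string gives tests real executors without a separate deployment. A sketch of starting one from pytest-style test code, with illustrative worker count and memory:
{code:python}
from pyspark.sql import SparkSession

# local-cluster[workers, coresPerWorker, memoryPerWorkerMB] runs an
# in-process standalone cluster, which pytest can start and stop itself.
spark = (
    SparkSession.builder
    .master("local-cluster[2, 1, 1024]")
    .appName("stage-scheduling-test")
    .getOrCreate()
)
{code}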
[jira] [Assigned] (SPARK-41575) Assign name to _LEGACY_ERROR_TEMP_2054
[ https://issues.apache.org/jira/browse/SPARK-41575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41575: Assignee: Haejoon Lee > Assign name to _LEGACY_ERROR_TEMP_2054 > -- > > Key: SPARK-41575 > URL: https://issues.apache.org/jira/browse/SPARK-41575 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`. > > *NOTE:* Please reply to this ticket before starting to work on it, to avoid > two people working on the same ticket at the same time -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41575) Assign name to _LEGACY_ERROR_TEMP_2054
[ https://issues.apache.org/jira/browse/SPARK-41575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41575. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39394 [https://github.com/apache/spark/pull/39394] > Assign name to _LEGACY_ERROR_TEMP_2054 > -- > > Key: SPARK-41575 > URL: https://issues.apache.org/jira/browse/SPARK-41575 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`. > > *NOTE:* Please reply to this ticket before starting to work on it, to avoid > two people working on the same ticket at the same time -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41967) SBT unable to resolve particular packages from the imported maven build
Venkata Sai Akhil Gudesa created SPARK-41967: Summary: SBT unable to resolve particular packages from the imported maven build Key: SPARK-41967 URL: https://issues.apache.org/jira/browse/SPARK-41967 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.4.0 Reporter: Venkata Sai Akhil Gudesa For an unknown reason, an SBT issue prevents the resolution of particular packages from the imported Maven build. This affects Spark-Connect-related projects (see [here|https://github.com/apache/spark/blob/6cae6aa5156655c79eb3f20292ccec6c479c3b1b/project/SparkBuild.scala#L667-L668] and [here|https://github.com/apache/spark/blob/6cae6aa5156655c79eb3f20292ccec6c479c3b1b/project/SparkBuild.scala#L902-L904] for example) by forcing duplicate deps. The POM build works fine when the affected dep (guava, for example) is removed, but the SBT build then fails. Thus, we are forced to explicitly specify the versions of the affected packages so that SBT can parse the version(s) and manually include them (they are also added as deps in Maven to ensure version consistency with SBT). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41822) Setup Scala/JVM Client Connection
[ https://issues.apache.org/jira/browse/SPARK-41822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41822: - Assignee: Venkata Sai Akhil Gudesa > Setup Scala/JVM Client Connection > - > > Key: SPARK-41822 > URL: https://issues.apache.org/jira/browse/SPARK-41822 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > > Set up the gRPC connection for the Scala/JVM client to enable communication > with the Spark Connect server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41822) Setup Scala/JVM Client Connection
[ https://issues.apache.org/jira/browse/SPARK-41822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41822. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39361 [https://github.com/apache/spark/pull/39361] > Setup Scala/JVM Client Connection > - > > Key: SPARK-41822 > URL: https://issues.apache.org/jira/browse/SPARK-41822 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.4.0 > > > Set up the gRPC connection for the Scala/JVM client to enable communication > with the Spark Connect server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41877) SparkSession.createDataFrame error parity
[ https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41877. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39482 [https://github.com/apache/spark/pull/39482] > SparkSession.createDataFrame error parity > - > > Key: SPARK-41877 > URL: https://issues.apache.org/jira/browse/SPARK-41877 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > df = self.spark.createDataFrame( > [ > (1, 10, 1.0, "one"), > (2, 20, 2.0, "two"), > (3, 30, 3.0, "three"), > ], > ["id", "int", "double", "str"], > ) > with self.subTest(desc="with none identifier"): > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > df.unpivot(None, ["int", "double"], "var", "val"){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 575, in test_unpivot > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > AssertionError: AssertionError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41877) SparkSession.createDataFrame error parity
[ https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41877: - Assignee: Hyukjin Kwon > SparkSession.createDataFrame error parity > - > > Key: SPARK-41877 > URL: https://issues.apache.org/jira/browse/SPARK-41877 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > {code:java} > df = self.spark.createDataFrame( > [ > (1, 10, 1.0, "one"), > (2, 20, 2.0, "two"), > (3, 30, 3.0, "three"), > ], > ["id", "int", "double", "str"], > ) > with self.subTest(desc="with none identifier"): > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > df.unpivot(None, ["int", "double"], "var", "val"){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 575, in test_unpivot > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > AssertionError: AssertionError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41886) `DataFrame.intersect` doctest output has different order
[ https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41886. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39483 [https://github.com/apache/spark/pull/39483] > `DataFrame.intersect` doctest output has different order > > > Key: SPARK-41886 > URL: https://issues.apache.org/jira/browse/SPARK-41886 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > not sure whether this needs to be fixed: > {code:java} > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect > Failed example: > df1.intersect(df2).show() > Expected: > +---+---+ > | C1| C2| > +---+---+ > | b| 3| > | a| 1| > +---+---+ > Got: > +---+---+ > | C1| C2| > +---+---+ > | a| 1| > | b| 3| > +---+---+ > > ** >1 of 3 in pyspark.sql.connect.dataframe.DataFrame.intersect > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41886) `DataFrame.intersect` doctest output has different order
[ https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41886: - Assignee: jiaan.geng > `DataFrame.intersect` doctest output has different order > > > Key: SPARK-41886 > URL: https://issues.apache.org/jira/browse/SPARK-41886 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > > not sure whether this needs to be fixed: > {code:java} > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect > Failed example: > df1.intersect(df2).show() > Expected: > +---+---+ > | C1| C2| > +---+---+ > | b| 3| > | a| 1| > +---+---+ > Got: > +---+---+ > | C1| C2| > +---+---+ > | a| 1| > | b| 3| > +---+---+ > > ** >1 of 3 in pyspark.sql.connect.dataframe.DataFrame.intersect > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]
Yang Jie created SPARK-41968: Summary: Refactor ProtobufSerDe to ProtobufSerDe[T] Key: SPARK-41968 URL: https://issues.apache.org/jira/browse/SPARK-41968 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]
[ https://issues.apache.org/jira/browse/SPARK-41968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41968: Assignee: Apache Spark > Refactor ProtobufSerDe to ProtobufSerDe[T] > -- > > Key: SPARK-41968 > URL: https://issues.apache.org/jira/browse/SPARK-41968 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]
[ https://issues.apache.org/jira/browse/SPARK-41968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656701#comment-17656701 ] Apache Spark commented on SPARK-41968: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/39487 > Refactor ProtobufSerDe to ProtobufSerDe[T] > -- > > Key: SPARK-41968 > URL: https://issues.apache.org/jira/browse/SPARK-41968 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]
[ https://issues.apache.org/jira/browse/SPARK-41968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656700#comment-17656700 ] Apache Spark commented on SPARK-41968: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/39487 > Refactor ProtobufSerDe to ProtobufSerDe[T] > -- > > Key: SPARK-41968 > URL: https://issues.apache.org/jira/browse/SPARK-41968 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]
[ https://issues.apache.org/jira/browse/SPARK-41968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41968: Assignee: (was: Apache Spark) > Refactor ProtobufSerDe to ProtobufSerDe[T] > -- > > Key: SPARK-41968 > URL: https://issues.apache.org/jira/browse/SPARK-41968 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on
[ https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-40588: Labels: correctness (was: ) > Sorting issue with partitioned-writing and AQE turned on > > > Key: SPARK-40588 > URL: https://issues.apache.org/jira/browse/SPARK-40588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.3 > Environment: Spark v3.1.3 > Scala v2.12.13 >Reporter: Swetha Baskaran >Assignee: Enrico Minack >Priority: Major > Labels: correctness > Fix For: 3.2.3, 3.3.2 > > Attachments: image-2022-10-16-22-05-47-159.png > > > We are attempting to partition data by a few columns, sort by a particular > _sortCol_ and write out one file per partition. > {code:java} > df > .repartition(col("day"), col("month"), col("year")) > .withColumn("partitionId",spark_partition_id) > .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId) > .sortWithinPartitions("year", "month", "day", "sortCol") > .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId) > .write > .partitionBy("year", "month", "day") > .parquet(path){code} > When inspecting the results, we observe one file per partition, however we > see an _alternating_ pattern of unsorted rows in some files. > {code:java} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code} > Here is a > [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to > reproduce the issue. > Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) > fixes the issue. > I'm working on identifying why AQE affects the sort order. Any leads or > thoughts would be appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on
[ https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656705#comment-17656705 ] Erik Krogen commented on SPARK-40588: - Labeling with 'correctness' since this breaks correctness of output by breaking the sort ordering. > Sorting issue with partitioned-writing and AQE turned on > > > Key: SPARK-40588 > URL: https://issues.apache.org/jira/browse/SPARK-40588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.3 > Environment: Spark v3.1.3 > Scala v2.12.13 >Reporter: Swetha Baskaran >Assignee: Enrico Minack >Priority: Major > Labels: correctness > Fix For: 3.2.3, 3.3.2 > > Attachments: image-2022-10-16-22-05-47-159.png > > > We are attempting to partition data by a few columns, sort by a particular > _sortCol_ and write out one file per partition. > {code:java} > df > .repartition(col("day"), col("month"), col("year")) > .withColumn("partitionId",spark_partition_id) > .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId) > .sortWithinPartitions("year", "month", "day", "sortCol") > .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId) > .write > .partitionBy("year", "month", "day") > .parquet(path){code} > When inspecting the results, we observe one file per partition, however we > see an _alternating_ pattern of unsorted rows in some files. > {code:java} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code} > Here is a > [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to > reproduce the issue. > Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) > fixes the issue. > I'm working on identifying why AQE affects the sort order. Any leads or > thoughts would be appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
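Until a fix is in place, the report itself names the practical workaround. A minimal sketch, assuming the `spark`, `df`, and `path` values from the reproduction above:

{code:python}
# Workaround from the report: disable AQE so the within-partition sort is
# not disturbed between sortWithinPartitions() and the partitioned write.
spark.conf.set("spark.sql.adaptive.enabled", False)

(df.repartition("day", "month", "year")
   .sortWithinPartitions("year", "month", "day", "sortCol")
   .write
   .partitionBy("year", "month", "day")
   .parquet(path))
{code}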
[jira] [Created] (SPARK-41969) Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries
Dongjoon Hyun created SPARK-41969: - Summary: Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries Key: SPARK-41969 URL: https://issues.apache.org/jira/browse/SPARK-41969 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Dongjoon Hyun I saw these failures on the master branch frequently. https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41969) Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries
[ https://issues.apache.org/jira/browse/SPARK-41969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41969: -- Description: I saw these failures on the master branch frequently. https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948 https://github.com/apache/spark/runs/10556299549 was: I saw these failures on the master branch frequently. https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948 > Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries > - > > Key: SPARK-41969 > URL: https://issues.apache.org/jira/browse/SPARK-41969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > I saw these failures on the master branch frequently. > https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948 > https://github.com/apache/spark/runs/10556299549 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41969) Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries
[ https://issues.apache.org/jira/browse/SPARK-41969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41969: -- Description: I saw these failures on the master branch frequently. https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948 https://github.com/apache/spark/runs/10556299549 https://github.com/apache/spark/runs/10551101022 was: I saw these failures on the master branch frequently. https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948 https://github.com/apache/spark/runs/10556299549 > Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries > - > > Key: SPARK-41969 > URL: https://issues.apache.org/jira/browse/SPARK-41969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > I saw these failures on the master branch frequently. > https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948 > https://github.com/apache/spark/runs/10556299549 > https://github.com/apache/spark/runs/10551101022 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41969) Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries
[ https://issues.apache.org/jira/browse/SPARK-41969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41969: -- Description: I saw these failures on the master branch frequently. https://github.com/apache/spark/runs/10560025461 https://github.com/apache/spark/runs/10556299549 https://github.com/apache/spark/runs/10551101022 was: I saw these failures on the master branch frequently. https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948 https://github.com/apache/spark/runs/10556299549 https://github.com/apache/spark/runs/10551101022 > Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries > - > > Key: SPARK-41969 > URL: https://issues.apache.org/jira/browse/SPARK-41969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > I saw these failures on the master branch frequently. > https://github.com/apache/spark/runs/10560025461 > https://github.com/apache/spark/runs/10556299549 > https://github.com/apache/spark/runs/10551101022 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true
[ https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38173: -- Fix Version/s: 3.2.4 > Quoted column cannot be recognized correctly when quotedRegexColumnNames is > true > > > Key: SPARK-38173 > URL: https://issues.apache.org/jira/browse/SPARK-38173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Tongwei >Assignee: Tongwei >Priority: Major > Fix For: 3.3.0, 3.2.4 > > > When spark.sql.parser.quotedRegexColumnNames=true > {code:java} > SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code} > The above query will throw an exception > {code:java} > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'multiply' > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in > expression 'multiply' > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656) > {code} > It works fine in hive > {code:java} > 0: jdbc:hive2://hiveserver-inc.> set hive.support.quoted.identifiers=
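For context, the failure is reproducible from a plain PySpark session; this sketch simply restates the report's SQL with the config applied at runtime:

{code:python}
# With quotedRegexColumnNames=true, the quoted `C1` is parsed as a regex
# column expansion rather than a literal identifier, so using it inside an
# arithmetic expression fails analysis on affected versions.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", True)

spark.sql(
    "SELECT `(C3)?+.+`, `C1` * C2 FROM (SELECT 3 AS C1, 2 AS C2, 1 AS C3) T"
).show()  # raises AnalysisException: Invalid usage of '*' in expression 'multiply'
{code}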
[jira] [Created] (SPARK-41970) SparkPath
David Lewis created SPARK-41970: --- Summary: SparkPath Key: SPARK-41970 URL: https://issues.apache.org/jira/browse/SPARK-41970 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: David Lewis Today, Spark represents file paths in various ways. Sometimes they are Hadoop `Path`s, sometimes they are `Path.toString`s, and sometimes they are `Path.toUri.toString`s. This discrepancy means that Spark does not always work when user provided strings have special characters. Sometimes Spark will try to create a URI with an unescaped string; sometimes Spark will double-escape a path and try to access the wrong file. This issue proposes a new `SparkPath` class meant to provide type safety when Spark is dealing with paths. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
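The double-escaping hazard the description refers to can be shown without Spark at all; a small sketch using only the Python standard library and a made-up path:

{code:python}
from urllib.parse import quote, unquote

# A user-supplied path containing characters that are special in URIs.
raw = "/data/dir with space/part=a%b/file.parquet"

# Escaping once yields a valid URI path; escaping the already-escaped
# string again ("double escaping") points at a different file.
once = quote(raw)
twice = quote(once)
print(once)                    # /data/dir%20with%20space/part%3Da%25b/file.parquet
print(unquote(twice) == once)  # True: unescaping recovers `once`, not `raw`
{code}

A dedicated path type makes it explicit in the type system whether a given string has been escaped yet, which is exactly the ambiguity the description blames for the misbehavior.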
[jira] [Commented] (SPARK-41970) SparkPath
[ https://issues.apache.org/jira/browse/SPARK-41970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656986#comment-17656986 ] Apache Spark commented on SPARK-41970: -- User 'databricks-david-lewis' has created a pull request for this issue: https://github.com/apache/spark/pull/39488 > SparkPath > - > > Key: SPARK-41970 > URL: https://issues.apache.org/jira/browse/SPARK-41970 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: David Lewis >Priority: Major > > Today, Spark represents file paths in various ways. Sometimes they are Hadoop > `Path`s, sometimes they are `Path.toString`s, and sometimes they are > `Path.toUri.toString`s. > This discrepancy means that Spark does not always work when user provided > strings have special characters. Sometimes Spark will try to create a URI > with an unescaped string; sometimes Spark will double-escape a path and try > to access the wrong file. > > This issue proposes a new `SparkPath` class meant to provide type safety when > Spark is dealing with paths. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41970) SparkPath
[ https://issues.apache.org/jira/browse/SPARK-41970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41970: Assignee: (was: Apache Spark) > SparkPath > - > > Key: SPARK-41970 > URL: https://issues.apache.org/jira/browse/SPARK-41970 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: David Lewis >Priority: Major > > Today, Spark represents file paths in various ways. Sometimes they are Hadoop > `Path`s, sometimes they are `Path.toString`s, and sometimes they are > `Path.toUri.toString`s. > This discrepancy means that Spark does not always work when user provided > strings have special characters. Sometimes Spark will try to create a URI > with an unescaped string; sometimes Spark will double-escape a path and try > to access the wrong file. > > This issue proposes a new `SparkPath` class meant to provide type safety when > Spark is dealing with paths. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41970) SparkPath
[ https://issues.apache.org/jira/browse/SPARK-41970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41970: Assignee: Apache Spark > SparkPath > - > > Key: SPARK-41970 > URL: https://issues.apache.org/jira/browse/SPARK-41970 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: David Lewis >Assignee: Apache Spark >Priority: Major > > Today, Spark represents file paths in various ways. Sometimes they are Hadoop > `Path`s, sometimes they are `Path.toString`s, and sometimes they are > `Path.toUri.toString`s. > This discrepancy means that Spark does not always work when user provided > strings have special characters. Sometimes Spark will try to create a URI > with an unescaped string; sometimes Spark will double-escape a path and try > to access the wrong file. > > This issue proposes a new `SparkPath` class meant to provide type safety when > Spark is dealing with paths. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41970) SparkPath
[ https://issues.apache.org/jira/browse/SPARK-41970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656988#comment-17656988 ] Apache Spark commented on SPARK-41970: -- User 'databricks-david-lewis' has created a pull request for this issue: https://github.com/apache/spark/pull/39488 > SparkPath > - > > Key: SPARK-41970 > URL: https://issues.apache.org/jira/browse/SPARK-41970 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: David Lewis >Priority: Major > > Today, Spark represents file paths in various ways. Sometimes they are Hadoop > `Path`s, sometimes they are `Path.toString`s, and sometimes they are > `Path.toUri.toString`s. > This discrepancy means that Spark does not always work when user provided > strings have special characters. Sometimes Spark will try to create a URI > with an unescaped string; sometimes Spark will double-escape a path and try > to access the wrong file. > > This issue proposes a new `SparkPath` class meant to provide type safety when > Spark is dealing with paths. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41752) UI improvement for nested SQL executions
[ https://issues.apache.org/jira/browse/SPARK-41752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657015#comment-17657015 ] Apache Spark commented on SPARK-41752: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39489 > UI improvement for nested SQL executions > > > Key: SPARK-41752 > URL: https://issues.apache.org/jira/browse/SPARK-41752 > Project: Spark > Issue Type: Task > Components: SQL, Web UI >Affects Versions: 3.4.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > Fix For: 3.4.0 > > > In SPARK-41713, the CTAS will trigger a sub-execution to perform the data > insertion, but the UI will display two independent queries, which will confuse > users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
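For a concrete picture, a CTAS statement like the following (table names are illustrative) runs its insertion phase as a sub-execution under SPARK-41713, which is what previously surfaced in the UI as two seemingly unrelated queries:

{code:python}
# One logical statement, but the data insertion runs as a nested
# sub-execution with its own execution id.
spark.sql("CREATE TABLE target_t AS SELECT id, id * 2 AS doubled FROM source_t")
{code}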
[jira] [Assigned] (SPARK-41964) Add the unsupported function list
[ https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41964: Assignee: Ruifeng Zheng > Add the unsupported function list > - > > Key: SPARK-41964 > URL: https://issues.apache.org/jira/browse/SPARK-41964 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41964) Add the unsupported function list
[ https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41964. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39484 [https://github.com/apache/spark/pull/39484] > Add the unsupported function list > - > > Key: SPARK-41964 > URL: https://issues.apache.org/jira/browse/SPARK-41964 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41966. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39486 [https://github.com/apache/spark/pull/39486] > Add `CharType` and `TimestampNTZType` to PySpark API references > --- > > Key: SPARK-41966 > URL: https://issues.apache.org/jira/browse/SPARK-41966 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41965: Assignee: Ruifeng Zheng > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41965. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39485 [https://github.com/apache/spark/pull/39485] > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41876) Implement DataFrame `toLocalIterator`
[ https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657028#comment-17657028 ] jiaan.geng commented on SPARK-41876: I will take a look! > Implement DataFrame `toLocalIterator` > - > > Key: SPARK-41876 > URL: https://issues.apache.org/jira/browse/SPARK-41876 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > schema = StructType( > [StructField("i", StringType(), True), StructField("j", IntegerType(), > True)] > ) > df = self.spark.createDataFrame([("a", 1)], schema) > schema1 = StructType([StructField("j", StringType()), StructField("i", > StringType())]) > df1 = df.to(schema1) > self.assertEqual(schema1, df1.schema) > self.assertEqual(df.count(), df1.count()) > schema2 = StructType([StructField("j", LongType())]) > df2 = df.to(schema2) > self.assertEqual(schema2, df2.schema) > self.assertEqual(df.count(), df2.count()) > schema3 = StructType([StructField("struct", schema1, False)]) > df3 = df.select(struct("i", "j").alias("struct")).to(schema3) > self.assertEqual(schema3, df3.schema) > self.assertEqual(df.count(), df3.count()) > # incompatible field nullability > schema4 = StructType([StructField("j", LongType(), False)]) > self.assertRaisesRegex( > AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1486, in test_to > self.assertRaisesRegex( > AssertionError: AnalysisException not raised by {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
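For reference, the API being brought to Spark Connect already exists in classic PySpark; a short usage sketch of its semantics:

{code:python}
# toLocalIterator() streams results to the driver one partition at a time,
# so only a single partition ever needs to fit in driver memory.
df = spark.range(10)
for row in df.toLocalIterator():
    print(row.id)
{code}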
[jira] [Commented] (SPARK-41589) PyTorch Distributor
[ https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657029#comment-17657029 ] Apache Spark commented on SPARK-41589: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39490 > PyTorch Distributor > --- > > Key: SPARK-41589 > URL: https://issues.apache.org/jira/browse/SPARK-41589 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Major > > This is a project to make it easier for PySpark users to distribute PyTorch > code using PySpark. The corresponding [Design > Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing] > can give more context. This was a project determined by the Databricks ML > Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] > for more context. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41887) Support DataFrame hint parameter to be list
[ https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657031#comment-17657031 ] Apache Spark commented on SPARK-41887: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39491 > Support DataFrame hint parameter to be list > --- > > Key: SPARK-41887 > URL: https://issues.apache.org/jira/browse/SPARK-41887 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.range(10e10).toDF("id") > such_a_nice_list = ["itworks1", "itworks2", "itworks3"] > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41887) Support DataFrame hint parameter to be list
[ https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41887: Assignee: Apache Spark > Support DataFrame hint parameter to be list > --- > > Key: SPARK-41887 > URL: https://issues.apache.org/jira/browse/SPARK-41887 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > df = self.spark.range(10e10).toDF("id") > such_a_nice_list = ["itworks1", "itworks2", "itworks3"] > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41887) Support DataFrame hint parameter to be list
[ https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657032#comment-17657032 ] Apache Spark commented on SPARK-41887: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39491 > Support DataFrame hint parameter to be list > --- > > Key: SPARK-41887 > URL: https://issues.apache.org/jira/browse/SPARK-41887 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.range(10e10).toDF("id") > such_a_nice_list = ["itworks1", "itworks2", "itworks3"] > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41887) Support DataFrame hint parameter to be list
[ https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41887: Assignee: (was: Apache Spark) > Support DataFrame hint parameter to be list > --- > > Key: SPARK-41887 > URL: https://issues.apache.org/jira/browse/SPARK-41887 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.range(10e10).toDF("id") > such_a_nice_list = ["itworks1", "itworks2", "itworks3"] > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41876) Implement DataFrame `toLocalIterator`
[ https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657040#comment-17657040 ] Apache Spark commented on SPARK-41876: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39492 > Implement DataFrame `toLocalIterator` > - > > Key: SPARK-41876 > URL: https://issues.apache.org/jira/browse/SPARK-41876 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > schema = StructType( > [StructField("i", StringType(), True), StructField("j", IntegerType(), > True)] > ) > df = self.spark.createDataFrame([("a", 1)], schema) > schema1 = StructType([StructField("j", StringType()), StructField("i", > StringType())]) > df1 = df.to(schema1) > self.assertEqual(schema1, df1.schema) > self.assertEqual(df.count(), df1.count()) > schema2 = StructType([StructField("j", LongType())]) > df2 = df.to(schema2) > self.assertEqual(schema2, df2.schema) > self.assertEqual(df.count(), df2.count()) > schema3 = StructType([StructField("struct", schema1, False)]) > df3 = df.select(struct("i", "j").alias("struct")).to(schema3) > self.assertEqual(schema3, df3.schema) > self.assertEqual(df.count(), df3.count()) > # incompatible field nullability > schema4 = StructType([StructField("j", LongType(), False)]) > self.assertRaisesRegex( > AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1486, in test_to > self.assertRaisesRegex( > AssertionError: AnalysisException not raised by {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41876) Implement DataFrame `toLocalIterator`
[ https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41876: Assignee: Apache Spark > Implement DataFrame `toLocalIterator` > - > > Key: SPARK-41876 > URL: https://issues.apache.org/jira/browse/SPARK-41876 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > schema = StructType( > [StructField("i", StringType(), True), StructField("j", IntegerType(), > True)] > ) > df = self.spark.createDataFrame([("a", 1)], schema) > schema1 = StructType([StructField("j", StringType()), StructField("i", > StringType())]) > df1 = df.to(schema1) > self.assertEqual(schema1, df1.schema) > self.assertEqual(df.count(), df1.count()) > schema2 = StructType([StructField("j", LongType())]) > df2 = df.to(schema2) > self.assertEqual(schema2, df2.schema) > self.assertEqual(df.count(), df2.count()) > schema3 = StructType([StructField("struct", schema1, False)]) > df3 = df.select(struct("i", "j").alias("struct")).to(schema3) > self.assertEqual(schema3, df3.schema) > self.assertEqual(df.count(), df3.count()) > # incompatible field nullability > schema4 = StructType([StructField("j", LongType(), False)]) > self.assertRaisesRegex( > AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1486, in test_to > self.assertRaisesRegex( > AssertionError: AnalysisException not raised by {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-41876) Implement DataFrame `toLocalIterator`
[ https://issues.apache.org/jira/browse/SPARK-41876 ] jiaan.geng deleted comment on SPARK-41876: was (Author: beliefer): I will take a look! > Implement DataFrame `toLocalIterator` > - > > Key: SPARK-41876 > URL: https://issues.apache.org/jira/browse/SPARK-41876 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > schema = StructType( > [StructField("i", StringType(), True), StructField("j", IntegerType(), > True)] > ) > df = self.spark.createDataFrame([("a", 1)], schema) > schema1 = StructType([StructField("j", StringType()), StructField("i", > StringType())]) > df1 = df.to(schema1) > self.assertEqual(schema1, df1.schema) > self.assertEqual(df.count(), df1.count()) > schema2 = StructType([StructField("j", LongType())]) > df2 = df.to(schema2) > self.assertEqual(schema2, df2.schema) > self.assertEqual(df.count(), df2.count()) > schema3 = StructType([StructField("struct", schema1, False)]) > df3 = df.select(struct("i", "j").alias("struct")).to(schema3) > self.assertEqual(schema3, df3.schema) > self.assertEqual(df.count(), df3.count()) > # incompatible field nullability > schema4 = StructType([StructField("j", LongType(), False)]) > self.assertRaisesRegex( > AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1486, in test_to > self.assertRaisesRegex( > AssertionError: AnalysisException not raised by {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41876) Implement DataFrame `toLocalIterator`
[ https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41876: Assignee: (was: Apache Spark) > Implement DataFrame `toLocalIterator` > - > > Key: SPARK-41876 > URL: https://issues.apache.org/jira/browse/SPARK-41876 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > schema = StructType( > [StructField("i", StringType(), True), StructField("j", IntegerType(), > True)] > ) > df = self.spark.createDataFrame([("a", 1)], schema) > schema1 = StructType([StructField("j", StringType()), StructField("i", > StringType())]) > df1 = df.to(schema1) > self.assertEqual(schema1, df1.schema) > self.assertEqual(df.count(), df1.count()) > schema2 = StructType([StructField("j", LongType())]) > df2 = df.to(schema2) > self.assertEqual(schema2, df2.schema) > self.assertEqual(df.count(), df2.count()) > schema3 = StructType([StructField("struct", schema1, False)]) > df3 = df.select(struct("i", "j").alias("struct")).to(schema3) > self.assertEqual(schema3, df3.schema) > self.assertEqual(df.count(), df3.count()) > # incompatible field nullability > schema4 = StructType([StructField("j", LongType(), False)]) > self.assertRaisesRegex( > AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1486, in test_to > self.assertRaisesRegex( > AssertionError: AnalysisException not raised by {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41838) DataFrame.show() fix map printing
[ https://issues.apache.org/jira/browse/SPARK-41838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657041#comment-17657041 ] jiaan.geng commented on SPARK-41838: I want to take a look! > DataFrame.show() fix map printing > - > > Key: SPARK-41838 > URL: https://issues.apache.org/jira/browse/SPARK-41838 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1472, in pyspark.sql.connect.functions.posexplode_outer > Failed example: > df.select("id", "a_map", posexplode_outer("an_array")).show() > Expected: > +---+--+++ > | id| a_map| pos| col| > +---+--+++ > | 1|{x -> 1.0}| 0| foo| > | 1|{x -> 1.0}| 1| bar| > | 2| {}|null|null| > | 3| null|null|null| > +---+--+++ > Got: > +---+--+++ > | id| a_map| pos| col| > +---+--+++ > | 1| {1.0}| 0| foo| > | 1| {1.0}| 1| bar| > | 2|{null}|null|null| > | 3| null|null|null| > +---+--+++ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-41965: -- Assignee: (was: Ruifeng Zheng) Reverted in https://github.com/apache/spark/commit/d8ea5ee7697dc1df720d0faa3e12ccbb94a1f0f0 > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41965: Assignee: Apache Spark > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41965: - Fix Version/s: (was: 3.4.0) > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41965: Assignee: (was: Apache Spark) > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on
Ruifeng Zheng created SPARK-41971: - Summary: `toPandas` should support duplicate field names when arrow-optimization is on Key: SPARK-41971 URL: https://issues.apache.org/jira/browse/SPARK-41971 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng toPandas supports duplicate column names, but for a struct column, it does not support duplicate field names. {code:java} In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False) In [28]: spark.sql("select 1 v, 1 v").toPandas() Out[28]: v v 0 1 1 In [29]: spark.sql("select struct(1 v, 1 v)").toPandas() Out[29]: struct(1 AS v, 1 AS v) 0 (1, 1) In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True) In [31]: spark.sql("select 1 v, 1 v").toPandas() Out[31]: v v 0 1 1 In [32]: spark.sql("select struct(1 v, 1 v)").toPandas() /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation. Ran out of field metadata, likely malformed warn(msg) --- ArrowInvalid Traceback (most recent call last) Cell In[32], line 1 > 1 spark.sql("select struct(1 v, 1 v)").toPandas() File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in PandasConversionMixin.toPandas(self) 141 tmp_column_names = ["col_{}".format(i) for i in range(len(self.columns))] 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled() --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow( 144 split_batches=self_destruct 145 ) 146 if len(batches) > 0: 147 table = pyarrow.Table.from_batches(batches) File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in PandasConversionMixin._collect_as_arrow(self, split_batches) 356 results.append(batch_or_indices) 357 else: --> 358 results = list(batch_stream) 359 finally: 360 # Join serving thread and raise any exceptions from collectAsArrowToPython 361 jsocket_auth_server.getResult() File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in ArrowCollectSerializer.load_stream(self, stream) 50 """ 51 Load a stream of un-ordered Arrow RecordBatches, where the last iteration yields 52 a list of indices that can be used to put the RecordBatches in the correct order. 53 """ 54 # load the batches ---> 55 for batch in self.serializer.load_stream(stream): 56 yield batch 58 # load the batch order indices or propagate any error that occurred in the JVM File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in ArrowStreamSerializer.load_stream(self, stream) 95 import pyarrow as pa 97 reader = pa.ipc.open_stream(stream) ---> 98 for batch in reader: 99 yield batch File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, in __iter__() File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, in pyarrow.lib.RecordBatchReader.read_next_batch() File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status() ArrowInvalid: Ran out of field metadata, likely malformed {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
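As the reproduction shows, the non-Arrow path already handles the duplicate field names. Until the Arrow path is fixed, two workaround sketches consistent with that repro:

{code:python}
# 1) Fall back to the non-Arrow conversion for such frames.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
pdf = spark.sql("select struct(1 v, 1 v)").toPandas()

# 2) Keep Arrow on, but disambiguate the struct field names first.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
pdf2 = spark.sql("select struct(1 v1, 1 v2) s").toPandas()
{code}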
[jira] [Commented] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on
[ https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657059#comment-17657059 ] Ruifeng Zheng commented on SPARK-41971: --- I think that is because something is wrong in `ArrowConverter`. In Spark, a schema is just a StructType, but in Arrow this is not the case: a schema is a class distinct from a data type. This difference may be the cause. > `toPandas` should support duplicate field names when arrow-optimization is on > - > > Key: SPARK-41971 > URL: https://issues.apache.org/jira/browse/SPARK-41971 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > toPandas supports duplicate column names, but for a struct column, it does not > support duplicate field names. > {code:java} > In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False) > In [28]: spark.sql("select 1 v, 1 v").toPandas() > Out[28]: >v v > 0 1 1 > In [29]: spark.sql("select struct(1 v, 1 v)").toPandas() > Out[29]: > struct(1 AS v, 1 AS v) > 0 (1, 1) > In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True) > In [31]: spark.sql("select 1 v, 1 v").toPandas() > Out[31]: >v v > 0 1 1 > In [32]: spark.sql("select struct(1 v, 1 v)").toPandas() > /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > Ran out of field metadata, likely malformed > warn(msg) > --- > ArrowInvalid Traceback (most recent call last) > Cell In[32], line 1 > > 1 spark.sql("select struct(1 v, 1 v)").toPandas() > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in > PandasConversionMixin.toPandas(self) > 141 tmp_column_names = ["col_{}".format(i) for i in > range(len(self.columns))] > 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled() > --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow( > 144 split_batches=self_destruct > 145 ) > 146 if len(batches) > 0: > 147 table = pyarrow.Table.from_batches(batches) > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in > PandasConversionMixin._collect_as_arrow(self, split_batches) > 356 results.append(batch_or_indices) > 357 else: > --> 358 results = list(batch_stream) > 359 finally: > 360 # Join serving thread and raise any exceptions from > collectAsArrowToPython > 361 jsocket_auth_server.getResult() > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in > ArrowCollectSerializer.load_stream(self, stream) > 50 """ > 51 Load a stream of un-ordered Arrow RecordBatches, where the last > iteration yields > 52 a list of indices that can be used to put the RecordBatches in the > correct order. 
> 53 """ > 54 # load the batches > ---> 55 for batch in self.serializer.load_stream(stream): > 56 yield batch > 58 # load the batch order indices or propagate any error that occurred > in the JVM > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in > ArrowStreamSerializer.load_stream(self, stream) > 95 import pyarrow as pa > 97 reader = pa.ipc.open_stream(stream) > ---> 98 for batch in reader: > 99 yield batch > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, > in __iter__() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, > in pyarrow.lib.RecordBatchReader.read_next_batch() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, > in pyarrow.lib.check_status() > ArrowInvalid: Ran out of field metadata, likely malformed > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
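To the comment's point that the limitation sits on the conversion side rather than in Arrow's data model: pyarrow itself will construct duplicate field names both in a schema and in a struct type (a quick check, assuming a recent pyarrow):

{code:python}
import pyarrow as pa

# Arrow does not require field names to be unique, so a schema or struct
# with two fields named "v" is representable; the failure therefore points
# at how the batch field metadata is emitted, as the comment suspects.
print(pa.schema([("v", pa.int32()), ("v", pa.int32())]))
print(pa.struct([("v", pa.int32()), ("v", pa.int32())]))
{code}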
[jira] [Updated] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on
[ https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41971: -- Priority: Minor (was: Major) > `toPandas` should support duplicate field names when arrow-optimization is on > - > > Key: SPARK-41971 > URL: https://issues.apache.org/jira/browse/SPARK-41971 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > > toPandas supports duplicate column names, but for a struct column, it does not > support duplicate field names. > {code:java} > In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False) > In [28]: spark.sql("select 1 v, 1 v").toPandas() > Out[28]: >v v > 0 1 1 > In [29]: spark.sql("select struct(1 v, 1 v)").toPandas() > Out[29]: > struct(1 AS v, 1 AS v) > 0 (1, 1) > In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True) > In [31]: spark.sql("select 1 v, 1 v").toPandas() > Out[31]: >v v > 0 1 1 > In [32]: spark.sql("select struct(1 v, 1 v)").toPandas() > /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > Ran out of field metadata, likely malformed > warn(msg) > --- > ArrowInvalid Traceback (most recent call last) > Cell In[32], line 1 > > 1 spark.sql("select struct(1 v, 1 v)").toPandas() > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in > PandasConversionMixin.toPandas(self) > 141 tmp_column_names = ["col_{}".format(i) for i in > range(len(self.columns))] > 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled() > --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow( > 144 split_batches=self_destruct > 145 ) > 146 if len(batches) > 0: > 147 table = pyarrow.Table.from_batches(batches) > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in > PandasConversionMixin._collect_as_arrow(self, split_batches) > 356 results.append(batch_or_indices) > 357 else: > --> 358 results = list(batch_stream) > 359 finally: > 360 # Join serving thread and raise any exceptions from > collectAsArrowToPython > 361 jsocket_auth_server.getResult() > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in > ArrowCollectSerializer.load_stream(self, stream) > 50 """ > 51 Load a stream of un-ordered Arrow RecordBatches, where the last > iteration yields > 52 a list of indices that can be used to put the RecordBatches in the > correct order. 
> 53 """ > 54 # load the batches > ---> 55 for batch in self.serializer.load_stream(stream): > 56 yield batch > 58 # load the batch order indices or propagate any error that occurred > in the JVM > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in > ArrowStreamSerializer.load_stream(self, stream) > 95 import pyarrow as pa > 97 reader = pa.ipc.open_stream(stream) > ---> 98 for batch in reader: > 99 yield batch > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, > in __iter__() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, > in pyarrow.lib.RecordBatchReader.read_next_batch() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, > in pyarrow.lib.check_status() > ArrowInvalid: Ran out of field metadata, likely malformed > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on
[ https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41971: -- Issue Type: Bug (was: Improvement) > `toPandas` should support duplicate field names when arrow-optimization is on > - > > Key: SPARK-41971 > URL: https://issues.apache.org/jira/browse/SPARK-41971 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > toPandas supports duplicate column names, but for a struct column it does not > support duplicate field names. > {code:java} > In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False) > In [28]: spark.sql("select 1 v, 1 v").toPandas() > Out[28]: >v v > 0 1 1 > In [29]: spark.sql("select struct(1 v, 1 v)").toPandas() > Out[29]: > struct(1 AS v, 1 AS v) > 0 (1, 1) > In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True) > In [31]: spark.sql("select 1 v, 1 v").toPandas() > Out[31]: >v v > 0 1 1 > In [32]: spark.sql("select struct(1 v, 1 v)").toPandas() > /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > Ran out of field metadata, likely malformed > warn(msg) > --- > ArrowInvalid Traceback (most recent call last) > Cell In[32], line 1 > > 1 spark.sql("select struct(1 v, 1 v)").toPandas() > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in > PandasConversionMixin.toPandas(self) > 141 tmp_column_names = ["col_{}".format(i) for i in > range(len(self.columns))] > 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled() > --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow( > 144 split_batches=self_destruct > 145 ) > 146 if len(batches) > 0: > 147 table = pyarrow.Table.from_batches(batches) > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in > PandasConversionMixin._collect_as_arrow(self, split_batches) > 356 results.append(batch_or_indices) > 357 else: > --> 358 results = list(batch_stream) > 359 finally: > 360 # Join serving thread and raise any exceptions from > collectAsArrowToPython > 361 jsocket_auth_server.getResult() > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in > ArrowCollectSerializer.load_stream(self, stream) > 50 """ > 51 Load a stream of un-ordered Arrow RecordBatches, where the last > iteration yields > 52 a list of indices that can be used to put the RecordBatches in the > correct order. 
> 53 """ > 54 # load the batches > ---> 55 for batch in self.serializer.load_stream(stream): > 56 yield batch > 58 # load the batch order indices or propagate any error that occurred > in the JVM > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in > ArrowStreamSerializer.load_stream(self, stream) > 95 import pyarrow as pa > 97 reader = pa.ipc.open_stream(stream) > ---> 98 for batch in reader: > 99 yield batch > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, > in __iter__() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, > in pyarrow.lib.RecordBatchReader.read_next_batch() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, > in pyarrow.lib.check_status() > ArrowInvalid: Ran out of field metadata, likely malformed > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on
[ https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657059#comment-17657059 ] Ruifeng Zheng edited comment on SPARK-41971 at 1/11/23 3:17 AM: I think that is due to something wrong in `ArrowConverter`. In Spark, a schema is just a StructType, but in Arrow that is not the case: a schema is a class distinct from a data type. This difference may be the cause. was (Author: podongfeng): I think that is due to something is wrong in `ArrowConverter`. In Spark, a schema is just a StructType, but in arrow this is not the case, a schema is a class other than datatype. This difference maybe the cause. > `toPandas` should support duplicate field names when arrow-optimization is on > - > > Key: SPARK-41971 > URL: https://issues.apache.org/jira/browse/SPARK-41971 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > > toPandas supports duplicate column names, but for a struct column it does not > support duplicate field names. > {code:java} > In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False) > In [28]: spark.sql("select 1 v, 1 v").toPandas() > Out[28]: >v v > 0 1 1 > In [29]: spark.sql("select struct(1 v, 1 v)").toPandas() > Out[29]: > struct(1 AS v, 1 AS v) > 0 (1, 1) > In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True) > In [31]: spark.sql("select 1 v, 1 v").toPandas() > Out[31]: >v v > 0 1 1 > In [32]: spark.sql("select struct(1 v, 1 v)").toPandas() > /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: > UserWarning: toPandas attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached > the error below and can not continue. Note that > 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect > on failures in the middle of computation. > Ran out of field metadata, likely malformed > warn(msg) > --- > ArrowInvalid Traceback (most recent call last) > Cell In[32], line 1 > > 1 spark.sql("select struct(1 v, 1 v)").toPandas() > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in > PandasConversionMixin.toPandas(self) > 141 tmp_column_names = ["col_{}".format(i) for i in > range(len(self.columns))] > 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled() > --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow( > 144 split_batches=self_destruct > 145 ) > 146 if len(batches) > 0: > 147 table = pyarrow.Table.from_batches(batches) > File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in > PandasConversionMixin._collect_as_arrow(self, split_batches) > 356 results.append(batch_or_indices) > 357 else: > --> 358 results = list(batch_stream) > 359 finally: > 360 # Join serving thread and raise any exceptions from > collectAsArrowToPython > 361 jsocket_auth_server.getResult() > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in > ArrowCollectSerializer.load_stream(self, stream) > 50 """ > 51 Load a stream of un-ordered Arrow RecordBatches, where the last > iteration yields > 52 a list of indices that can be used to put the RecordBatches in the > correct order. 
> 53 """ > 54 # load the batches > ---> 55 for batch in self.serializer.load_stream(stream): > 56 yield batch > 58 # load the batch order indices or propagate any error that occurred > in the JVM > File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in > ArrowStreamSerializer.load_stream(self, stream) > 95 import pyarrow as pa > 97 reader = pa.ipc.open_stream(stream) > ---> 98 for batch in reader: > 99 yield batch > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, > in __iter__() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, > in pyarrow.lib.RecordBatchReader.read_next_batch() > File > ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, > in pyarrow.lib.check_status() > ArrowInvalid: Ran out of field metadata, likely malformed > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41879) `DataFrame.collect` should support nested types
[ https://issues.apache.org/jira/browse/SPARK-41879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41879. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39462 [https://github.com/apache/spark/pull/39462] > `DataFrame.collect` should support nested types > --- > > Key: SPARK-41879 > URL: https://issues.apache.org/jira/browse/SPARK-41879 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/functions.py", > line 1578, in pyspark.sql.connect.functions.struct > Failed example: > df.select(struct('age', 'name').alias("struct")).collect() > Expected: > [Row(struct=Row(age=2, name='Alice')), Row(struct=Row(age=5, name='Bob'))] > Got: > [Row(struct={'age': 2, 'name': 'Alice'}), Row(struct={'age': 5, 'name': > 'Bob'})] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
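For context, a minimal sketch of the behavior this fix targets, assuming a session bound to `spark` and invented sample data:
{code:python}
from pyspark.sql import Row
from pyspark.sql.functions import struct

df = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")])
rows = df.select(struct("age", "name").alias("s")).collect()

# With the fix, nested struct values come back as Row objects rather than
# plain dicts, so attribute-style access on nested fields works:
assert rows[0].s.name == "Alice"
{code}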
[jira] [Commented] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references
[ https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657066#comment-17657066 ] Apache Spark commented on SPARK-41965: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39493 > Add DataFrameWriterV2 to PySpark API references > --- > > Key: SPARK-41965 > URL: https://issues.apache.org/jira/browse/SPARK-41965 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
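For reference, the API surface these docs will cover looks roughly like the sketch below. The table identifier `catalog.db.events` is hypothetical and assumes a configured V2 catalog:
{code:python}
from pyspark.sql.functions import col

# df.writeTo(...) returns a DataFrameWriterV2 for catalog-table writes.
(df.writeTo("catalog.db.events")
   .using("parquet")
   .partitionedBy(col("date"))
   .create())

# Subsequent writes can append to the same table.
df.writeTo("catalog.db.events").append()
{code}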
[jira] [Created] (SPARK-41972) Fix flaky test in StreamingQueryStatusListenerSuite
Gengliang Wang created SPARK-41972: -- Summary: Fix flaky test in StreamingQueryStatusListenerSuite Key: SPARK-41972 URL: https://issues.apache.org/jira/browse/SPARK-41972 Project: Spark Issue Type: Task Components: Tests Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org