[jira] [Commented] (SPARK-41961) Support table-valued functions with LATERAL

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656493#comment-17656493
 ] 

Apache Spark commented on SPARK-41961:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/39479

> Support table-valued functions with LATERAL
> ---
>
> Key: SPARK-41961
> URL: https://issues.apache.org/jira/browse/SPARK-41961
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Priority: Major
>
> Support table-valued functions in LATERAL subqueries. For example:
> {{select * from t, lateral explode(array(t.c1, t.c2))}}
> Currently, this query throws a parse exception.
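A minimal sketch of the gap, assuming a SparkSession {{spark}} and a small view
{{t(c1, c2)}} created for illustration; the LATERAL form fails to parse until
this ticket lands, while the older LATERAL VIEW syntax already works:

{code:python}
# Hypothetical repro; `spark` and the view `t` are assumptions for illustration.
spark.sql("CREATE OR REPLACE TEMP VIEW t AS SELECT 1 AS c1, 2 AS c2")

# Parse error today; should work once SPARK-41961 is implemented:
# spark.sql("SELECT * FROM t, LATERAL explode(array(t.c1, t.c2))").show()

# Equivalent query that already works:
spark.sql("SELECT * FROM t LATERAL VIEW explode(array(t.c1, t.c2)) AS col").show()
{code}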






[jira] [Assigned] (SPARK-41961) Support table-valued functions with LATERAL

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41961:


Assignee: Apache Spark

> Support table-valued functions with LATERAL
> ---
>
> Key: SPARK-41961
> URL: https://issues.apache.org/jira/browse/SPARK-41961
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> Support table-valued functions in LATERAL subqueries. For example:
> {{select * from t, lateral explode(array(t.c1, t.c2))}}
> Currently, this query throws a parse exception.






[jira] [Assigned] (SPARK-41961) Support table-valued functions with LATERAL

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41961:


Assignee: (was: Apache Spark)

> Support table-valued functions with LATERAL
> ---
>
> Key: SPARK-41961
> URL: https://issues.apache.org/jira/browse/SPARK-41961
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Priority: Major
>
> Support table-valued functions in LATERAL subqueries. For example:
> {{select * from t, lateral explode(array(t.c1, t.c2))}}
> Currently, this query throws a parse exception.






[jira] [Commented] (SPARK-41961) Support table-valued functions with LATERAL

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656494#comment-17656494
 ] 

Apache Spark commented on SPARK-41961:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/39479

> Support table-valued functions with LATERAL
> ---
>
> Key: SPARK-41961
> URL: https://issues.apache.org/jira/browse/SPARK-41961
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Priority: Major
>
> Support table-valued functions in LATERAL subqueries. For example:
> {{select * from t, lateral explode(array(t.c1, t.c2))}}
> Currently, this query throws a parse exception.






[jira] [Created] (SPARK-41962) Update the import order of class SpecificParquetRecordReaderBase

2023-01-10 Thread shuyouZZ (Jira)
shuyouZZ created SPARK-41962:


 Summary: Update the import order of class 
SpecificParquetRecordReaderBase
 Key: SPARK-41962
 URL: https://issues.apache.org/jira/browse/SPARK-41962
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: shuyouZZ
 Fix For: 3.4.0


There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: the 
import order of the Scala packages is incorrect.






[jira] [Updated] (SPARK-41962) Update the import order of scala package in class SpecificParquetRecordReaderBase

2023-01-10 Thread shuyouZZ (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shuyouZZ updated SPARK-41962:
-
Summary: Update the import order of scala package in class 
SpecificParquetRecordReaderBase  (was: Update the import order of class 
SpecificParquetRecordReaderBase)

> Update the import order of scala package in class 
> SpecificParquetRecordReaderBase
> -
>
> Key: SPARK-41962
> URL: https://issues.apache.org/jira/browse/SPARK-41962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: shuyouZZ
>Priority: Major
> Fix For: 3.4.0
>
>
> There is a checkstyle issue in class {{SpecificParquetRecordReaderBase}}: 
> the import order of the Scala packages is incorrect.






[jira] [Assigned] (SPARK-41960) Assign name to _LEGACY_ERROR_TEMP_1056

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41960:


Assignee: Apache Spark

> Assign name to _LEGACY_ERROR_TEMP_1056
> --
>
> Key: SPARK-41960
> URL: https://issues.apache.org/jira/browse/SPARK-41960
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Assign name to _LEGACY_ERROR_TEMP_1056






[jira] [Assigned] (SPARK-41960) Assign name to _LEGACY_ERROR_TEMP_1056

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41960:


Assignee: (was: Apache Spark)

> Assign name to _LEGACY_ERROR_TEMP_1056
> --
>
> Key: SPARK-41960
> URL: https://issues.apache.org/jira/browse/SPARK-41960
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Assign name to _LEGACY_ERROR_TEMP_1056






[jira] [Commented] (SPARK-41960) Assign name to _LEGACY_ERROR_TEMP_1056

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656498#comment-17656498
 ] 

Apache Spark commented on SPARK-41960:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39480

> Assign name to _LEGACY_ERROR_TEMP_1056
> --
>
> Key: SPARK-41960
> URL: https://issues.apache.org/jira/browse/SPARK-41960
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Assign name to _LEGACY_ERROR_TEMP_1056






[jira] [Assigned] (SPARK-41872) Fix DataFrame createDataframe handling of None

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41872:


Assignee: Ruifeng Zheng

> Fix DataFrame createDataframe handling of None
> --
>
> Key: SPARK-41872
> URL: https://issues.apache.org/jira/browse/SPARK-41872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code:java}
> row = self.spark.createDataFrame([("Alice", None, None, None)], 
> schema).fillna(True).first()
> self.assertEqual(row.age, None){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 231, in test_fillna
>     self.assertEqual(row.age, None)
> AssertionError: nan != None{code}
>  
> {code:java}
> row = (
> self.spark.createDataFrame([("Alice", 10, None)], schema)
> .replace(10, 20, subset=["name", "height"])
> .first()
> )
> self.assertEqual(row.name, "Alice")
> self.assertEqual(row.age, 10)
> self.assertEqual(row.height, None) {code}
> {code:java}
> Traceback (most recent call last):   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 372, in test_replace     self.assertEqual(row.height, None) 
> AssertionError: nan != None
> {code}
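For context, a sketch of the classic-PySpark semantics the Connect path should
match: {{fillna}} with a boolean fills only boolean columns, so a None in a
non-boolean column must come back as None, not NaN. The schema below is an
assumption for illustration, not the test's exact schema:

{code:python}
# Hypothetical illustration of the expected behavior, not the actual fix.
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BooleanType)

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("active", BooleanType(), True),
])
df = spark.createDataFrame([("Alice", None, None)], schema)

row = df.fillna(True).first()
assert row.age is None     # untouched: fillna(True) targets boolean columns only
assert row.active is True  # filled
{code}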






[jira] [Resolved] (SPARK-41872) Fix DataFrame createDataframe handling of None

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41872.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39477
[https://github.com/apache/spark/pull/39477]

> Fix DataFrame createDataframe handling of None
> --
>
> Key: SPARK-41872
> URL: https://issues.apache.org/jira/browse/SPARK-41872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> row = self.spark.createDataFrame([("Alice", None, None, None)], 
> schema).fillna(True).first()
> self.assertEqual(row.age, None){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 231, in test_fillna
>     self.assertEqual(row.age, None)
> AssertionError: nan != None{code}
>  
> {code:java}
> row = (
> self.spark.createDataFrame([("Alice", 10, None)], schema)
> .replace(10, 20, subset=["name", "height"])
> .first()
> )
> self.assertEqual(row.name, "Alice")
> self.assertEqual(row.age, 10)
> self.assertEqual(row.height, None) {code}
> {code:java}
> Traceback (most recent call last):   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 372, in test_replace     self.assertEqual(row.height, None) 
> AssertionError: nan != None
> {code}






[jira] [Assigned] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41958:


Assignee: wuyi

> Disallow arbitrary custom classpath with proxy user in cluster mode
> ---
>
> Key: SPARK-41958
> URL: https://issues.apache.org/jira/browse/SPARK-41958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> To avoid arbitrary classpaths in the Spark cluster.






[jira] [Resolved] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41958.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39474
[https://github.com/apache/spark/pull/39474]

> Disallow arbitrary custom classpath with proxy user in cluster mode
> ---
>
> Key: SPARK-41958
> URL: https://issues.apache.org/jira/browse/SPARK-41958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.4.0
>
>
> To avoid arbitrary classpaths in the Spark cluster.






[jira] [Commented] (SPARK-41886) `DataFrame.intersect` doctest output has different order

2023-01-10 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656529#comment-17656529
 ] 

jiaan.geng commented on SPARK-41886:


I want to take a look!

> `DataFrame.intersect` doctest output has different order
> 
>
> Key: SPARK-41886
> URL: https://issues.apache.org/jira/browse/SPARK-41886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Not sure whether this needs to be fixed:
> {code:java}
> File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
> line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect
> Failed example:
> df1.intersect(df2).show()
> Expected:
> +---+---+
> | C1| C2|
> +---+---+
> |  b|  3|
> |  a|  1|
> +---+---+
> Got:
> +---+---+
> | C1| C2|
> +---+---+
> |  a|  1|
> |  b|  3|
> +---+---+
> 
> **
>1 of   3 in pyspark.sql.connect.dataframe.DataFrame.intersect
> {code}
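If the row order itself is not worth guaranteeing, the usual doctest remedy is
to impose one. A sketch, reusing the doctest's df1/df2:

{code:python}
# Hypothetical doctest rewrite: sort before show so the output is deterministic.
df1.intersect(df2).sort("C1", "C2").show()
# +---+---+
# | C1| C2|
# +---+---+
# |  a|  1|
# |  b|  3|
# +---+---+
{code}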






[jira] [Created] (SPARK-41963) Different exception in DataFrame.unpivot

2023-01-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41963:


 Summary: Different exception in DataFrame.unpivot
 Key: SPARK-41963
 URL: https://issues.apache.org/jira/browse/SPARK-41963
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} 
fails as below:

{code}
with self.subTest(desc="with no value columns"):
for values in [[], ()]:
with self.subTest(values=values):
with self.assertRaisesRegex(
Exception,  # (AnalysisException, SparkConnectException)
r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one 
value column "
r"needs to be specified for UNPIVOT, all columns 
specified as ids.*",
):
>   df.unpivot("id", values, "var", "val").collect()
E   AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At 
least one value column needs to be specified for UNPIVOT, all columns specified 
as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] Unpivot value 
columns must share a least common type, some types do not: ["BIGINT" (`int`), 
"DOUBLE" (`double`), "STRING" (`str`)]
E   Plan: 'Unpivot ArraySeq(id#2947L), 
List(List(int#2948L), List(double#2949), List(str#2950)), var, [val]
E   +- Project [id#2939L AS id#2947L, int#2940L AS 
int#2948L, double#2941 AS double#2949, str#2942 AS str#2950]
E  +- LocalRelation [id#2939L, int#2940L, double#2941, 
str#2942]
E   "
{code}
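For reference, a sketch separating the two failure modes the test conflates,
assuming a SparkSession {{spark}}; column names mirror the test's DataFrame:

{code:python}
# Hypothetical illustration, not the actual test code.
df = spark.createDataFrame([(1, 10, 1.0, "one")], ["id", "int", "double", "str"])

# No value columns: classic PySpark raises UNPIVOT_REQUIRES_VALUE_COLUMNS.
# df.unpivot("id", [], "var", "val").collect()

# Mixed value-column types (BIGINT, DOUBLE, STRING): raises
# UNPIVOT_VALUE_DATA_TYPE_MISMATCH, the error the Connect run reports instead.
# df.unpivot("id", ["int", "double", "str"], "var", "val").collect()

# Casting the value columns to a common type sidesteps the type mismatch:
from pyspark.sql.functions import col
ok = df.select(
    "id",
    col("int").cast("double").alias("int"),
    col("double"),
    col("str").cast("double").alias("str"),  # non-numeric strings become null
)
ok.unpivot("id", ["int", "double", "str"], "var", "val").show()
{code}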






[jira] [Updated] (SPARK-41963) Different exception message in DataFrame.unpivot

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41963:
-
Summary: Different exception message in DataFrame.unpivot  (was: Different 
exception in DataFrame.unpivot)

> Different exception message in DataFrame.unpivot
> 
>
> Key: SPARK-41963
> URL: https://issues.apache.org/jira/browse/SPARK-41963
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} 
> fails as below:
> {code}
> with self.subTest(desc="with no value columns"):
> for values in [[], ()]:
> with self.subTest(values=values):
> with self.assertRaisesRegex(
> Exception,  # (AnalysisException, 
> SparkConnectException)
> r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one 
> value column "
> r"needs to be specified for UNPIVOT, all columns 
> specified as ids.*",
> ):
> >   df.unpivot("id", values, "var", "val").collect()
> E   AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] 
> At least one value column needs to be specified for UNPIVOT, all columns 
> specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] 
> Unpivot value columns must share a least common type, some types do not: 
> ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)]
> E   Plan: 'Unpivot ArraySeq(id#2947L), 
> List(List(int#2948L), List(double#2949), List(str#2950)), var, [val]
> E   +- Project [id#2939L AS id#2947L, int#2940L AS 
> int#2948L, double#2941 AS double#2949, str#2942 AS str#2950]
> E  +- LocalRelation [id#2939L, int#2940L, 
> double#2941, str#2942]
> E   "
> {code}






[jira] [Assigned] (SPARK-41877) SparkSession.createDataFrame error parity

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41877:


Assignee: Apache Spark

> SparkSession.createDataFrame error parity
> -
>
> Key: SPARK-41877
> URL: https://issues.apache.org/jira/browse/SPARK-41877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (1, 10, 1.0, "one"),
> (2, 20, 2.0, "two"),
> (3, 30, 3.0, "three"),
> ],
> ["id", "int", "double", "str"],
> )
> with self.subTest(desc="with none identifier"):
> with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> df.unpivot(None, ["int", "double"], "var", "val"){code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 575, in test_unpivot
>     with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> AssertionError: AssertionError not raised{code}
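The parity gap here is client-side input validation: classic PySpark asserts on
ids before building a plan, while the Connect client did not yet. A hypothetical
sketch of the kind of guard that restores parity (names are illustrative, not
the actual Connect code):

{code:python}
# Illustrative only; mimics the classic-PySpark guard.
def unpivot(ids, values, variable_col, value_col):
    assert ids is not None, "ids must not be None"
    return ids, values, variable_col, value_col

try:
    unpivot(None, ["int", "double"], "var", "val")
except AssertionError as e:
    print(e)  # ids must not be None
{code}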






[jira] [Assigned] (SPARK-41877) SparkSession.createDataFrame error parity

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41877:


Assignee: (was: Apache Spark)

> SparkSession.createDataFrame error parity
> -
>
> Key: SPARK-41877
> URL: https://issues.apache.org/jira/browse/SPARK-41877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (1, 10, 1.0, "one"),
> (2, 20, 2.0, "two"),
> (3, 30, 3.0, "three"),
> ],
> ["id", "int", "double", "str"],
> )
> with self.subTest(desc="with none identifier"):
> with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> df.unpivot(None, ["int", "double"], "var", "val"){code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 575, in test_unpivot
>     with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> AssertionError: AssertionError not raised{code}






[jira] [Commented] (SPARK-41877) SparkSession.createDataFrame error parity

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656541#comment-17656541
 ] 

Apache Spark commented on SPARK-41877:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39482

> SparkSession.createDataFrame error parity
> -
>
> Key: SPARK-41877
> URL: https://issues.apache.org/jira/browse/SPARK-41877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (1, 10, 1.0, "one"),
> (2, 20, 2.0, "two"),
> (3, 30, 3.0, "three"),
> ],
> ["id", "int", "double", "str"],
> )
> with self.subTest(desc="with none identifier"):
> with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> df.unpivot(None, ["int", "double"], "var", "val"){code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 575, in test_unpivot
>     with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> AssertionError: AssertionError not raised{code}






[jira] [Commented] (SPARK-41877) SparkSession.createDataFrame error parity

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656543#comment-17656543
 ] 

Apache Spark commented on SPARK-41877:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39482

> SparkSession.createDataFrame error parity
> -
>
> Key: SPARK-41877
> URL: https://issues.apache.org/jira/browse/SPARK-41877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (1, 10, 1.0, "one"),
> (2, 20, 2.0, "two"),
> (3, 30, 3.0, "three"),
> ],
> ["id", "int", "double", "str"],
> )
> with self.subTest(desc="with none identifier"):
> with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> df.unpivot(None, ["int", "double"], "var", "val"){code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 575, in test_unpivot
>     with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> AssertionError: AssertionError not raised{code}






[jira] [Created] (SPARK-41964) Add the unsupported function list

2023-01-10 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41964:
-

 Summary: Add the unsupported function list
 Key: SPARK-41964
 URL: https://issues.apache.org/jira/browse/SPARK-41964
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng









[jira] [Assigned] (SPARK-41886) `DataFrame.intersect` doctest output has different order

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41886:


Assignee: Apache Spark

> `DataFrame.intersect` doctest output has different order
> 
>
> Key: SPARK-41886
> URL: https://issues.apache.org/jira/browse/SPARK-41886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>
> Not sure whether this needs to be fixed:
> {code:java}
> File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
> line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect
> Failed example:
> df1.intersect(df2).show()
> Expected:
> +---+---+
> | C1| C2|
> +---+---+
> |  b|  3|
> |  a|  1|
> +---+---+
> Got:
> +---+---+
> | C1| C2|
> +---+---+
> |  a|  1|
> |  b|  3|
> +---+---+
> 
> **
>1 of   3 in pyspark.sql.connect.dataframe.DataFrame.intersect
> {code}






[jira] [Commented] (SPARK-41886) `DataFrame.intersect` doctest output has different order

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656581#comment-17656581
 ] 

Apache Spark commented on SPARK-41886:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39483

> `DataFrame.intersect` doctest output has different order
> 
>
> Key: SPARK-41886
> URL: https://issues.apache.org/jira/browse/SPARK-41886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Not sure whether this needs to be fixed:
> {code:java}
> File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
> line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect
> Failed example:
> df1.intersect(df2).show()
> Expected:
> +---+---+
> | C1| C2|
> +---+---+
> |  b|  3|
> |  a|  1|
> +---+---+
> Got:
> +---+---+
> | C1| C2|
> +---+---+
> |  a|  1|
> |  b|  3|
> +---+---+
> 
> **
>1 of   3 in pyspark.sql.connect.dataframe.DataFrame.intersect
> {code}






[jira] [Assigned] (SPARK-41886) `DataFrame.intersect` doctest output has different order

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41886:


Assignee: (was: Apache Spark)

> `DataFrame.intersect` doctest output has different order
> 
>
> Key: SPARK-41886
> URL: https://issues.apache.org/jira/browse/SPARK-41886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Not sure whether this needs to be fixed:
> {code:java}
> File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
> line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect
> Failed example:
> df1.intersect(df2).show()
> Expected:
> +---+---+
> | C1| C2|
> +---+---+
> |  b|  3|
> |  a|  1|
> +---+---+
> Got:
> +---+---+
> | C1| C2|
> +---+---+
> |  a|  1|
> |  b|  3|
> +---+---+
> 
> **
>1 of   3 in pyspark.sql.connect.dataframe.DataFrame.intersect
> {code}






[jira] (SPARK-41886) `DataFrame.intersect` doctest output has different order

2023-01-10 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41886 ]


jiaan.geng deleted comment on SPARK-41886:


was (Author: beliefer):
I want to take a look!

> `DataFrame.intersect` doctest output has different order
> 
>
> Key: SPARK-41886
> URL: https://issues.apache.org/jira/browse/SPARK-41886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Not sure whether this needs to be fixed:
> {code:java}
> File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
> line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect
> Failed example:
> df1.intersect(df2).show()
> Expected:
> +---+---+
> | C1| C2|
> +---+---+
> |  b|  3|
> |  a|  1|
> +---+---+
> Got:
> +---+---+
> | C1| C2|
> +---+---+
> |  a|  1|
> |  b|  3|
> +---+---+
> 
> **
>1 of   3 in pyspark.sql.connect.dataframe.DataFrame.intersect
> {code}






[jira] [Assigned] (SPARK-41964) Add the unsupported function list

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41964:


Assignee: (was: Apache Spark)

> Add the unsupported function list
> -
>
> Key: SPARK-41964
> URL: https://issues.apache.org/jira/browse/SPARK-41964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-41964) Add the unsupported function list

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656592#comment-17656592
 ] 

Apache Spark commented on SPARK-41964:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39484

> Add the unsupported function list
> -
>
> Key: SPARK-41964
> URL: https://issues.apache.org/jira/browse/SPARK-41964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-41964) Add the unsupported function list

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41964:


Assignee: Apache Spark

> Add the unsupported function list
> -
>
> Key: SPARK-41964
> URL: https://issues.apache.org/jira/browse/SPARK-41964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41965:
-

 Summary: Add DataFrameWriterV2 to PySpark API references
 Key: SPARK-41965
 URL: https://issues.apache.org/jira/browse/SPARK-41965
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng









[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41965:


Assignee: Apache Spark

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41965:


Assignee: (was: Apache Spark)

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656599#comment-17656599
 ] 

Apache Spark commented on SPARK-41965:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39485

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656598#comment-17656598
 ] 

Apache Spark commented on SPARK-41965:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39485

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Created] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references

2023-01-10 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41966:
-

 Summary: Add `CharType` and `TimestampNTZType` to PySpark API 
references
 Key: SPARK-41966
 URL: https://issues.apache.org/jira/browse/SPARK-41966
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng









[jira] [Assigned] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41966:


Assignee: Apache Spark

> Add `CharType` and `TimestampNTZType` to PySpark API references
> ---
>
> Key: SPARK-41966
> URL: https://issues.apache.org/jira/browse/SPARK-41966
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656601#comment-17656601
 ] 

Apache Spark commented on SPARK-41966:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39486

> Add `CharType` and `TimestampNTZType` to PySpark API references
> ---
>
> Key: SPARK-41966
> URL: https://issues.apache.org/jira/browse/SPARK-41966
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41966:


Assignee: (was: Apache Spark)

> Add `CharType` and `TimestampNTZType` to PySpark API references
> ---
>
> Key: SPARK-41966
> URL: https://issues.apache.org/jira/browse/SPARK-41966
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41966:


Assignee: Apache Spark

> Add `CharType` and `TimestampNTZType` to PySpark API references
> ---
>
> Key: SPARK-41966
> URL: https://issues.apache.org/jira/browse/SPARK-41966
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Resolved] (SPARK-41907) Function `sampleby` return parity

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41907.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39476
[https://github.com/apache/spark/pull/39476]

> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}
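A note for readers of this test: sampleBy draws a seed-dependent stratified
sample, so the asserted count of 35 encodes the exact RNG behavior; return
parity means both clients must produce the same draw for the same seed. A
sketch of the semantics, assuming a local {{spark}}:

{code:python}
# Hypothetical illustration: fractions are per-stratum sampling probabilities,
# so roughly half of each b-stratum's ~33 rows is expected, not an exact count.
from pyspark.sql import Row

df = spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
print(sampled.count())  # deterministic for a fixed seed, but seed-specific
{code}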






[jira] [Assigned] (SPARK-41907) Function `sampleby` return parity

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41907:
-

Assignee: jiaan.geng

> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: jiaan.geng
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}






[jira] [Assigned] (SPARK-41752) UI improvement for nested SQL executions

2023-01-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41752:
---

Assignee: Linhong Liu

> UI improvement for nested SQL executions
> 
>
> Key: SPARK-41752
> URL: https://issues.apache.org/jira/browse/SPARK-41752
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
>
> In SPARK-41713, CTAS triggers a sub-execution to perform the data insertion,
> but the UI displays them as two independent queries, which confuses users.






[jira] [Resolved] (SPARK-41752) UI improvement for nested SQL executions

2023-01-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41752.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39268
[https://github.com/apache/spark/pull/39268]

> UI improvement for nested SQL executions
> 
>
> Key: SPARK-41752
> URL: https://issues.apache.org/jira/browse/SPARK-41752
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.4.0
>
>
> In SPARK-41713, CTAS triggers a sub-execution to perform the data insertion,
> but the UI displays them as two independent queries, which confuses users.






[jira] [Assigned] (SPARK-41949) Make stage scheduling support local-cluster mode

2023-01-10 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-41949:
--

Assignee: Weichen Xu

> Make stage scheduling support local-cluster mode
> 
>
> Key: SPARK-41949
> URL: https://issues.apache.org/jira/browse/SPARK-41949
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Make stage scheduling support local-cluster mode.
> This is useful for testing, especially for third-party Python libraries that
> depend on PySpark: many of their tests are written with pytest, which is hard
> to integrate with a standalone Spark cluster.
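A sketch of what this enables once the change lands; the resource amounts and
cluster sizing below are illustrative assumptions:

{code:python}
# Hypothetical pytest-style snippet combining local-cluster mode with
# stage-level scheduling.
from pyspark.sql import SparkSession
from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

spark = (SparkSession.builder
         .master("local-cluster[2, 1, 1024]")  # 2 workers, 1 core, 1024 MB each
         .getOrCreate())

# Stage-level scheduling: request 1 CPU per task for this stage.
rp = ResourceProfileBuilder().require(TaskResourceRequests().cpus(1)).build
rdd = spark.sparkContext.parallelize(range(8), 4).withResources(rp)
print(rdd.map(lambda x: x * x).collect())
spark.stop()
{code}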






[jira] [Resolved] (SPARK-41949) Make stage scheduling support local-cluster mode

2023-01-10 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-41949.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39424
[https://github.com/apache/spark/pull/39424]

> Make stage scheduling support local-cluster mode
> 
>
> Key: SPARK-41949
> URL: https://issues.apache.org/jira/browse/SPARK-41949
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.4.0
>
>
> Make stage scheduling support local-cluster mode.
> This is useful for testing, especially for third-party Python libraries that
> depend on PySpark: many of their tests are written with pytest, which is hard
> to integrate with a standalone Spark cluster.






[jira] [Assigned] (SPARK-41575) Assign name to _LEGACY_ERROR_TEMP_2054

2023-01-10 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41575:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2054
> --
>
> Key: SPARK-41575
> URL: https://issues.apache.org/jira/browse/SPARK-41575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two
> people working on the same ticket at the same time.






[jira] [Resolved] (SPARK-41575) Assign name to _LEGACY_ERROR_TEMP_2054

2023-01-10 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41575.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39394
[https://github.com/apache/spark/pull/39394]

> Assign name to _LEGACY_ERROR_TEMP_2054
> --
>
> Key: SPARK-41575
> URL: https://issues.apache.org/jira/browse/SPARK-41575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We should use a proper error class name rather than `_LEGACY_ERROR_TEMP_xxx`.
>  
> *NOTE:* Please reply to this ticket before starting work on it, to avoid two
> people working on the same ticket at the same time.






[jira] [Created] (SPARK-41967) SBT unable to resolve particular packages from the imported maven build

2023-01-10 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-41967:


 Summary: SBT unable to resolve particular packages from the 
imported maven build
 Key: SPARK-41967
 URL: https://issues.apache.org/jira/browse/SPARK-41967
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.4.0
Reporter: Venkata Sai Akhil Gudesa


For an unknown reason, SBT fails to resolve particular packages from the 
imported Maven build. This affects Spark-Connect-related projects (see 
[here|https://github.com/apache/spark/blob/6cae6aa5156655c79eb3f20292ccec6c479c3b1b/project/SparkBuild.scala#L667-L668]
 and 
[here|https://github.com/apache/spark/blob/6cae6aa5156655c79eb3f20292ccec6c479c3b1b/project/SparkBuild.scala#L902-L904]
 for example) by forcing duplicate dependency declarations.

The Maven build works fine when the affected dependency (guava, for example) is 
removed, but the SBT build then fails. Thus, we are forced to mention the 
versions of the affected packages explicitly so that SBT can parse the 
version(s) and include them manually (they are also added as dependencies in 
Maven to keep the versions consistent with SBT).






[jira] [Assigned] (SPARK-41822) Setup Scala/JVM Client Connection

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41822:
-

Assignee: Venkata Sai Akhil Gudesa

> Setup Scala/JVM Client Connection
> -
>
> Key: SPARK-41822
> URL: https://issues.apache.org/jira/browse/SPARK-41822
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
>
> Set up the gRPC connection for the Scala/JVM client to enable communication 
> with the Spark Connect server. 
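For comparison, the Python client already connects this way; presumably the
Scala/JVM client will mirror the same sc:// connection-string shape (an
assumption, not this ticket's API):

{code:python}
# PySpark 3.4 Connect client over gRPC; host and port are assumptions
# (15002 is the default Spark Connect server port).
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(3).show()
{code}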






[jira] [Resolved] (SPARK-41822) Setup Scala/JVM Client Connection

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41822.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39361
[https://github.com/apache/spark/pull/39361]

> Setup Scala/JVM Client Connection
> -
>
> Key: SPARK-41822
> URL: https://issues.apache.org/jira/browse/SPARK-41822
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.4.0
>
>
> Set up the gRPC connection for the Scala/JVM client to enable communication 
> with the Spark Connect server. 






[jira] [Resolved] (SPARK-41877) SparkSession.createDataFrame error parity

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41877.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39482
[https://github.com/apache/spark/pull/39482]

> SparkSession.createDataFrame error parity
> -
>
> Key: SPARK-41877
> URL: https://issues.apache.org/jira/browse/SPARK-41877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (1, 10, 1.0, "one"),
> (2, 20, 2.0, "two"),
> (3, 30, 3.0, "three"),
> ],
> ["id", "int", "double", "str"],
> )
> with self.subTest(desc="with none identifier"):
> with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> df.unpivot(None, ["int", "double"], "var", "val"){code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 575, in test_unpivot
>     with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> AssertionError: AssertionError not raised{code}






[jira] [Assigned] (SPARK-41877) SparkSession.createDataFrame error parity

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41877:
-

Assignee: Hyukjin Kwon

> SparkSession.createDataFrame error parity
> -
>
> Key: SPARK-41877
> URL: https://issues.apache.org/jira/browse/SPARK-41877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (1, 10, 1.0, "one"),
> (2, 20, 2.0, "two"),
> (3, 30, 3.0, "three"),
> ],
> ["id", "int", "double", "str"],
> )
> with self.subTest(desc="with none identifier"):
> with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> df.unpivot(None, ["int", "double"], "var", "val"){code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 575, in test_unpivot
>     with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> AssertionError: AssertionError not raised{code}






[jira] [Resolved] (SPARK-41886) `DataFrame.intersect` doctest output has different order

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41886.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39483
[https://github.com/apache/spark/pull/39483]

> `DataFrame.intersect` doctest output has different order
> 
>
> Key: SPARK-41886
> URL: https://issues.apache.org/jira/browse/SPARK-41886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> not sure whether this needs to be fixed:
> {code:java}
> File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
> line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect
> Failed example:
> df1.intersect(df2).show()
> Expected:
> +---+---+
> | C1| C2|
> +---+---+
> |  b|  3|
> |  a|  1|
> +---+---+
> Got:
> +---+---+
> | C1| C2|
> +---+---+
> |  a|  1|
> |  b|  3|
> +---+---+
> 
> **
>1 of   3 in pyspark.sql.connect.dataframe.DataFrame.intersect
> {code}
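
A common way to make such a doctest deterministic is to impose an explicit ordering before printing, since {{intersect}} gives no ordering guarantee. A minimal PySpark sketch, assuming data like the docstring's {{df1}}/{{df2}} and an active {{spark}} session:

{code:java}
# intersect() does not guarantee row order, so sort by the output
# columns before show() to keep the doctest output stable.
df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
df1.intersect(df2).sort("C1", "C2").show()
# +---+---+
# | C1| C2|
# +---+---+
# |  a|  1|
# |  b|  3|
# +---+---+
{code}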



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41886) `DataFrame.intersect` doctest output has different order

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41886:
-

Assignee: jiaan.geng

> `DataFrame.intersect` doctest output has different order
> 
>
> Key: SPARK-41886
> URL: https://issues.apache.org/jira/browse/SPARK-41886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> not sure whether this needs to be fixed:
> {code:java}
> File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", 
> line 609, in pyspark.sql.connect.dataframe.DataFrame.intersect
> Failed example:
> df1.intersect(df2).show()
> Expected:
> +---+---+
> | C1| C2|
> +---+---+
> |  b|  3|
> |  a|  1|
> +---+---+
> Got:
> +---+---+
> | C1| C2|
> +---+---+
> |  a|  1|
> |  b|  3|
> +---+---+
> 
> **
>1 of   3 in pyspark.sql.connect.dataframe.DataFrame.intersect
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]

2023-01-10 Thread Yang Jie (Jira)
Yang Jie created SPARK-41968:


 Summary: Refactor ProtobufSerDe to ProtobufSerDe[T]
 Key: SPARK-41968
 URL: https://issues.apache.org/jira/browse/SPARK-41968
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 3.4.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41968:


Assignee: Apache Spark

> Refactor ProtobufSerDe to ProtobufSerDe[T]
> --
>
> Key: SPARK-41968
> URL: https://issues.apache.org/jira/browse/SPARK-41968
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656701#comment-17656701
 ] 

Apache Spark commented on SPARK-41968:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39487

> Refactor ProtobufSerDe to ProtobufSerDe[T]
> --
>
> Key: SPARK-41968
> URL: https://issues.apache.org/jira/browse/SPARK-41968
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656700#comment-17656700
 ] 

Apache Spark commented on SPARK-41968:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39487

> Refactor ProtobufSerDe to ProtobufSerDe[T]
> --
>
> Key: SPARK-41968
> URL: https://issues.apache.org/jira/browse/SPARK-41968
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41968) Refactor ProtobufSerDe to ProtobufSerDe[T]

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41968:


Assignee: (was: Apache Spark)

> Refactor ProtobufSerDe to ProtobufSerDe[T]
> --
>
> Key: SPARK-41968
> URL: https://issues.apache.org/jira/browse/SPARK-41968
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on

2023-01-10 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated SPARK-40588:

Labels: correctness  (was: )

> Sorting issue with partitioned-writing and AQE turned on
> 
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Assignee: Enrico Minack
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.3, 3.3.2
>
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition; however, we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) 
> fixes the issue.
> I'm working on identifying why AQE affects the sort order. Any leads or 
> thoughts would be appreciated!
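
For reference, a minimal sketch of the workaround mentioned above, in PySpark syntax and assuming the same {{df}} and {{path}} as in the snippet (note this disables AQE for the whole session, not just this write):

{code:java}
# Workaround sketch: disable AQE before the partitioned, sorted write
# so that the sortWithinPartitions() ordering is preserved in the files.
spark.conf.set("spark.sql.adaptive.enabled", False)

(df.repartition("day", "month", "year")
   .sortWithinPartitions("year", "month", "day", "sortCol")
   .write
   .partitionBy("year", "month", "day")
   .parquet(path))
{code}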



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on

2023-01-10 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656705#comment-17656705
 ] 

Erik Krogen commented on SPARK-40588:
-

Labeling with 'correctness' since this breaks the correctness of the output by 
violating the sort ordering.

> Sorting issue with partitioned-writing and AQE turned on
> 
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Assignee: Enrico Minack
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.3, 3.3.2
>
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition; however, we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) 
> fixes the issue.
> I'm working on identifying why AQE affects the sort order. Any leads or 
> thoughts would be appreciated!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41969) Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries

2023-01-10 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-41969:
-

 Summary: Flaky Test: StreamingQueryStatusListenerSuite.test small 
retained queries
 Key: SPARK-41969
 URL: https://issues.apache.org/jira/browse/SPARK-41969
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun


I saw these failures on the master branch frequently.

https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41969) Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries

2023-01-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41969:
--
Description: 
I saw these failures on the master branch frequently.

https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948
https://github.com/apache/spark/runs/10556299549



  was:
I saw these failures on the master branch frequently.

https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948




> Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries
> -
>
> Key: SPARK-41969
> URL: https://issues.apache.org/jira/browse/SPARK-41969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> I saw these failures on the master branch frequently.
> https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948
> https://github.com/apache/spark/runs/10556299549



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41969) Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries

2023-01-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41969:
--
Description: 
I saw these failures on the master branch frequently.

https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948
https://github.com/apache/spark/runs/10556299549
https://github.com/apache/spark/runs/10551101022



  was:
I saw these failures on the master branch frequently.

https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948
https://github.com/apache/spark/runs/10556299549




> Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries
> -
>
> Key: SPARK-41969
> URL: https://issues.apache.org/jira/browse/SPARK-41969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> I saw these failures on the master branch frequently.
> https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948
> https://github.com/apache/spark/runs/10556299549
> https://github.com/apache/spark/runs/10551101022



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41969) Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries

2023-01-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41969:
--
Description: 
I saw these failures on the master branch frequently.

https://github.com/apache/spark/runs/10560025461
https://github.com/apache/spark/runs/10556299549
https://github.com/apache/spark/runs/10551101022



  was:
I saw these failures on the master branch frequently.

https://github.com/apache/spark/actions/runs/3884439263/jobs/6631404948
https://github.com/apache/spark/runs/10556299549
https://github.com/apache/spark/runs/10551101022




> Flaky Test: StreamingQueryStatusListenerSuite.test small retained queries
> -
>
> Key: SPARK-41969
> URL: https://issues.apache.org/jira/browse/SPARK-41969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> I saw these failures on the master branch frequently.
> https://github.com/apache/spark/runs/10560025461
> https://github.com/apache/spark/runs/10556299549
> https://github.com/apache/spark/runs/10551101022



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true

2023-01-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38173:
--
Fix Version/s: 3.2.4

> Quoted column cannot be recognized correctly when quotedRegexColumnNames is 
> true
> 
>
> Key: SPARK-38173
> URL: https://issues.apache.org/jira/browse/SPARK-38173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Tongwei
>Assignee: Tongwei
>Priority: Major
> Fix For: 3.3.0, 3.2.4
>
>
> When spark.sql.parser.quotedRegexColumnNames=true
> {code:java}
>  SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code}
> The above query will throw an exception
> {code:java}
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
> 'multiply'
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266)
>         at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in 
> expression 'multiply'
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656)
>  {code}
> It works fine in Hive
> {code:java}
> 0: jdbc:hive2://hiveserver-inc.> set hive.support.quoted.identifiers=

[jira] [Created] (SPARK-41970) SparkPath

2023-01-10 Thread David Lewis (Jira)
David Lewis created SPARK-41970:
---

 Summary: SparkPath
 Key: SPARK-41970
 URL: https://issues.apache.org/jira/browse/SPARK-41970
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: David Lewis


Today, Spark represents file paths in various ways. Sometimes they are Hadoop 
`Path`s, sometimes they are `Path.toString`s, and sometimes they are 
`Path.toUri.toString`s.

This discrepancy means that Spark does not always work when user-provided 
strings have special characters. Sometimes Spark will try to create a URI with 
an unescaped string; sometimes Spark will double-escape a path and try to 
access the wrong file.

 

This issue proposes a new `SparkPath` class meant to provide type safety when 
Spark is dealing with paths.
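
The ambiguity described above can be illustrated with plain Python, independent of Spark (this demonstrates only the failure mode, not Spark's actual path handling):

{code:java}
from urllib.parse import quote, unquote

# A user-provided path containing a space and a literal '%20'.
raw = "/data/my dir/a%20b/part-0"

escaped_once = quote(raw)            # correct if 'raw' was an unescaped string
escaped_twice = quote(escaped_once)  # double-escaping: every '%' becomes '%25'

print(escaped_once)                   # /data/my%20dir/a%2520b/part-0
print(unquote(escaped_once) == raw)   # True  -> round-trips to the right file
print(unquote(escaped_twice) == raw)  # False -> points at the wrong file
{code}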



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41970) SparkPath

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656986#comment-17656986
 ] 

Apache Spark commented on SPARK-41970:
--

User 'databricks-david-lewis' has created a pull request for this issue:
https://github.com/apache/spark/pull/39488

> SparkPath
> -
>
> Key: SPARK-41970
> URL: https://issues.apache.org/jira/browse/SPARK-41970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: David Lewis
>Priority: Major
>
> Today, Spark represents file paths in various ways. Sometimes they are Hadoop 
> `Path`s, sometimes they are `Path.toString`s, and sometimes they are 
> `Path.toUri.toString`s.
> This discrepancy means that Spark does not always work when user-provided 
> strings have special characters. Sometimes Spark will try to create a URI 
> with an unescaped string; sometimes Spark will double-escape a path and try 
> to access the wrong file.
>  
> This issue proposes a new `SparkPath` class meant to provide type safety when 
> Spark is dealing with paths.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41970) SparkPath

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41970:


Assignee: (was: Apache Spark)

> SparkPath
> -
>
> Key: SPARK-41970
> URL: https://issues.apache.org/jira/browse/SPARK-41970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: David Lewis
>Priority: Major
>
> Today, Spark represents file paths in various ways. Sometimes they are Hadoop 
> `Path`s, sometimes they are `Path.toString`s, and sometimes they are 
> `Path.toUri.toString`s.
> This discrepancy means that Spark does not always work when user-provided 
> strings have special characters. Sometimes Spark will try to create a URI 
> with an unescaped string; sometimes Spark will double-escape a path and try 
> to access the wrong file.
>  
> This issue proposes a new `SparkPath` class meant to provide type safety when 
> Spark is dealing with paths.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41970) SparkPath

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41970:


Assignee: Apache Spark

> SparkPath
> -
>
> Key: SPARK-41970
> URL: https://issues.apache.org/jira/browse/SPARK-41970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: David Lewis
>Assignee: Apache Spark
>Priority: Major
>
> Today, Spark represents file paths in various ways. Sometimes they are Hadoop 
> `Path`s, sometimes they are `Path.toString`s, and sometimes they are 
> `Path.toUri.toString`s.
> This discrepancy means that Spark does not always work when user-provided 
> strings have special characters. Sometimes Spark will try to create a URI 
> with an unescaped string; sometimes Spark will double-escape a path and try 
> to access the wrong file.
>  
> This issue proposes a new `SparkPath` class meant to provide type safety when 
> Spark is dealing with paths.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41970) SparkPath

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656988#comment-17656988
 ] 

Apache Spark commented on SPARK-41970:
--

User 'databricks-david-lewis' has created a pull request for this issue:
https://github.com/apache/spark/pull/39488

> SparkPath
> -
>
> Key: SPARK-41970
> URL: https://issues.apache.org/jira/browse/SPARK-41970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: David Lewis
>Priority: Major
>
> Today, Spark represents file paths in various ways. Sometimes they are Hadoop 
> `Path`s, sometimes they are `Path.toString`s, and sometimes they are 
> `Path.toUri.toString`s.
> This discrepancy means that Spark does not always work when user-provided 
> strings have special characters. Sometimes Spark will try to create a URI 
> with an unescaped string; sometimes Spark will double-escape a path and try 
> to access the wrong file.
>  
> This issue proposes a new `SparkPath` class meant to provide type safety when 
> Spark is dealing with paths.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41752) UI improvement for nested SQL executions

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657015#comment-17657015
 ] 

Apache Spark commented on SPARK-41752:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39489

> UI improvement for nested SQL executions
> 
>
> Key: SPARK-41752
> URL: https://issues.apache.org/jira/browse/SPARK-41752
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.4.0
>
>
> In SPARK-41713, the CTAS will trigger a sub-execution to perform the data 
> insertion, but the UI will display two independent queries, which will confuse 
> users.
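
For context, a minimal sketch of the kind of statement that triggers such a sub-execution (table names here are illustrative):

{code:java}
# A CTAS issued through PySpark: the CREATE TABLE is one query, and the
# data insertion it triggers runs as a nested sub-execution.
spark.sql("CREATE TABLE t2 USING parquet AS SELECT * FROM t1")
{code}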



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41964) Add the unsupported function list

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41964:


Assignee: Ruifeng Zheng

> Add the unsupported function list
> -
>
> Key: SPARK-41964
> URL: https://issues.apache.org/jira/browse/SPARK-41964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41964) Add the unsupported function list

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41964.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39484
[https://github.com/apache/spark/pull/39484]

> Add the unsupported function list
> -
>
> Key: SPARK-41964
> URL: https://issues.apache.org/jira/browse/SPARK-41964
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41966) Add `CharType` and `TimestampNTZType` to PySpark API references

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41966.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39486
[https://github.com/apache/spark/pull/39486]

> Add `CharType` and `TimestampNTZType` to PySpark API references
> ---
>
> Key: SPARK-41966
> URL: https://issues.apache.org/jira/browse/SPARK-41966
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41965:


Assignee: Ruifeng Zheng

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41965.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39485
[https://github.com/apache/spark/pull/39485]

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41876) Implement DataFrame `toLocalIterator`

2023-01-10 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657028#comment-17657028
 ] 

jiaan.geng commented on SPARK-41876:


I will take a look!

> Implement DataFrame `toLocalIterator`
> -
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}
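
For reference, a usage sketch of the API being implemented, as it behaves in vanilla PySpark (the quoted snippet above is the test attached to the ticket):

{code:java}
# toLocalIterator() streams rows to the driver one partition at a time,
# instead of materializing the whole result at once like collect().
df = spark.range(5)
for row in df.toLocalIterator():
    print(row.id)
{code}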



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41589) PyTorch Distributor

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657029#comment-17657029
 ] 

Apache Spark commented on SPARK-41589:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39490

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] 
> for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657031#comment-17657031
 ] 

Apache Spark commented on SPARK-41887:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39491

> Support DataFrame hint parameter to be list
> ---
>
> Key: SPARK-41887
> URL: https://issues.apache.org/jira/browse/SPARK-41887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41887:


Assignee: Apache Spark

> Support DataFrame hint parameter to be list
> ---
>
> Key: SPARK-41887
> URL: https://issues.apache.org/jira/browse/SPARK-41887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657032#comment-17657032
 ] 

Apache Spark commented on SPARK-41887:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39491

> Support DataFrame hint parameter to be list
> ---
>
> Key: SPARK-41887
> URL: https://issues.apache.org/jira/browse/SPARK-41887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41887:


Assignee: (was: Apache Spark)

> Support DataFrame hint parameter to be list
> ---
>
> Key: SPARK-41887
> URL: https://issues.apache.org/jira/browse/SPARK-41887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41876) Implement DataFrame `toLocalIterator`

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657040#comment-17657040
 ] 

Apache Spark commented on SPARK-41876:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39492

> Implement DataFrame `toLocalIterator`
> -
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41876) Implement DataFrame `toLocalIterator`

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41876:


Assignee: Apache Spark

> Implement DataFrame `toLocalIterator`
> -
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-41876) Implement DataFrame `toLocalIterator`

2023-01-10 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41876 ]


jiaan.geng deleted comment on SPARK-41876:


was (Author: beliefer):
I will take a look!

> Implement DataFrame `toLocalIterator`
> -
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41876) Implement DataFrame `toLocalIterator`

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41876:


Assignee: (was: Apache Spark)

> Implement DataFrame `toLocalIterator`
> -
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41838) DataFrame.show() fix map printing

2023-01-10 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657041#comment-17657041
 ] 

jiaan.geng commented on SPARK-41838:


I want to take a look!

> DataFrame.show() fix map printing
> -
>
> Key: SPARK-41838
> URL: https://issues.apache.org/jira/browse/SPARK-41838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1472, in pyspark.sql.connect.functions.posexplode_outer
> Failed example:
>     df.select("id", "a_map", posexplode_outer("an_array")).show()
> Expected:
>     +---+--+++
>     | id|     a_map| pos| col|
>     +---+--+++
>     |  1|{x -> 1.0}|   0| foo|
>     |  1|{x -> 1.0}|   1| bar|
>     |  2|        {}|null|null|
>     |  3|      null|null|null|
>     +---+--+++
> Got:
>     +---+--+++
>     | id| a_map| pos| col|
>     +---+--+++
>     |  1| {1.0}|   0| foo|
>     |  1| {1.0}|   1| bar|
>     |  2|{null}|null|null|
>     |  3|  null|null|null|
>     +---+--+++
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-41965:
--
  Assignee: (was: Ruifeng Zheng)

Reverted in 
https://github.com/apache/spark/commit/d8ea5ee7697dc1df720d0faa3e12ccbb94a1f0f0

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41965:


Assignee: Apache Spark

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41965:
-
Fix Version/s: (was: 3.4.0)

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41965:


Assignee: (was: Apache Spark)

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on

2023-01-10 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41971:
-

 Summary: `toPandas` should support duplicate field names when 
arrow-optimization is on
 Key: SPARK-41971
 URL: https://issues.apache.org/jira/browse/SPARK-41971
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng


toPandas supports duplicate column names, but for a struct column, it does not 
support duplicate field names.

{code:java}
In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)

In [28]: spark.sql("select 1 v, 1 v").toPandas()
Out[28]: 
   v  v
0  1  1

In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
Out[29]: 
  struct(1 AS v, 1 AS v)
0 (1, 1)

In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)

In [31]: spark.sql("select 1 v, 1 v").toPandas()
Out[31]: 
   v  v
0  1  1

In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
UserWarning: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the 
error below and can not continue. Note that 
'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on 
failures in the middle of computation.
  Ran out of field metadata, likely malformed
  warn(msg)
---
ArrowInvalid  Traceback (most recent call last)
Cell In[32], line 1
> 1 spark.sql("select struct(1 v, 1 v)").toPandas()

File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
PandasConversionMixin.toPandas(self)
141 tmp_column_names = ["col_{}".format(i) for i in 
range(len(self.columns))]
142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
--> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
144 split_batches=self_destruct
145 )
146 if len(batches) > 0:
147 table = pyarrow.Table.from_batches(batches)

File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
PandasConversionMixin._collect_as_arrow(self, split_batches)
356 results.append(batch_or_indices)
357 else:
--> 358 results = list(batch_stream)
359 finally:
360 # Join serving thread and raise any exceptions from 
collectAsArrowToPython
361 jsocket_auth_server.getResult()

File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
ArrowCollectSerializer.load_stream(self, stream)
 50 """
 51 Load a stream of un-ordered Arrow RecordBatches, where the last 
iteration yields
 52 a list of indices that can be used to put the RecordBatches in the 
correct order.
 53 """
 54 # load the batches
---> 55 for batch in self.serializer.load_stream(stream):
 56 yield batch
 58 # load the batch order indices or propagate any error that occurred in 
the JVM

File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
ArrowStreamSerializer.load_stream(self, stream)
 95 import pyarrow as pa
 97 reader = pa.ipc.open_stream(stream)
---> 98 for batch in reader:
 99 yield batch

File 
~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
 in __iter__()

File 
~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
 in pyarrow.lib.RecordBatchReader.read_next_batch()

File 
~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
 in pyarrow.lib.check_status()

ArrowInvalid: Ran out of field metadata, likely malformed

{code}
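
As the transcript above shows, the non-Arrow conversion path handles this case correctly, so disabling the Arrow optimization is a possible interim workaround; a minimal sketch, assuming the slower row-based fallback is acceptable for the data size:

{code:python}
# Interim workaround sketch: route this query through the non-Arrow
# conversion path, which tolerates duplicate struct field names.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
pdf = spark.sql("select struct(1 v, 1 v)").toPandas()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)  # restore
{code}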




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41971) `toPandas` should support duplicate field names when Arrow optimization is on

2023-01-10 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657059#comment-17657059
 ] 

Ruifeng Zheng commented on SPARK-41971:
---

I think this is due to something being wrong in `ArrowConverter`.

In Spark, a schema is just a StructType, but in Arrow this is not the case: a 
schema is a class of its own, distinct from any datatype. This difference may 
be the cause.
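
A minimal sketch of the difference, assuming pyarrow is installed: Arrow models a schema as a standalone Schema class rather than a datatype, and both it and a struct type accept duplicate field names without complaint, which suggests the malformed stream is produced on the Spark serialization side:

{code:python}
import pyarrow as pa

# Unlike Spark, where the schema is just a StructType (a DataType),
# Arrow's Schema is its own class, separate from the type system.
schema = pa.schema([pa.field("v", pa.int64()), pa.field("v", pa.int64())])
struct_type = pa.struct([pa.field("v", pa.int64()), pa.field("v", pa.int64())])

# Both constructs accept two fields named "v":
print(schema.names)            # ['v', 'v']
print(struct_type.num_fields)  # 2
{code}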

> `toPandas` should support duplicate field names when Arrow optimization is on
> -
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> toPandas supports duplicate column names, but for a struct column it does 
> not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]: 
>v  v
> 0  1  1
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]: 
>   struct(1 AS v, 1 AS v)
> 0 (1, 1)
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]: 
>v  v
> 0  1  1
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---
> ArrowInvalid  Traceback (most recent call last)
> Cell In[32], line 1
> > 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
> PandasConversionMixin.toPandas(self)
> 141 tmp_column_names = ["col_{}".format(i) for i in 
> range(len(self.columns))]
> 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
> 144 split_batches=self_destruct
> 145 )
> 146 if len(batches) > 0:
> 147 table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
> PandasConversionMixin._collect_as_arrow(self, split_batches)
> 356 results.append(batch_or_indices)
> 357 else:
> --> 358 results = list(batch_stream)
> 359 finally:
> 360 # Join serving thread and raise any exceptions from 
> collectAsArrowToPython
> 361 jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
> ArrowCollectSerializer.load_stream(self, stream)
>  50 """
>  51 Load a stream of un-ordered Arrow RecordBatches, where the last 
> iteration yields
>  52 a list of indices that can be used to put the RecordBatches in the 
> correct order.
>  53 """
>  54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>  56 yield batch
>  58 # load the batch order indices or propagate any error that occurred 
> in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
> ArrowStreamSerializer.load_stream(self, stream)
>  95 import pyarrow as pa
>  97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>  99 yield batch
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
>  in __iter__()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
>  in pyarrow.lib.RecordBatchReader.read_next_batch()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
>  in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41971) `toPandas` should support duplicate field names when Arrow optimization is on

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-41971:
--
Priority: Minor  (was: Major)

> `toPandas` should support duplicate field names when Arrow optimization is on
> -
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>
> toPandas supports duplicate column names, but for a struct column it does 
> not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]: 
>v  v
> 0  1  1
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]: 
>   struct(1 AS v, 1 AS v)
> 0 (1, 1)
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]: 
>v  v
> 0  1  1
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---
> ArrowInvalid  Traceback (most recent call last)
> Cell In[32], line 1
> > 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
> PandasConversionMixin.toPandas(self)
> 141 tmp_column_names = ["col_{}".format(i) for i in 
> range(len(self.columns))]
> 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
> 144 split_batches=self_destruct
> 145 )
> 146 if len(batches) > 0:
> 147 table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
> PandasConversionMixin._collect_as_arrow(self, split_batches)
> 356 results.append(batch_or_indices)
> 357 else:
> --> 358 results = list(batch_stream)
> 359 finally:
> 360 # Join serving thread and raise any exceptions from 
> collectAsArrowToPython
> 361 jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
> ArrowCollectSerializer.load_stream(self, stream)
>  50 """
>  51 Load a stream of un-ordered Arrow RecordBatches, where the last 
> iteration yields
>  52 a list of indices that can be used to put the RecordBatches in the 
> correct order.
>  53 """
>  54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>  56 yield batch
>  58 # load the batch order indices or propagate any error that occurred 
> in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
> ArrowStreamSerializer.load_stream(self, stream)
>  95 import pyarrow as pa
>  97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>  99 yield batch
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
>  in __iter__()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
>  in pyarrow.lib.RecordBatchReader.read_next_batch()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
>  in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41971) `toPandas` should support duplicate field names when Arrow optimization is on

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-41971:
--
Issue Type: Bug  (was: Improvement)

> `toPandas` should support duplicate field names when Arrow optimization is on
> -
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> toPandas supports duplicate column names, but for a struct column it does 
> not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]: 
>v  v
> 0  1  1
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]: 
>   struct(1 AS v, 1 AS v)
> 0 (1, 1)
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]: 
>v  v
> 0  1  1
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---
> ArrowInvalid  Traceback (most recent call last)
> Cell In[32], line 1
> > 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
> PandasConversionMixin.toPandas(self)
> 141 tmp_column_names = ["col_{}".format(i) for i in 
> range(len(self.columns))]
> 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
> 144 split_batches=self_destruct
> 145 )
> 146 if len(batches) > 0:
> 147 table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
> PandasConversionMixin._collect_as_arrow(self, split_batches)
> 356 results.append(batch_or_indices)
> 357 else:
> --> 358 results = list(batch_stream)
> 359 finally:
> 360 # Join serving thread and raise any exceptions from 
> collectAsArrowToPython
> 361 jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
> ArrowCollectSerializer.load_stream(self, stream)
>  50 """
>  51 Load a stream of un-ordered Arrow RecordBatches, where the last 
> iteration yields
>  52 a list of indices that can be used to put the RecordBatches in the 
> correct order.
>  53 """
>  54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>  56 yield batch
>  58 # load the batch order indices or propagate any error that occurred 
> in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
> ArrowStreamSerializer.load_stream(self, stream)
>  95 import pyarrow as pa
>  97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>  99 yield batch
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
>  in __iter__()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
>  in pyarrow.lib.RecordBatchReader.read_next_batch()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
>  in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41971) `toPandas` should support duplicate field names when Arrow optimization is on

2023-01-10 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657059#comment-17657059
 ] 

Ruifeng Zheng edited comment on SPARK-41971 at 1/11/23 3:17 AM:


I think this is due to something being wrong in `ArrowConverter`.

In Spark, a schema is just a StructType, but in Arrow that is not the case: a 
schema is a class of its own, distinct from any datatype. This difference may 
be the cause.


was (Author: podongfeng):
I think this is due to something being wrong in `ArrowConverter`.

In Spark, a schema is just a StructType, but in Arrow this is not the case: a 
schema is a class of its own, distinct from any datatype. This difference may 
be the cause.

> `toPandas` should support duplicate field names when Arrow optimization is on
> -
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>
> toPandas supports duplicate column names, but for a struct column it does 
> not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]: 
>v  v
> 0  1  1
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]: 
>   struct(1 AS v, 1 AS v)
> 0 (1, 1)
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]: 
>v  v
> 0  1  1
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---
> ArrowInvalid  Traceback (most recent call last)
> Cell In[32], line 1
> > 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
> PandasConversionMixin.toPandas(self)
> 141 tmp_column_names = ["col_{}".format(i) for i in 
> range(len(self.columns))]
> 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
> 144 split_batches=self_destruct
> 145 )
> 146 if len(batches) > 0:
> 147 table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
> PandasConversionMixin._collect_as_arrow(self, split_batches)
> 356 results.append(batch_or_indices)
> 357 else:
> --> 358 results = list(batch_stream)
> 359 finally:
> 360 # Join serving thread and raise any exceptions from 
> collectAsArrowToPython
> 361 jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
> ArrowCollectSerializer.load_stream(self, stream)
>  50 """
>  51 Load a stream of un-ordered Arrow RecordBatches, where the last 
> iteration yields
>  52 a list of indices that can be used to put the RecordBatches in the 
> correct order.
>  53 """
>  54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>  56 yield batch
>  58 # load the batch order indices or propagate any error that occurred 
> in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
> ArrowStreamSerializer.load_stream(self, stream)
>  95 import pyarrow as pa
>  97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>  99 yield batch
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
>  in __iter__()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
>  in pyarrow.lib.RecordBatchReader.read_next_batch()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
>  in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41879) `DataFrame.collect` should support nested types

2023-01-10 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41879.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39462
[https://github.com/apache/spark/pull/39462]

> `DataFrame.collect` should support nested types
> ---
>
> Key: SPARK-41879
> URL: https://issues.apache.org/jira/browse/SPARK-41879
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/functions.py", 
> line 1578, in pyspark.sql.connect.functions.struct
> Failed example:
> df.select(struct('age', 'name').alias("struct")).collect()
> Expected:
> [Row(struct=Row(age=2, name='Alice')), Row(struct=Row(age=5, name='Bob'))]
> Got:
> [Row(struct={'age': 2, 'name': 'Alice'}), Row(struct={'age': 5, 'name': 
> 'Bob'})]
> {code}
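
For reference, a minimal sketch of the behavior the fix aligns with, using a classic (non-Connect) PySpark session: a struct column should collect back as a nested Row, not a plain dict.

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import struct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")])

# Nested types round-trip as Row objects; the Connect client is expected
# to match this behavior after the fix.
rows = df.select(struct("age", "name").alias("struct")).collect()
assert rows[0].struct == Row(age=2, name="Alice")
{code}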



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657066#comment-17657066
 ] 

Apache Spark commented on SPARK-41965:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39493
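
For context, DataFrameWriterV2 is the writer returned by DataFrame.writeTo; a minimal sketch of the surface the new reference page covers (the catalog table name "cat.db.events" is a hypothetical placeholder, not from this issue):

{code:python}
# Hypothetical v2 catalog table name; any v2 catalog table would do.
df.writeTo("cat.db.events").using("parquet").create()  # create a new table
df.writeTo("cat.db.events").append()                   # append rows
df.writeTo("cat.db.events").overwritePartitions()      # dynamic overwrite
{code}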

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41965) Add DataFrameWriterV2 to PySpark API references

2023-01-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17657067#comment-17657067
 ] 

Apache Spark commented on SPARK-41965:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39493

> Add DataFrameWriterV2 to PySpark API references
> ---
>
> Key: SPARK-41965
> URL: https://issues.apache.org/jira/browse/SPARK-41965
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41972) Fix flaky test in StreamingQueryStatusListenerSuite

2023-01-10 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41972:
--

 Summary: Fix flaky test in StreamingQueryStatusListenerSuite
 Key: SPARK-41972
 URL: https://issues.apache.org/jira/browse/SPARK-41972
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


