[jira] [Assigned] (SPARK-41455) Resolve dtypes inconsistencies of date/timestamp functions

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41455:


Assignee: (was: Apache Spark)

> Resolve dtypes inconsistencies of date/timestamp functions
> --
>
> Key: SPARK-41455
> URL: https://issues.apache.org/jira/browse/SPARK-41455
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> When implementing date/timestamp functions for Spark Connect, we noticed 
> dtypes that are inconsistent with PySpark, as shown below.
> {code:python}
> >>> sdf.select(SF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns]
> dtype: object
> >>> cdf.select(CF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns, America/Los_Angeles]
> {code}
> Affected functions include:
> {code:python}
> to_timestamp, from_utc_timestamp, to_utc_timestamp, timestamp_seconds, 
> current_timestamp, date_trunc
> {code}
> We may have to implement `is_timestamp_ntz_preferred` for Connect.
> After the fix, tests of those date/timestamp functions which use 
> `compare_by_show` should be switched to `toPandas` comparison.
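
For illustration, a minimal sketch of the kind of `toPandas`-based comparison those
tests could move to once the dtypes agree. It assumes `sdf` and `cdf` are the same
input built on a regular PySpark session and a Spark Connect session, `SF`/`CF` are
their respective `functions` modules, and the column name `ts_str` is hypothetical:

{code:python}
import pandas.testing as pdt

# Build the same projection on both sessions (the names above are assumptions).
pdf_classic = sdf.select(SF.to_timestamp(SF.col("ts_str")).alias("ts")).toPandas()
pdf_connect = cdf.select(CF.to_timestamp(CF.col("ts_str")).alias("ts")).toPandas()

# After the fix, both sides should yield the same dtype (e.g. datetime64[ns] when
# timestamp_ntz is preferred), so values and dtypes can be compared directly
# instead of comparing show() output via compare_by_show.
pdt.assert_frame_equal(pdf_classic, pdf_connect, check_dtype=True)
{code}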






[jira] [Assigned] (SPARK-41455) Resolve dtypes inconsistencies of date/timestamp functions

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41455:


Assignee: Apache Spark

> Resolve dtypes inconsistencies of date/timestamp functions
> --
>
> Key: SPARK-41455
> URL: https://issues.apache.org/jira/browse/SPARK-41455
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> When implementing date/timestamp functions for Spark Connect, we noticed 
> dtypes that are inconsistent with PySpark, as shown below.
> {code:python}
> >>> sdf.select(SF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns]
> dtype: object
> >>> cdf.select(CF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns, America/Los_Angeles]
> {code}
> Affected functions include:
> {code:python}
> to_timestamp, from_utc_timestamp, to_utc_timestamp, timestamp_seconds, 
> current_timestamp, date_trunc
> {code}
> We may have to implement `is_timestamp_ntz_preferred` for Connect.
> After the fix, tests of those date/timestamp functions which use 
> `compare_by_show` should be switched to `toPandas` comparison.






[jira] [Commented] (SPARK-41455) Resolve dtypes inconsistencies of date/timestamp functions

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655277#comment-17655277
 ] 

Apache Spark commented on SPARK-41455:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39426

> Resolve dtypes inconsistencies of date/timestamp functions
> --
>
> Key: SPARK-41455
> URL: https://issues.apache.org/jira/browse/SPARK-41455
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> When implementing date/timestamp functions for Spark Connect, we noticed 
> dtypes that are inconsistent with PySpark, as shown below.
> {code:python}
> >>> sdf.select(SF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns]
> dtype: object
> >>> cdf.select(CF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns, America/Los_Angeles]
> {code}
> Affected functions include:
> {code:python}
> to_timestamp, from_utc_timestamp, to_utc_timestamp, timestamp_seconds, 
> current_timestamp, date_trunc
> {code}
> We may have to implement `is_timestamp_ntz_preferred` for Connect.
> After the fix, tests of those date/timestamp functions which use 
> `compare_by_show` should be switched to `toPandas` comparison.






[jira] [Resolved] (SPARK-41905) Function `slice` should handle string in params

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41905.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39420
[https://github.com/apache/spark/pull/39420]

> Function `slice` should handle string in params
> ---
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (
> [1, 2, 3],
> 2,
> 2,
> ),
> (
> [4, 5],
> 2,
> 2,
> ),
> ],
> ["x", "index", "len"],
> )
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
> all(
> [
> df.select(slice(df.x, 2, 2).alias("sliced")).collect() == 
> expected,
> df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() 
> == expected,
> df.select(slice("x", "index", "len").alias("sliced")).collect() 
> == expected,
> ]
> )
> )
> self.assertEqual(
> df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
> [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
> df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
> [Row(sliced=[1, 2]), Row(sliced=[4])],
> ){code}
> {code:java}
>  Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 596, in test_slice
> df.select(slice("x", "index", "len").alias("sliced")).collect() == 
> expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1525, in slice
> raise TypeError(f"start should be a Column or int, but got 
> {type(start).__name__}")
> TypeError: start should be a Column or int, but got str{code}
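
For illustration only, a hedged sketch (not the actual Connect fix) of how string
parameters could be normalized to columns so that slice("x", "index", "len") behaves
like the Column-based calls in the test above; the helper name is hypothetical:

{code:python}
from pyspark.sql import Column, DataFrame
from pyspark.sql.functions import col, lit, slice as sql_slice


def slice_by_name(df: DataFrame, x, start, length) -> DataFrame:
    """Hypothetical helper: accept a Column, a column name (str) or an int."""

    def to_col(value):
        if isinstance(value, Column):
            return value
        if isinstance(value, str):
            return col(value)
        if isinstance(value, int):
            return lit(value)
        raise TypeError(f"expected Column, str or int, got {type(value).__name__}")

    return df.select(sql_slice(to_col(x), to_col(start), to_col(length)).alias("sliced"))
{code}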






[jira] [Resolved] (SPARK-41921) Enable doctests in connect.column and connect.functions

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41921.
--
Resolution: Fixed

Issue resolved by pull request 39423
[https://github.com/apache/spark/pull/39423]

> Enable doctests in connect.column and connect.functions
> ---
>
> Key: SPARK-41921
> URL: https://issues.apache.org/jira/browse/SPARK-41921
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41905) Function `slice` should handle string in params

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41905:


Assignee: Hyukjin Kwon

> Function `slice` should handle string in params
> ---
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (
> [1, 2, 3],
> 2,
> 2,
> ),
> (
> [4, 5],
> 2,
> 2,
> ),
> ],
> ["x", "index", "len"],
> )
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
> all(
> [
> df.select(slice(df.x, 2, 2).alias("sliced")).collect() == 
> expected,
> df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() 
> == expected,
> df.select(slice("x", "index", "len").alias("sliced")).collect() 
> == expected,
> ]
> )
> )
> self.assertEqual(
> df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
> [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
> df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
> [Row(sliced=[1, 2]), Row(sliced=[4])],
> ){code}
> {code:java}
>  Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 596, in test_slice
> df.select(slice("x", "index", "len").alias("sliced")).collect() == 
> expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1525, in slice
> raise TypeError(f"start should be a Column or int, but got 
> {type(start).__name__}")
> TypeError: start should be a Column or int, but got str{code}






[jira] [Assigned] (SPARK-41906) Handle Function `rand() `

2023-01-05 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41906:
-

Assignee: Hyukjin Kwon

> Handle Function `rand() `
> -
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
> assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
> assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 299, in test_rand_functions
> rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2917, in select
> jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2537, in _jcols
> return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2524, in _jseq
> return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in _to_seq
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in 
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 65, in _to_java_column
> raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type 
> . For column literals, use 'lit', 
> 'array', 'struct' or 'create_map' function.
> {code}






[jira] [Resolved] (SPARK-41906) Handle Function `rand() `

2023-01-05 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41906.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39421
[https://github.com/apache/spark/pull/39421]

> Handle Function `rand() `
> -
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
> assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
> assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 299, in test_rand_functions
> rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2917, in select
> jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2537, in _jcols
> return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2524, in _jseq
> return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in _to_seq
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in 
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 65, in _to_java_column
> raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type 
> . For column literals, use 'lit', 
> 'array', 'struct' or 'create_map' function.
> {code}






[jira] [Resolved] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument

2023-01-05 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41869.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39418
[https://github.com/apache/spark/pull/39418]

> DataFrame dropDuplicates should throw error on non list argument
> 
>
> Key: SPARK-41869
> URL: https://issues.apache.org/jira/browse/SPARK-41869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", 
> "age"])
> # shouldn't drop a non-null row
> self.assertEqual(df.dropDuplicates().count(), 2)
> self.assertEqual(df.dropDuplicates(["name"]).count(), 1)
> self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)
> type_error_msg = "Parameter 'subset' must be a list of columns"
> with self.assertRaisesRegex(TypeError, type_error_msg):
> df.dropDuplicates("name"){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 128, in test_drop_duplicates
>     with self.assertRaisesRegex(TypeError, type_error_msg):
> AssertionError: TypeError not raised{code}
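
A minimal sketch of the argument check the test above expects, assuming the desired
behaviour is to reject a plain string for `subset` rather than silently treating it
as an iterable of single-character column names (the wrapper is hypothetical, not
the actual Connect change):

{code:python}
from pyspark.sql import DataFrame


def drop_duplicates_checked(df: DataFrame, subset=None) -> DataFrame:
    # Hypothetical wrapper illustrating the expected validation.
    if subset is not None and not isinstance(subset, (list, tuple)):
        raise TypeError("Parameter 'subset' must be a list of columns")
    return df.dropDuplicates(subset) if subset is not None else df.dropDuplicates()
{code}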






[jira] [Assigned] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument

2023-01-05 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41869:
-

Assignee: Hyukjin Kwon

> DataFrame dropDuplicates should throw error on non list argument
> 
>
> Key: SPARK-41869
> URL: https://issues.apache.org/jira/browse/SPARK-41869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", 
> "age"])
> # shouldn't drop a non-null row
> self.assertEqual(df.dropDuplicates().count(), 2)
> self.assertEqual(df.dropDuplicates(["name"]).count(), 1)
> self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)
> type_error_msg = "Parameter 'subset' must be a list of columns"
> with self.assertRaisesRegex(TypeError, type_error_msg):
> df.dropDuplicates("name"){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 128, in test_drop_duplicates
>     with self.assertRaisesRegex(TypeError, type_error_msg):
> AssertionError: TypeError not raised{code}






[jira] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2023-01-05 Thread zzzzming95 (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39743 ]


ming95 deleted comment on SPARK-39743:


was (Author: zing):
[~euigeun_chung] 

 

 see this Jira : https://issues.apache.org/jira/browse/SPARK-33978

> Unable to set zstd compression level while writing parquet files
> 
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Assignee: ming95
>Priority: Minor
> Fix For: 3.4.0
>
>
> While writing zstd-compressed parquet files, the setting 
> `spark.io.compression.zstd.level` does not have any effect on the zstd 
> compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd cli tool, we confirmed that setting a higher compression level 
> for the same file tested in spark resulted in a smaller file.






[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2023-01-05 Thread zzzzming95 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655244#comment-17655244
 ] 

ming95 commented on SPARK-39743:


[~euigeun_chung] 

 

 see this Jira : https://issues.apache.org/jira/browse/SPARK-33978

> Unable to set zstd compression level while writing parquet files
> 
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Assignee: ming95
>Priority: Minor
> Fix For: 3.4.0
>
>
> While writing zstd-compressed parquet files, the setting 
> `spark.io.compression.zstd.level` does not have any effect on the zstd 
> compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd cli tool, we confirmed that setting a higher compression level 
> for the same file tested in spark resulted in a smaller file.
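
A hedged example of how the level could be requested, assuming (per SPARK-33978,
referenced above) that Parquet's zstd level is read from the parquet-hadoop option
`parquet.compression.codec.zstd.level`, while `spark.io.compression.zstd.level` only
configures Spark's internal compression codec:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Compress Parquet output with zstd...
    .config("spark.sql.parquet.compression.codec", "zstd")
    # ...and pass the level through to parquet-hadoop (assumption based on SPARK-33978).
    .config("spark.hadoop.parquet.compression.codec.zstd.level", "9")
    .getOrCreate()
)

spark.range(1_000_000).write.mode("overwrite").parquet("/tmp/zstd_level_test")
{code}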






[jira] [Commented] (SPARK-41538) Metadata column should be appended at the end of project list

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655243#comment-17655243
 ] 

Apache Spark commented on SPARK-41538:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39425

> Metadata column should be appended at the end of project list
> -
>
> Key: SPARK-41538
> URL: https://issues.apache.org/jira/browse/SPARK-41538
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
>
> For the following query:
>  
> {code:java}
> CREATE TABLE table_1 (
>   a ARRAY,
>  s STRUCT)
> USING parquet;
> CREATE VIEW view_1 (id)
> AS WITH source AS (
>     SELECT * FROM table_1
> ),
> renamed AS (
>     SELECT
>      s.id
>     FROM source
> )
> SELECT id FROM renamed;
> with foo AS (
>   SELECT 'a' as id
> ),
> bar AS (
>   SELECT 'a' as id
> )
> SELECT
>   1
> FROM foo
> FULL OUTER JOIN bar USING(id)
> FULL OUTER JOIN view_1 USING(id)
> WHERE foo.id IS NOT NULL{code}
> There will be the following error:
>  
> {code:java}
> class org.apache.spark.sql.types.ArrayType cannot be cast to class 
> org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType 
> and org.apache.spark.sql.types.StructType are in unnamed module of loader 
> 'app')
> java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>     at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:108)
>     at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:108)
>     at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:114)
>     at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:193)
>     at 
> org.apache.spark.sql.catalyst.expressions.AliasHelper$$anonfun$getAliasMap$1.applyOrElse(AliasHelper.scala:50)
>     at 
> org.apache.spark.sql.catalyst.expressions.AliasHelper$$anonfun$getAliasMap$1.applyOrElse(AliasHelper.scala:50)
>     at scala.collection.immutable.List.collect(List.scala:315)
>     at 
> org.apache.spark.sql.catalyst.expressions.AliasHelper.getAliasMap(AliasHelper.scala:50)
>     at 
> org.apache.spark.sql.catalyst.expressions.AliasHelper.getAliasMap$(AliasHelper.scala:47)
>     at 
> org.apache.spark.sql.catalyst.optimizer.CollapseProject$.getAliasMap(Optimizer.scala:992)
>     at 
> org.apache.spark.sql.catalyst.optimizer.CollapseProject$.canCollapseExpressions(Optimizer.scala:1029){code}
> This is caused by inconsistent metadata column positions in the following 
> two nodes:
>  * Table relation: at the end of the output
>  * Project list: at the beginning of the list
> When the InlineCTE rule executes, the metadata column in the project list is 
> wrongly combined with the table output.
>  
>  
>  






[jira] [Assigned] (SPARK-41708) Pull v1write information to WriteFiles

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41708:
---

Assignee: XiDuo You

> Pull v1write information to WriteFiles
> --
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> Make WriteFiles hold v1 write information






[jira] [Resolved] (SPARK-41708) Pull v1write information to WriteFiles

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41708.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39277
[https://github.com/apache/spark/pull/39277]

> Pull v1write information to WriteFiles
> --
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> Make WriteFiles hold v1 write information






[jira] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-01-05 Thread Sandeep Singh (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41818 ]


Sandeep Singh deleted comment on SPARK-41818:
---

was (Author: techaddict):
Could be moved under https://issues.apache.org/jira/browse/SPARK-41279 

> Support DataFrameWriter.saveAsTable
> ---
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", 
> line 369, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
>     df.write.saveAsTable("tblA")
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File " pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[2]>", line 1, in 
> 
>         df.write.saveAsTable("tblA")
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", 
> line 350, in saveAsTable
>         
> self._spark.client.execute_command(self._write.command(self._spark.client))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 459, in execute_command
>         self._execute(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 547, in _execute
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: 
> (java.lang.ClassNotFoundException) .DefaultSource{code}
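
A hedged workaround sketch for the doctest above, on the assumption that the
ClassNotFoundException comes from an empty default source name being resolved to
".DefaultSource"; naming the format explicitly avoids that lookup (illustrative
only, not the eventual fix):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

# Hypothetical workaround: name the format explicitly instead of relying on the
# session default, so no "<empty>.DefaultSource" class lookup is attempted.
df.write.format("parquet").saveAsTable("tblA")
{code}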






[jira] [Commented] (SPARK-41921) Enable doctests in connect.column and connect.functions

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655234#comment-17655234
 ] 

Apache Spark commented on SPARK-41921:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39423

> Enable doctests in connect.column and connect.functions
> ---
>
> Key: SPARK-41921
> URL: https://issues.apache.org/jira/browse/SPARK-41921
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41921) Enable doctests in connect.column and connect.functions

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41921:


Assignee: Sandeep Singh  (was: Apache Spark)

> Enable doctests in connect.column and connect.functions
> ---
>
> Key: SPARK-41921
> URL: https://issues.apache.org/jira/browse/SPARK-41921
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41921) Enable doctests in connect.column and connect.functions

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41921:


Assignee: Apache Spark  (was: Sandeep Singh)

> Enable doctests in connect.column and connect.functions
> ---
>
> Key: SPARK-41921
> URL: https://issues.apache.org/jira/browse/SPARK-41921
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Created] (SPARK-41921) Enable doctests in connect.column and connect.functions

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41921:
-

 Summary: Enable doctests in connect.column and connect.functions
 Key: SPARK-41921
 URL: https://issues.apache.org/jira/browse/SPARK-41921
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Sandeep Singh
 Fix For: 3.4.0









[jira] [Commented] (SPARK-41875) Throw proper errors in Dataset.to()

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655233#comment-17655233
 ] 

Apache Spark commented on SPARK-41875:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39422

> Throw proper errors in Dataset.to()
> ---
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}






[jira] [Assigned] (SPARK-41875) Throw proper errors in Dataset.to()

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41875:


Assignee: Apache Spark

> Throw proper errors in Dataset.to()
> ---
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}






[jira] [Assigned] (SPARK-41875) Throw proper errors in Dataset.to()

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41875:


Assignee: (was: Apache Spark)

> Throw proper errors in Dataset.to()
> ---
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}






[jira] [Resolved] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41162.
-
Fix Version/s: 3.2.4
   3.3.2
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 39409
[https://github.com/apache/spark/pull/39409]

> Anti-join must not be pushed below aggregation with ambiguous predicates
> 
>
> Key: SPARK-41162
> URL: https://issues.apache.org/jira/browse/SPARK-41162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.1, 3.2.3, 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.4, 3.3.2, 3.4.0
>
>
> The following query should return a single row, as all values of {{id}} 
> except the largest will be eliminated by the anti-join:
> {code}
> val ids = Seq(1, 2, 3).toDF("id").distinct()
> val result = ids.withColumn("id", $"id" + 1).join(ids, "id", 
> "left_anti").collect()
> assert(result.length == 1)
> {code}
> Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the 
> assertion should still hold but is false.
> Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left 
> {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never 
> be true.
> {code}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
> !Join LeftAnti, (id#752 = id#750)  'Aggregate [id#750], 
> [(id#750 + 1) AS id#752]
> !:- Aggregate [id#750], [(id#750 + 1) AS id#752]   +- 'Join LeftAnti, 
> ((id#750 + 1) = id#750)
> !:  +- LocalRelation [id#750] :- LocalRelation 
> [id#750]
> !+- Aggregate [id#750], [id#750]  +- Aggregate [id#750], 
> [id#750]
> !   +- LocalRelation [id#750]+- LocalRelation 
> [id#750]
> {code}
> The optimizer then rightly removes the left-anti join altogether, returning 
> the left child only.
> Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that 
> reference left *and* right child.
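
For readers working from Python, the same reproduction translated to PySpark (a
direct, hedged translation of the Scala snippet above; the pre-fix behaviour is
identical):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

ids = spark.createDataFrame([(1,), (2,), (3,)], ["id"]).distinct()
# Shift every id by one and anti-join against the original ids: only the largest
# shifted value (4) has no match, so exactly one row should survive.
result = ids.withColumn("id", col("id") + 1).join(ids, "id", "left_anti").collect()
assert len(result) == 1
{code}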






[jira] [Assigned] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41162:
---

Assignee: Enrico Minack

> Anti-join must not be pushed below aggregation with ambiguous predicates
> 
>
> Key: SPARK-41162
> URL: https://issues.apache.org/jira/browse/SPARK-41162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.1, 3.2.3, 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
>  Labels: correctness
>
> The following query should return a single row, as all values of {{id}} 
> except the largest will be eliminated by the anti-join:
> {code}
> val ids = Seq(1, 2, 3).toDF("id").distinct()
> val result = ids.withColumn("id", $"id" + 1).join(ids, "id", 
> "left_anti").collect()
> assert(result.length == 1)
> {code}
> Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the 
> assertion should still hold but is false.
> Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left 
> {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never 
> be true.
> {code}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
> !Join LeftAnti, (id#752 = id#750)  'Aggregate [id#750], 
> [(id#750 + 1) AS id#752]
> !:- Aggregate [id#750], [(id#750 + 1) AS id#752]   +- 'Join LeftAnti, 
> ((id#750 + 1) = id#750)
> !:  +- LocalRelation [id#750] :- LocalRelation 
> [id#750]
> !+- Aggregate [id#750], [id#750]  +- Aggregate [id#750], 
> [id#750]
> !   +- LocalRelation [id#750]+- LocalRelation 
> [id#750]
> {code}
> The optimizer then rightly removes the left-anti join altogether, returning 
> the left child only.
> Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that 
> reference left *and* right child.






[jira] [Created] (SPARK-41920) Task throw Exception call cleanUpAllAllocatedMemory cause throw NPE

2023-01-05 Thread Yi Zhu (Jira)
Yi Zhu created SPARK-41920:
--

 Summary: Task throw Exception call cleanUpAllAllocatedMemory cause 
throw NPE 
 Key: SPARK-41920
 URL: https://issues.apache.org/jira/browse/SPARK-41920
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: Yi Zhu


{code:java}
23/01/03 21:41:18 INFO SortBasedPusher: Pushdata is not empty , do push.
Traceback (most recent call last):
  File 
"/mnt/ssd/0/yarn/nm-local-dir/usercache/rcmd_feature/appcache/application_1671694574014_2488441/container_e260_1671694574014_2488441_01_000107/pyspark.zip/pyspark/daemon.py",
 line 186, in manager
  File 
"/mnt/ssd/0/yarn/nm-local-dir/usercache/rcmd_feature/appcache/application_1671694574014_2488441/container_e260_1671694574014_2488441_01_000107/pyspark.zip/pyspark/daemon.py",
 line 74, in worker
  File 
"/mnt/ssd/0/yarn/nm-local-dir/usercache/rcmd_feature/appcache/application_1671694574014_2488441/container_e260_1671694574014_2488441_01_000107/pyspark.zip/pyspark/worker.py",
 line 643, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File 
"/mnt/ssd/0/yarn/nm-local-dir/usercache/rcmd_feature/appcache/application_1671694574014_2488441/container_e260_1671694574014_2488441_01_000107/pyspark.zip/pyspark/serializers.py",
 line 564, in read_int
raise EOFError
EOFError
23/01/03 21:41:29 ERROR Executor: Exception in task 605.1 in stage 94.0 (TID 
58026)
java.lang.NullPointerException
at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:399)
at 
org.apache.spark.shuffle.rss.SortBasedPusher.pushData(SortBasedPusher.java:155)
at 
org.apache.spark.shuffle.rss.SortBasedPusher.spill(SortBasedPusher.java:317)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:177)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:289)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:116)
at 
org.apache.spark.sql.execution.python.HybridRowQueue.createNewQueue(RowQueue.scala:227)
at 
org.apache.spark.sql.execution.python.HybridRowQueue.add(RowQueue.scala:250)
at 
org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:125)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at 
scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1159)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1174)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1212)
at 
scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1215)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
at 
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.writeIteratorToStream(PythonUDFRunner.scala:53)
at 
org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066)
at 
org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
 {code}






[jira] [Resolved] (SPARK-41912) Subquery should not validate CTE

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41912.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39414
[https://github.com/apache/spark/pull/39414]

> Subquery should not validate CTE
> 
>
> Key: SPARK-41912
> URL: https://issues.apache.org/jira/browse/SPARK-41912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41831) DataFrame.transform: Only Column or String can be used for projections

2023-01-05 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41831:
-

Assignee: Ruifeng Zheng

> DataFrame.transform: Only Column or String can be used for projections
> --
>
> Key: SPARK-41831
> URL: https://issues.apache.org/jira/browse/SPARK-41831
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1168, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(cast_all_to_int).transform(sort_columns_asc).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", 
> line 1, in 
>         df.transform(cast_all_to_int).transform(sort_columns_asc).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", 
> line 2, in cast_all_to_int
>         return input_df.select([col(col_name).cast("int") for col_name in 
> input_df.columns])
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), 
> session=self._session)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 344, in __init__
>         self._verify_expressions()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can 
> be used for projections: '[Column<'(ColumnReference(int) (int))'>, 
> Column<'(ColumnReference(float) (int))'>]'.
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1179, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(add_n, 1).transform(add_n, n=10).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", 
> line 1, in 
>         df.transform(add_n, 1).transform(add_n, n=10).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", 
> line 2, in add_n
>         return input_df.select([(col(col_name) + n).alias(col_name)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), 
> session=self._session)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 344, in __init__
>         self._verify_expressions()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can 
> be used for projections: '[Column<'Alias(+(ColumnReference(int), Literal(1)), 
> (int))'>, Column<'Alias(+(ColumnReference(float), Literal(1)), 
> (float))'>]'.{code}
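
A small hedged sketch of a workaround for the failing doctest above: unpacking the
list into positional arguments makes each projection a plain Column, which the
Connect plan validation accepts (this illustrates the symptom, not the eventual fix):

{code:python}
from pyspark.sql.functions import col


def cast_all_to_int(input_df):
    # Passing *columns instead of a single list argument keeps each projection a
    # Column, avoiding "Only Column or String can be used for projections".
    return input_df.select(*[col(c).cast("int") for c in input_df.columns])
{code}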






[jira] [Resolved] (SPARK-41831) DataFrame.transform: Only Column or String can be used for projections

2023-01-05 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41831.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39405
[https://github.com/apache/spark/pull/39405]

> DataFrame.transform: Only Column or String can be used for projections
> --
>
> Key: SPARK-41831
> URL: https://issues.apache.org/jira/browse/SPARK-41831
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1168, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(cast_all_to_int).transform(sort_columns_asc).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", 
> line 1, in 
>         df.transform(cast_all_to_int).transform(sort_columns_asc).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", 
> line 2, in cast_all_to_int
>         return input_df.select([col(col_name).cast("int") for col_name in 
> input_df.columns])
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), 
> session=self._session)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 344, in __init__
>         self._verify_expressions()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can 
> be used for projections: '[Column<'(ColumnReference(int) (int))'>, 
> Column<'(ColumnReference(float) (int))'>]'.
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1179, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(add_n, 1).transform(add_n, n=10).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", 
> line 1, in 
>         df.transform(add_n, 1).transform(add_n, n=10).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", 
> line 2, in add_n
>         return input_df.select([(col(col_name) + n).alias(col_name)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), 
> session=self._session)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 344, in __init__
>         self._verify_expressions()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can 
> be used for projections: '[Column<'Alias(+(ColumnReference(int), Literal(1)), 
> (int))'>, Column<'Alias(+(ColumnReference(float), Literal(1)), 
> (float))'>]'.{code}






[jira] [Assigned] (SPARK-41906) Handle Function `rand() `

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41906:


Assignee: Apache Spark

> Handle Function `rand() `
> -
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
> assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
> assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 299, in test_rand_functions
> rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2917, in select
> jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2537, in _jcols
> return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2524, in _jseq
> return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in _to_seq
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in 
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 65, in _to_java_column
> raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type 
> . For column literals, use 'lit', 
> 'array', 'struct' or 'create_map' function.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41906) Handle Function `rand() `

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655225#comment-17655225
 ] 

Apache Spark commented on SPARK-41906:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39421

> Handle Function `rand() `
> -
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
> assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
> assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 299, in test_rand_functions
> rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2917, in select
> jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2537, in _jcols
> return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2524, in _jseq
> return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in _to_seq
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in 
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 65, in _to_java_column
> raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type 
> . For column literals, use 'lit', 
> 'array', 'struct' or 'create_map' function.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41906) Handle Function `rand() `

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41906:


Assignee: (was: Apache Spark)

> Handle Function `rand() `
> -
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
> assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
> assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 299, in test_rand_functions
> rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2917, in select
> jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2537, in _jcols
> return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
> line 2524, in _jseq
> return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in _to_seq
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 86, in 
> cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
> 65, in _to_java_column
> raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type 
> . For column literals, use 'lit', 
> 'array', 'struct' or 'create_map' function.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41905) Function `slice` should handle string in params

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41905:


Assignee: Apache Spark

> Function `slice` should handle string in params
> ---
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (
> [1, 2, 3],
> 2,
> 2,
> ),
> (
> [4, 5],
> 2,
> 2,
> ),
> ],
> ["x", "index", "len"],
> )
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
> all(
> [
> df.select(slice(df.x, 2, 2).alias("sliced")).collect() == 
> expected,
> df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() 
> == expected,
> df.select(slice("x", "index", "len").alias("sliced")).collect() 
> == expected,
> ]
> )
> )
> self.assertEqual(
> df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
> [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
> df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
> [Row(sliced=[1, 2]), Row(sliced=[4])],
> ){code}
> {code:java}
>  Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 596, in test_slice
> df.select(slice("x", "index", "len").alias("sliced")).collect() == 
> expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1525, in slice
> raise TypeError(f"start should be a Column or int, but got 
> {type(start).__name__}")
> TypeError: start should be a Column or int, but got str{code}
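For reference, the classic `pyspark.sql.functions.slice` accepts a Column, a column
name (str), or an int for `start` and `length`. Below is a minimal sketch of the kind
of coercion the Connect version would need; the helper name is illustrative only and
not the actual implementation:

{code:python}
from pyspark.sql.functions import col, lit

def _as_start_or_length(value):
    # Hypothetical helper (not the actual Connect code): column names become
    # column references and ints become literals, so slice("x", "index", "len")
    # can resolve start/length from columns of the DataFrame.
    if isinstance(value, str):
        return col(value)
    if isinstance(value, int):
        return lit(value)
    return value
{code}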



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41905) Function `slice` should handle string in params

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41905:


Assignee: (was: Apache Spark)

> Function `slice` should handle string in params
> ---
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (
> [1, 2, 3],
> 2,
> 2,
> ),
> (
> [4, 5],
> 2,
> 2,
> ),
> ],
> ["x", "index", "len"],
> )
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
> all(
> [
> df.select(slice(df.x, 2, 2).alias("sliced")).collect() == 
> expected,
> df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() 
> == expected,
> df.select(slice("x", "index", "len").alias("sliced")).collect() 
> == expected,
> ]
> )
> )
> self.assertEqual(
> df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
> [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
> df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
> [Row(sliced=[1, 2]), Row(sliced=[4])],
> ){code}
> {code:java}
>  Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 596, in test_slice
> df.select(slice("x", "index", "len").alias("sliced")).collect() == 
> expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1525, in slice
> raise TypeError(f"start should be a Column or int, but got 
> {type(start).__name__}")
> TypeError: start should be a Column or int, but got str{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41905) Function `slice` should handle string in params

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655223#comment-17655223
 ] 

Apache Spark commented on SPARK-41905:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39420

> Function `slice` should handle string in params
> ---
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (
> [1, 2, 3],
> 2,
> 2,
> ),
> (
> [4, 5],
> 2,
> 2,
> ),
> ],
> ["x", "index", "len"],
> )
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
> all(
> [
> df.select(slice(df.x, 2, 2).alias("sliced")).collect() == 
> expected,
> df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() 
> == expected,
> df.select(slice("x", "index", "len").alias("sliced")).collect() 
> == expected,
> ]
> )
> )
> self.assertEqual(
> df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
> [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
> df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
> [Row(sliced=[1, 2]), Row(sliced=[4])],
> ){code}
> {code:java}
>  Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 596, in test_slice
> df.select(slice("x", "index", "len").alias("sliced")).collect() == 
> expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1525, in slice
> raise TypeError(f"start should be a Column or int, but got 
> {type(start).__name__}")
> TypeError: start should be a Column or int, but got str{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41905) Function `slice` should handle string in params

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655222#comment-17655222
 ] 

Apache Spark commented on SPARK-41905:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39420

> Function `slice` should handle string in params
> ---
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (
> [1, 2, 3],
> 2,
> 2,
> ),
> (
> [4, 5],
> 2,
> 2,
> ),
> ],
> ["x", "index", "len"],
> )
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
> all(
> [
> df.select(slice(df.x, 2, 2).alias("sliced")).collect() == 
> expected,
> df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() 
> == expected,
> df.select(slice("x", "index", "len").alias("sliced")).collect() 
> == expected,
> ]
> )
> )
> self.assertEqual(
> df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
> [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
> df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
> [Row(sliced=[1, 2]), Row(sliced=[4])],
> ){code}
> {code:java}
>  Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 596, in test_slice
> df.select(slice("x", "index", "len").alias("sliced")).collect() == 
> expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1525, in slice
> raise TypeError(f"start should be a Column or int, but got 
> {type(start).__name__}")
> TypeError: start should be a Column or int, but got str{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41840) DataFrame.show(): 'Column' object is not callable

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655219#comment-17655219
 ] 

Apache Spark commented on SPARK-41840:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39419

> DataFrame.show(): 'Column' object is not callable
> -
>
> Key: SPARK-41840
> URL: https://issues.apache.org/jira/browse/SPARK-41840
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 855, in pyspark.sql.connect.functions.first
> Failed example:
>     df.groupby("name").agg(first("age", 
> ignorenulls=True)).orderBy("name").show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.groupby("name").agg(first("age", 
> ignorenulls=True)).orderBy("name").show()
>     TypeError: 'Column' object is not callable{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41840) DataFrame.show(): 'Column' object is not callable

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655217#comment-17655217
 ] 

Apache Spark commented on SPARK-41840:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39419

> DataFrame.show(): 'Column' object is not callable
> -
>
> Key: SPARK-41840
> URL: https://issues.apache.org/jira/browse/SPARK-41840
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 855, in pyspark.sql.connect.functions.first
> Failed example:
>     df.groupby("name").agg(first("age", 
> ignorenulls=True)).orderBy("name").show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.groupby("name").agg(first("age", 
> ignorenulls=True)).orderBy("name").show()
>     TypeError: 'Column' object is not callable{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41919) Unify the schema or datatype in protos

2023-01-05 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41919:
-

 Summary: Unify the schema or datatype in protos
 Key: SPARK-41919
 URL: https://issues.apache.org/jira/browse/SPARK-41919
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng


This ticket only focuses on the protos sent from the client to the server.

We normally use

{code:java}
  oneof schema {
    DataType datatype = 2;

    // Server will use Catalyst parser to parse this string to DataType.
    string datatype_str = 3;
  }
{code}

to represent a schema or datatype.

Actually, we can simplify it to just a string: on the server, we can easily
parse either a DDL-formatted schema or a JSON-formatted one.


{code:java}
  // (Optional) The schema of local data.
  // It should be either a DDL-formatted type string or a JSON string.
  //
  // The server side will update the column names and data types according to
  // this schema.
  // If the 'data' is not provided, then this schema will be required.
  optional string schema = 2;
{code}
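
As a rough sketch of the proposal (using plain PySpark types and a made-up helper
name; the real server would go through the Catalyst parsers), the same schema can
travel as either a JSON string or a DDL string, and the server can dispatch on
whichever form it receives:

{code:python}
import json
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
as_json = schema.json()         # JSON form, e.g. '{"type":"struct","fields":[...]}'
as_ddl = "id INT, name STRING"  # DDL-formatted equivalent

def parse_schema_string(s: str) -> StructType:
    # Hypothetical server-side dispatch: try JSON first, otherwise treat the
    # string as DDL, which the real server would hand to the Catalyst parser.
    try:
        return StructType.fromJson(json.loads(s))
    except ValueError:
        raise NotImplementedError("DDL parsing is done by Catalyst on the server")

assert parse_schema_string(as_json) == schema
{code}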




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41918) Refine the naming in proto messages

2023-01-05 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41918:
-

 Summary: Refine the naming in proto messages
 Key: SPARK-41918
 URL: https://issues.apache.org/jira/browse/SPARK-41918
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng


Normally, we name the fields after the corresponding LogicalPlan or DataFrame
API, but the names are not consistent across the protos; for example, the column name:


{code:java}
  message UnresolvedRegex {
    // (Required) The column name used to extract column with regex.
    string col_name = 1;
  }
{code}


{code:java}
  message Alias {
    // (Required) The expression that alias will be added on.
    Expression expr = 1;

    // (Required) a list of name parts for the alias.
    //
    // Scalar columns only has one name that presents.
    repeated string name = 2;

    // (Optional) Alias metadata expressed as a JSON map.
    optional string metadata = 3;
  }
{code}



{code:java}
// Relation of type [[Deduplicate]] which have duplicate rows removed, could consider either only
// the subset of columns or all the columns.
message Deduplicate {
  // (Required) Input relation for a Deduplicate.
  Relation input = 1;

  // (Optional) Deduplicate based on a list of column names.
  //
  // This field does not co-use with `all_columns_as_keys`.
  repeated string column_names = 2;

  // (Optional) Deduplicate based on all the columns of the input relation.
  //
  // This field does not co-use with `column_names`.
  optional bool all_columns_as_keys = 3;
}
{code}


{code:java}
// Computes basic statistics for numeric and string columns, including count, mean, stddev, min,
// and max. If no columns are given, this function computes statistics for all numerical or
// string columns.
message StatDescribe {
  // (Required) The input relation.
  Relation input = 1;

  // (Optional) Columns to compute statistics on.
  repeated string cols = 2;
}
{code}


We should probably unify the naming:

single column -> `column`

multiple columns -> `columns`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41869:


Assignee: (was: Apache Spark)

> DataFrame dropDuplicates should throw error on non list argument
> 
>
> Key: SPARK-41869
> URL: https://issues.apache.org/jira/browse/SPARK-41869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", 
> "age"])
> # shouldn't drop a non-null row
> self.assertEqual(df.dropDuplicates().count(), 2)
> self.assertEqual(df.dropDuplicates(["name"]).count(), 1)
> self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)
> type_error_msg = "Parameter 'subset' must be a list of columns"
> with self.assertRaisesRegex(TypeError, type_error_msg):
> df.dropDuplicates("name"){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 128, in test_drop_duplicates
>     with self.assertRaisesRegex(TypeError, type_error_msg):
> AssertionError: TypeError not raised{code}
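A minimal sketch of the argument validation the parity test above expects, assuming a
hypothetical helper on the Connect side rather than the actual implementation:

{code:python}
from typing import List, Optional

def _check_drop_duplicates_subset(subset: Optional[List[str]]) -> None:
    # Hypothetical check reproducing the classic PySpark error: a bare string
    # such as "name" must be rejected instead of being iterated character by
    # character.
    if subset is not None and not isinstance(subset, (list, tuple)):
        raise TypeError("Parameter 'subset' must be a list of columns")
{code}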



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655216#comment-17655216
 ] 

Apache Spark commented on SPARK-41869:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39418

> DataFrame dropDuplicates should throw error on non list argument
> 
>
> Key: SPARK-41869
> URL: https://issues.apache.org/jira/browse/SPARK-41869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", 
> "age"])
> # shouldn't drop a non-null row
> self.assertEqual(df.dropDuplicates().count(), 2)
> self.assertEqual(df.dropDuplicates(["name"]).count(), 1)
> self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)
> type_error_msg = "Parameter 'subset' must be a list of columns"
> with self.assertRaisesRegex(TypeError, type_error_msg):
> df.dropDuplicates("name"){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 128, in test_drop_duplicates
>     with self.assertRaisesRegex(TypeError, type_error_msg):
> AssertionError: TypeError not raised{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41869:


Assignee: Apache Spark

> DataFrame dropDuplicates should throw error on non list argument
> 
>
> Key: SPARK-41869
> URL: https://issues.apache.org/jira/browse/SPARK-41869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", 
> "age"])
> # shouldn't drop a non-null row
> self.assertEqual(df.dropDuplicates().count(), 2)
> self.assertEqual(df.dropDuplicates(["name"]).count(), 1)
> self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)
> type_error_msg = "Parameter 'subset' must be a list of columns"
> with self.assertRaisesRegex(TypeError, type_error_msg):
> df.dropDuplicates("name"){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 128, in test_drop_duplicates
>     with self.assertRaisesRegex(TypeError, type_error_msg):
> AssertionError: TypeError not raised{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41861) Make v2 ScanBuilders' build() return typed scan

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41861.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39367
[https://github.com/apache/spark/pull/39367]

> Make v2 ScanBuilders' build() return typed scan
> ---
>
> Key: SPARK-41861
> URL: https://issues.apache.org/jira/browse/SPARK-41861
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Assignee: Lorenzo Martini
>Priority: Trivial
> Fix For: 3.4.0
>
>
> The `ScanBuilder` interface has a `build()` method that returns `Scan` 
> objects. All of its implementations return scans of the builder's own type: 
> e.g. `ParquetScanBuilder` returns a `ParquetScan`, `TextScanBuilder` a 
> `TextScan`, etc. However, in the overriding method declarations the return 
> type is not narrowed to the more specific type but left as the generic 
> `Scan`. This is problematic when working with scan objects because a manual 
> cast is required, even though we know for a fact that a `ParquetScanBuilder`'s 
> `build()` method returns a `ParquetScan`. For ease of development (and for 
> stricter correctness checks, as we presumably wouldn't want a 
> `ParquetScanBuilder` to return a non-Parquet scan), it would be very nice if 
> the `build()` method of each `ScanBuilder` implementation returned a more 
> strictly typed object instead of the generic `Scan`.
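
The request boils down to covariant return types on the overriding `build()` methods.
Here is a generic, hypothetical illustration of the idea, written with Python type
hints rather than the actual Scala sources:

{code:python}
class Scan: ...
class ParquetScan(Scan): ...

class ScanBuilder:
    def build(self) -> Scan:
        return Scan()

class ParquetScanBuilder(ScanBuilder):
    # Narrowing the declared return type lets callers that hold a
    # ParquetScanBuilder get a ParquetScan back without a manual cast, which
    # is what the ticket proposes for the Scala ScanBuilder implementations.
    def build(self) -> ParquetScan:
        return ParquetScan()
{code}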



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41861) Make v2 ScanBuilders' build() return typed scan

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41861:
---

Assignee: Lorenzo Martini

> Make v2 ScanBuilders' build() return typed scan
> ---
>
> Key: SPARK-41861
> URL: https://issues.apache.org/jira/browse/SPARK-41861
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Assignee: Lorenzo Martini
>Priority: Trivial
>
> The `ScanBuilder` interface has a `build()` method that returns `Scan` 
> objects. All of its implementations return scans of the builder's own type: 
> e.g. `ParquetScanBuilder` returns a `ParquetScan`, `TextScanBuilder` a 
> `TextScan`, etc. However, in the overriding method declarations the return 
> type is not narrowed to the more specific type but left as the generic 
> `Scan`. This is problematic when working with scan objects because a manual 
> cast is required, even though we know for a fact that a `ParquetScanBuilder`'s 
> `build()` method returns a `ParquetScan`. For ease of development (and for 
> stricter correctness checks, as we presumably wouldn't want a 
> `ParquetScanBuilder` to return a non-Parquet scan), it would be very nice if 
> the `build()` method of each `ScanBuilder` implementation returned a more 
> strictly typed object instead of the generic `Scan`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41806) Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block ambiguous queries with static partitions columns

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41806:
---

Assignee: Allison Portis

> Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block 
> ambiguous queries with static partitions columns
> -
>
> Key: SPARK-41806
> URL: https://issues.apache.org/jira/browse/SPARK-41806
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Portis
>Assignee: Allison Portis
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, for INSERT INTO by name, we reorder the value list and convert it 
> to INSERT INTO by ordinal. Since the DSv2 logical nodes have the isByName 
> flag, we don't need to do this. The current approach is limiting in that:
>  # Users must provide the full list of table columns (this limits 
> functionality for features like generated columns; see SPARK-41290).
>  # It allows ambiguous queries such as INSERT OVERWRITE t PARTITION (c='1') 
> (c) VALUES ('2'), where the user provides the column 'c' both as a static 
> partition column and in the column list. We should check that the static 
> partition column is not in the column list.
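
As a rough illustration of the rewrite described above (the table name and columns
are made up; this is not the resolution code itself), an INSERT by name is currently
turned into an INSERT by ordinal by reordering the VALUES list to the table's column
order:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE t (a INT, b STRING) USING parquet")

# An INSERT INTO by name; today the analyzer reorders the VALUES list to the
# table's column order, so it is effectively treated as:
#   INSERT INTO t VALUES (1, 'x')
spark.sql("INSERT INTO t (b, a) VALUES ('x', 1)")
{code}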



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41806) Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block ambiguous queries with static partitions columns

2023-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41806.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39334
[https://github.com/apache/spark/pull/39334]

> Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block 
> ambiguous queries with static partitions columns
> -
>
> Key: SPARK-41806
> URL: https://issues.apache.org/jira/browse/SPARK-41806
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Portis
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, for INSERT INTO by name, we reorder the value list and convert it 
> to INSERT INTO by ordinal. Since the DSv2 logical nodes have the isByName 
> flag, we don't need to do this. The current approach is limiting in that:
>  # Users must provide the full list of table columns (this limits 
> functionality for features like generated columns; see SPARK-41290).
>  # It allows ambiguous queries such as INSERT OVERWRITE t PARTITION (c='1') 
> (c) VALUES ('2'), where the user provides the column 'c' both as a static 
> partition column and in the column list. We should check that the static 
> partition column is not in the column list.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41831) DataFrame.transform: Only Column or String can be used for projections

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655212#comment-17655212
 ] 

Apache Spark commented on SPARK-41831:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39417

> DataFrame.transform: Only Column or String can be used for projections
> --
>
> Key: SPARK-41831
> URL: https://issues.apache.org/jira/browse/SPARK-41831
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1168, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(cast_all_to_int).transform(sort_columns_asc).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", 
> line 1, in 
>         df.transform(cast_all_to_int).transform(sort_columns_asc).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", 
> line 2, in cast_all_to_int
>         return input_df.select([col(col_name).cast("int") for col_name in 
> input_df.columns])
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), 
> session=self._session)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 344, in __init__
>         self._verify_expressions()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can 
> be used for projections: '[Column<'(ColumnReference(int) (int))'>, 
> Column<'(ColumnReference(float) (int))'>]'.
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1179, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(add_n, 1).transform(add_n, n=10).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", 
> line 1, in 
>         df.transform(add_n, 1).transform(add_n, n=10).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", 
> line 2, in add_n
>         return input_df.select([(col(col_name) + n).alias(col_name)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), 
> session=self._session)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 344, in __init__
>         self._verify_expressions()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 
> 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can 
> be used for projections: '[Column<'Alias(+(ColumnReference(int), Literal(1)), 
> (int))'>, Column<'Alias(+(ColumnReference(float), Literal(1)), 
> (float))'>]'.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2023-01-05 Thread Eugene Chung (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655211#comment-17655211
 ] 

Eugene Chung commented on SPARK-39743:
--

I'm sorry to ask a question here, but I couldn't find the answer on the Internet.

Can I set the zstd compression level for ORC?

> Unable to set zstd compression level while writing parquet files
> 
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Assignee: ming95
>Priority: Minor
> Fix For: 3.4.0
>
>
> While writing zstd-compressed parquet files, the setting 
> `spark.io.compression.zstd.level` does not have any effect on the zstd 
> compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd CLI tool, we confirmed that compressing the same file tested 
> in Spark with a higher compression level resulted in a smaller file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41875) Throw proper errors in Dataset.to()

2023-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655206#comment-17655206
 ] 

Hyukjin Kwon commented on SPARK-41875:
--

Thanks [~beliefer]

> Throw proper errors in Dataset.to()
> ---
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41875) Throw proper errors in Dataset.to()

2023-01-05 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655203#comment-17655203
 ] 

jiaan.geng commented on SPARK-41875:


I will take a look!

> Throw proper errors in Dataset.to()
> ---
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41567) Move configuration of `versions-maven-plugin` to parent pom

2023-01-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41567:
-

Assignee: Yang Jie

> Move configuration of `versions-maven-plugin` to parent pom
> ---
>
> Key: SPARK-41567
> URL: https://issues.apache.org/jira/browse/SPARK-41567
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> In addition to `test-dependencies.sh`, `release-build.sh` and 
> `release-tag.sh` also use the `build/mvn versions:set` command, so moving the 
> configuration of `versions-maven-plugin` to the parent pom unifies the 
> `versions-maven-plugin` version used across the whole Spark project.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41567) Move configuration of `versions-maven-plugin` to parent pom

2023-01-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41567.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39118
[https://github.com/apache/spark/pull/39118]

> Move configuration of `versions-maven-plugin` to parent pom
> ---
>
> Key: SPARK-41567
> URL: https://issues.apache.org/jira/browse/SPARK-41567
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> In addition to `test-dependencies.sh`, `release-build.sh` and 
> `release-tag.sh` also use the `build/mvn versions:set` command, so moving the 
> configuration of `versions-maven-plugin` to the parent pom unifies the 
> `versions-maven-plugin` version used across the whole Spark project.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41827) DataFrame.groupBy requires all cols be Column or str

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41827.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39404
[https://github.com/apache/spark/pull/39404]

> DataFrame.groupBy requires all cols be Column or str
> 
>
> Key: SPARK-41827
> URL: https://issues.apache.org/jira/browse/SPARK-41827
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 302, in pyspark.sql.connect.dataframe.DataFrame.groupBy
> Failed example:
>     df.groupBy(["name", df.age]).count().sort("name", "age").show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", 
> line 1, in 
>         df.groupBy(["name", df.age]).count().sort("name", "age").show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 251, in groupBy
>         raise TypeError(
>     TypeError: groupBy requires all cols be Column or str, but got list 
> ['name', Column<'ColumnReference(age)'>]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41827) DataFrame.groupBy requires all cols be Column or str

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41827:


Assignee: Ruifeng Zheng

> DataFrame.groupBy requires all cols be Column or str
> 
>
> Key: SPARK-41827
> URL: https://issues.apache.org/jira/browse/SPARK-41827
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 302, in pyspark.sql.connect.dataframe.DataFrame.groupBy
> Failed example:
>     df.groupBy(["name", df.age]).count().sort("name", "age").show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", 
> line 1, in 
>         df.groupBy(["name", df.age]).count().sort("name", "age").show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 251, in groupBy
>         raise TypeError(
>     TypeError: groupBy requires all cols be Column or str, but got list 
> ['name', Column<'ColumnReference(age)'>]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41652) Test parity: pyspark.sql.tests.test_functions

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41652:


Assignee: Sandeep Singh  (was: Hyukjin Kwon)

> Test parity: pyspark.sql.tests.test_functions
> -
>
> Key: SPARK-41652
> URL: https://issues.apache.org/jira/browse/SPARK-41652
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
>
> After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse 
> the same test cases; see 
> {{python/pyspark/sql/tests/connect/test_parity_functions.py}}.
> We should remove all the test cases defined there and fix the Spark Connect 
> behaviours accordingly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41651) Test parity: pyspark.sql.tests.test_dataframe

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41651:


Assignee: Sandeep Singh  (was: Hyukjin Kwon)

> Test parity: pyspark.sql.tests.test_dataframe
> -
>
> Key: SPARK-41651
> URL: https://issues.apache.org/jira/browse/SPARK-41651
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
>
> After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse 
> the same test cases; see 
> {{python/pyspark/sql/tests/connect/test_parity_dataframe.py}}.
> We should remove all the test cases defined there and fix the Spark Connect 
> behaviours accordingly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41849) Implement DataFrameReader.text

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41849:


Assignee: Sandeep Singh

> Implement DataFrameReader.text
> --
>
> Key: SPARK-41849
> URL: https://issues.apache.org/jira/browse/SPARK-41849
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41849) Implement DataFrameReader.text

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41849.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39413
[https://github.com/apache/spark/pull/39413]

> Implement DataFrameReader.text
> --
>
> Key: SPARK-41849
> URL: https://issues.apache.org/jira/browse/SPARK-41849
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41677) Protobuf serializer for StreamingQueryProgressWrapper

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655194#comment-17655194
 ] 

Apache Spark commented on SPARK-41677:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39416

> Protobuf serializer for StreamingQueryProgressWrapper
> -
>
> Key: SPARK-41677
> URL: https://issues.apache.org/jira/browse/SPARK-41677
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41895) Add tests for streaming UI with RocksDB backend

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655191#comment-17655191
 ] 

Apache Spark commented on SPARK-41895:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39415

> Add tests for streaming UI with RocksDB backend
> ---
>
> Key: SPARK-41895
> URL: https://issues.apache.org/jira/browse/SPARK-41895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41895) Add tests for streaming UI with RocksDB backend

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41895:


Assignee: Apache Spark  (was: Gengliang Wang)

> Add tests for streaming UI with RocksDB backend
> ---
>
> Key: SPARK-41895
> URL: https://issues.apache.org/jira/browse/SPARK-41895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41895) Add tests for streaming UI with RocksDB backend

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41895:


Assignee: Gengliang Wang  (was: Apache Spark)

> Add tests for streaming UI with RocksDB backend
> ---
>
> Key: SPARK-41895
> URL: https://issues.apache.org/jira/browse/SPARK-41895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41892) Add JIRAs or messages for skipped messages

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41892.
--
Resolution: Fixed

Issue resolved by pull request 39412
[https://github.com/apache/spark/pull/39412]

> Add JIRAs or messages for skipped messages
> --
>
> Key: SPARK-41892
> URL: https://issues.apache.org/jira/browse/SPARK-41892
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41305) Connect Proto Completeness

2023-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41305:
-
Fix Version/s: (was: 3.4.0)

> Connect Proto Completeness
> --
>
> Key: SPARK-41305
> URL: https://issues.apache.org/jira/browse/SPARK-41305
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41893) Publish SBOM artifacts

2023-01-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41893.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39401
[https://github.com/apache/spark/pull/39401]

> Publish SBOM artifacts
> --
>
> Key: SPARK-41893
> URL: https://issues.apache.org/jira/browse/SPARK-41893
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41893) Publish SBOM artifacts

2023-01-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41893:
-

Assignee: Dongjoon Hyun

> Publish SBOM artifacts
> --
>
> Key: SPARK-41893
> URL: https://issues.apache.org/jira/browse/SPARK-41893
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41917) Support SSL and Auth token in connection channel for JVM/Scala Client

2023-01-05 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-41917:


 Summary: Support SSL and Auth token in connection channel for 
JVM/Scala Client
 Key: SPARK-41917
 URL: https://issues.apache.org/jira/browse/SPARK-41917
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.4.0
Reporter: Venkata Sai Akhil Gudesa






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41916) Address `spark.task.resource.gpu.amount > 1`

2023-01-05 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41916:


 Summary: Address `spark.task.resource.gpu.amount > 1`
 Key: SPARK-41916
 URL: https://issues.apache.org/jira/browse/SPARK-41916
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


We want the distributor to be able to run multiple torchrun processes per task 
if spark.task.resource.gpu.amount > 1.
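
A rough sketch of the idea is below (illustrative only, not the actual 
TorchDistributor implementation; the helper name and the one-process-per-GPU 
launch strategy are assumptions):
{code:python}
# Illustrative sketch only: when a Spark task owns several GPUs
# (spark.task.resource.gpu.amount > 1), launch one torchrun process per GPU.
# The helper name and launch strategy are assumptions, not Spark code.
import os
import subprocess

from pyspark import TaskContext


def run_torchrun_per_gpu(train_script: str) -> None:
    ctx = TaskContext.get()
    gpu_addresses = ctx.resources()["gpu"].addresses  # GPUs assigned to this task
    procs = []
    for addr in gpu_addresses:
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": addr}
        procs.append(
            subprocess.Popen(["torchrun", "--nproc_per_node=1", train_script], env=env)
        )
    for p in procs:
        p.wait()
{code}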



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41915) Change API so that the user doesn't have to explicitly set pytorch-lightning

2023-01-05 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41915:


 Summary: Change API so that the user doesn't have to explicitly 
set pytorch-lightning
 Key: SPARK-41915
 URL: https://issues.apache.org/jira/browse/SPARK-41915
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


Remove the `framework` parameter from the API and have cloudpickle 
automatically determine whether the user code depends on PyTorch Lightning.
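
A minimal sketch of one way this could work, assuming the cloudpickle payload 
of the user function carries the names of the modules it references 
(illustrative only, not the actual implementation):
{code:python}
# Illustrative sketch only: infer a PyTorch Lightning dependency from the
# serialized user function instead of requiring an explicit `framework` flag.
import cloudpickle


def uses_pytorch_lightning(train_fn) -> bool:
    # cloudpickle serializes the function together with the names of the
    # modules and globals it references, so the module name appears in the
    # payload when the user code depends on PyTorch Lightning.
    return b"pytorch_lightning" in cloudpickle.dumps(train_fn)
{code}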



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on

2023-01-05 Thread Enrico Minack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655135#comment-17655135
 ] 

Enrico Minack commented on SPARK-40588:
---

Unfortunately, this issue persists with Spark 3.4.0; I have created SPARK-41914 
to track it.

> Sorting issue with partitioned-writing and AQE turned on
> 
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Assignee: Enrico Minack
>Priority: Major
> Fix For: 3.2.3, 3.3.2
>
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition; however, we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) 
> fixes the issue.
> I'm working on identifying why AQE affects the sort order. Any leads or 
> thoughts would be appreciated!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41914) Sorting issue with partitioned-writing and planned write optimization disabled

2023-01-05 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-41914:
-

 Summary: Sorting issue with partitioned-writing and planned write 
optimization disabled
 Key: SPARK-41914
 URL: https://issues.apache.org/jira/browse/SPARK-41914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Enrico Minack


Spark 3.4.0 introduced option {{{}spark.sql.optimizer.plannedWrite.enabled{}}}, 
which is enabled by default. When disabled, partitioned writing loses 
in-partition order when spilling occurs.

This is related to SPARK-40885 where setting option 
{{spark.sql.optimizer.plannedWrite.enabled}} to {{true}} will remove the 
existing sort (for {{day}} and {{{}id{}}}) entirely.

Run this with 512m memory and one executor, e.g.:
{code}
spark-shell --driver-memory 512m --master "local[1]"
{code}
{code:scala}
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", false)

val ids = 200
val days = 2
val parts = 2

val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", 
"day").join(spark.range(0, ids, 1, parts))

ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .mode(SaveMode.Overwrite)
  .csv("interleaved.csv")
{code}
Check that the written files are sorted (prints OK for each file that is sorted):
{code:bash}
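# For each output file, compare its md5 with the md5 of its sorted content;
# `md5sum -c` then prints OK only for files that are already sorted.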
for file in interleaved.csv/day\=*/part-*
do
  echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
done | md5sum -c
{code}
Files should look like this
{code}
0
1
2
...
1048576
1048577
1048578
...
{code}
But they look like
{code}
0
1048576
1
1048577
2
1048578
...
{code}
The cause is the same as in SPARK-40588: a sort (for {{{}day{}}}) is added on 
top of the existing sort (for {{day}} and {{{}id{}}}), and spilling interleaves 
the sorted spill files.

{code}
Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0
+- AdaptiveSparkPlan isFinalPlan=false
   +- Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(day#2L, 200), REPARTITION_BY_COL, [plan_id=30]
         +- BroadcastNestedLoopJoin BuildLeft, Inner
            :- BroadcastExchange IdentityBroadcastMode, [plan_id=28]
            :  +- Project [id#0L AS day#2L]
            :     +- Range (0, 2, step=1, splits=2)
            +- Range (0, 200, step=1, splits=2)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41913) Add a linter rule to enforce transforming with pruning

2023-01-05 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-41913:
--

 Summary: Add a linter rule to enforce transforming with pruning
 Key: SPARK-41913
 URL: https://issues.apache.org/jira/browse/SPARK-41913
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang


Add a linter rule to enforce transforming Catalyst tree nodes with pruning. 
This ensures Catalyst rules are implemented efficiently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41633) Identify aggregation expression in the nodePatterns of PythonUDF

2023-01-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41633.

Resolution: Won't Do

> Identify aggregation expression in the nodePatterns of PythonUDF
> 
>
> Key: SPARK-41633
> URL: https://issues.apache.org/jira/browse/SPARK-41633
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> When a PythonUDF is evaluated as SQL_GROUPED_AGG_PANDAS_UDF, we can mark it 
> as AGGREGATE_EXPRESSION in the `nodePatterns`, so that we can conveniently 
> check whether an expression contains aggregation via 
> `expr.containsPattern(AGGREGATE_EXPRESSION)`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41912) Subquery should not validate CTE

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655103#comment-17655103
 ] 

Apache Spark commented on SPARK-41912:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/39414

> Subquery should not validate CTE
> 
>
> Key: SPARK-41912
> URL: https://issues.apache.org/jira/browse/SPARK-41912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41912) Subquery should not validate CTE

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41912:


Assignee: Apache Spark  (was: Rui Wang)

> Subquery should not validate CTE
> 
>
> Key: SPARK-41912
> URL: https://issues.apache.org/jira/browse/SPARK-41912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41912) Subquery should not validate CTE

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41912:


Assignee: Rui Wang  (was: Apache Spark)

> Subquery should not validate CTE
> 
>
> Key: SPARK-41912
> URL: https://issues.apache.org/jira/browse/SPARK-41912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41912) Subquery should not validate CTE

2023-01-05 Thread Rui Wang (Jira)
Rui Wang created SPARK-41912:


 Summary: Subquery should not validate CTE
 Key: SPARK-41912
 URL: https://issues.apache.org/jira/browse/SPARK-41912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Rui Wang
Assignee: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41755) Reorder fields to use consecutive field numbers

2023-01-05 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-41755:
-
Summary: Reorder fields to use consecutive field numbers  (was: Reorder the 
relation IDs)

> Reorder fields to use consecutive field numbers
> ---
>
> Key: SPARK-41755
> URL: https://issues.apache.org/jira/browse/SPARK-41755
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> Make the IDs consecutive:
> ```
> RepartitionByExpression repartition_by_expression = 27;
> // NA functions
> NAFill fill_na = 90;
> NADrop drop_na = 91;
> NAReplace replace = 92;
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41909) Update proto fields to use increasing field numbers and avoid holes

2023-01-05 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang resolved SPARK-41909.
--
Resolution: Duplicate

https://issues.apache.org/jira/browse/SPARK-41755

> Update proto fields to use increasing field numbers and avoid holes
> ---
>
> Key: SPARK-41909
> URL: https://issues.apache.org/jira/browse/SPARK-41909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41908) Catalog API refactoring

2023-01-05 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-41908:
-
Description: We may revisit the Catalog proto design and refactor it; this 
would be a breaking change.

> Catalog API refactoring
> ---
>
> Key: SPARK-41908
> URL: https://issues.apache.org/jira/browse/SPARK-41908
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>
> We may revisit the Catalog proto design and refactor it; this would be a 
> breaking change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41910) Remove `optional` notation in proto

2023-01-05 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-41910:
-
Description: Every field in proto3 has a default value. We should revisit the 
existing proto fields to understand whether the default value can be used 
without needing to distinguish a set field from an unset one, and remove 
`optional` as much as possible from the Spark Connect proto surface.
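
For illustration, consider a hypothetical generated message (not an actual 
Spark Connect proto) that shows what `optional` buys in proto3, namely explicit 
presence tracking via HasField:
{code:python}
# Hypothetical message, for illustration only:
#   message Sample { int32 plain_count = 1; optional int32 opt_count = 2; }
from sample_pb2 import Sample  # hypothetical module generated by protoc

msg = Sample()
msg.plain_count            # 0 -- cannot tell "unset" apart from an explicit 0
msg.HasField("opt_count")  # False -- presence is tracked for `optional` fields
msg.opt_count = 0
msg.HasField("opt_count")  # True -- explicitly set, even to the default value
{code}
If no caller needs to distinguish the default value from an unset field, the 
`optional` marker can be dropped.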

> Remove `optional` notation in proto
> ---
>
> Key: SPARK-41910
> URL: https://issues.apache.org/jira/browse/SPARK-41910
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>
> Every field in proto3 has a default value. We should revisit the existing 
> proto fields to understand whether the default value can be used without 
> needing to distinguish a set field from an unset one, and remove `optional` 
> as much as possible from the Spark Connect proto surface.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41911) Add version fields to Connect proto

2023-01-05 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-41911:
-
Description: We may need this to help maintain compatibility. Depending on the 
concrete protocol design, we may use field number 1 for the version fields, 
which may cause breaking changes to existing proto messages.

> Add version fields to Connect proto
> ---
>
> Key: SPARK-41911
> URL: https://issues.apache.org/jira/browse/SPARK-41911
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>
> We may need this to help maintain compatibility. Depending on the concrete 
> protocol design, we may use field number 1 for the version fields, which may 
> cause breaking changes to existing proto messages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41911) Add version fields to Connect proto

2023-01-05 Thread Rui Wang (Jira)
Rui Wang created SPARK-41911:


 Summary: Add version fields to Connect proto
 Key: SPARK-41911
 URL: https://issues.apache.org/jira/browse/SPARK-41911
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41910) Remove `optional` notation in proto

2023-01-05 Thread Rui Wang (Jira)
Rui Wang created SPARK-41910:


 Summary: Remove `optional` notation in proto
 Key: SPARK-41910
 URL: https://issues.apache.org/jira/browse/SPARK-41910
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41909) Update proto fields to use increasing field numbers and avoid holes

2023-01-05 Thread Rui Wang (Jira)
Rui Wang created SPARK-41909:


 Summary: Update proto fields to use increasing field numbers and 
avoid holes
 Key: SPARK-41909
 URL: https://issues.apache.org/jira/browse/SPARK-41909
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41908) Catalog API refactoring

2023-01-05 Thread Rui Wang (Jira)
Rui Wang created SPARK-41908:


 Summary: Catalog API refactoring
 Key: SPARK-41908
 URL: https://issues.apache.org/jira/browse/SPARK-41908
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41849) Implement DataFrameReader.text

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41849:


Assignee: Apache Spark

> Implement DataFrameReader.text
> --
>
> Key: SPARK-41849
> URL: https://issues.apache.org/jira/browse/SPARK-41849
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41849) Implement DataFrameReader.text

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655094#comment-17655094
 ] 

Apache Spark commented on SPARK-41849:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39413

> Implement DataFrameReader.text
> --
>
> Key: SPARK-41849
> URL: https://issues.apache.org/jira/browse/SPARK-41849
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41849) Implement DataFrameReader.text

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41849:


Assignee: (was: Apache Spark)

> Implement DataFrameReader.text
> --
>
> Key: SPARK-41849
> URL: https://issues.apache.org/jira/browse/SPARK-41849
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41849) Implement DataFrameReader.text

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655095#comment-17655095
 ] 

Apache Spark commented on SPARK-41849:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39413

> Implement DataFrameReader.text
> --
>
> Key: SPARK-41849
> URL: https://issues.apache.org/jira/browse/SPARK-41849
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41882) Add tests for SQLAppStatusStore with RocksDB Backend

2023-01-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41882.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39385
[https://github.com/apache/spark/pull/39385]

> Add tests for SQLAppStatusStore with RocksDB Backend
> 
>
> Key: SPARK-41882
> URL: https://issues.apache.org/jira/browse/SPARK-41882
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41882) Add tests for SQLAppStatusStore with RocksDB Backend

2023-01-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-41882:
--

Assignee: Yang Jie

> Add tests for SQLAppStatusStore with RocksDB Backend
> 
>
> Key: SPARK-41882
> URL: https://issues.apache.org/jira/browse/SPARK-41882
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41892) Add JIRAs or messages for skipped messages

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655081#comment-17655081
 ] 

Apache Spark commented on SPARK-41892:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39412

> Add JIRAs or messages for skipped messages
> --
>
> Key: SPARK-41892
> URL: https://issues.apache.org/jira/browse/SPARK-41892
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41892) Add JIRAs or messages for skipped messages

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41892:


Assignee: Sandeep Singh  (was: Apache Spark)

> Add JIRAs or messages for skipped messages
> --
>
> Key: SPARK-41892
> URL: https://issues.apache.org/jira/browse/SPARK-41892
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41892) Add JIRAs or messages for skipped messages

2023-01-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41892:


Assignee: Apache Spark  (was: Sandeep Singh)

> Add JIRAs or messages for skipped messages
> --
>
> Key: SPARK-41892
> URL: https://issues.apache.org/jira/browse/SPARK-41892
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41892) Add JIRAs or messages for skipped messages

2023-01-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655082#comment-17655082
 ] 

Apache Spark commented on SPARK-41892:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39412

> Add JIRAs or messages for skipped messages
> --
>
> Key: SPARK-41892
> URL: https://issues.apache.org/jira/browse/SPARK-41892
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41907) Function `sampleby` return parity

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41907:
-

 Summary: Function `sampleby` return parity
 Key: SPARK-41907
 URL: https://issues.apache.org/jira/browse/SPARK-41907
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.df
from pyspark.sql import functions

rnd = df.select("key", functions.rand()).collect()
for row in rnd:
assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
rndn = df.select("key", functions.randn(5)).collect()
for row in rndn:
assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]

# If the specified seed is 0, we should use it.
# https://issues.apache.org/jira/browse/SPARK-9691
rnd1 = df.select("key", functions.rand(0)).collect()
rnd2 = df.select("key", functions.rand(0)).collect()
self.assertEqual(sorted(rnd1), sorted(rnd2))

rndn1 = df.select("key", functions.randn(0)).collect()
rndn2 = df.select("key", functions.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 299, in test_rand_functions
rnd = df.select("key", functions.rand()).collect()
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2917, in select
jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2537, in _jcols
return self._jseq(cols, _to_java_column)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2524, in _jseq
return _to_seq(self.sparkSession._sc, cols, converter)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in _to_seq
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in 
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
65, in _to_java_column
raise TypeError(
TypeError: Invalid argument, not a string or column: Column<'rand()'> of type 
. For column literals, use 'lit', 
'array', 'struct' or 'create_map' function.
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41907) Function `sampleby` return parity

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41907:
--
Description: 
{code:java}
df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
self.assertTrue(sampled.count() == 35){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 202, in test_sampleby
self.assertTrue(sampled.count() == 35)
AssertionError: False is not true {code}

  was:
{code:java}
df = self.df
from pyspark.sql import functions

rnd = df.select("key", functions.rand()).collect()
for row in rnd:
assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
rndn = df.select("key", functions.randn(5)).collect()
for row in rndn:
assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]

# If the specified seed is 0, we should use it.
# https://issues.apache.org/jira/browse/SPARK-9691
rnd1 = df.select("key", functions.rand(0)).collect()
rnd2 = df.select("key", functions.rand(0)).collect()
self.assertEqual(sorted(rnd1), sorted(rnd2))

rndn1 = df.select("key", functions.randn(0)).collect()
rndn2 = df.select("key", functions.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 299, in test_rand_functions
rnd = df.select("key", functions.rand()).collect()
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2917, in select
jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2537, in _jcols
return self._jseq(cols, _to_java_column)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2524, in _jseq
return _to_seq(self.sparkSession._sc, cols, converter)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in _to_seq
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in 
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
65, in _to_java_column
raise TypeError(
TypeError: Invalid argument, not a string or column: Column<'rand()'> of type 
. For column literals, use 'lit', 
'array', 'struct' or 'create_map' function.
{code}


> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41906) Handle Function `rand() `

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41906:
--
Description: 
{code:java}
df = self.df
from pyspark.sql import functions

rnd = df.select("key", functions.rand()).collect()
for row in rnd:
assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
rndn = df.select("key", functions.randn(5)).collect()
for row in rndn:
assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]

# If the specified seed is 0, we should use it.
# https://issues.apache.org/jira/browse/SPARK-9691
rnd1 = df.select("key", functions.rand(0)).collect()
rnd2 = df.select("key", functions.rand(0)).collect()
self.assertEqual(sorted(rnd1), sorted(rnd2))

rndn1 = df.select("key", functions.randn(0)).collect()
rndn2 = df.select("key", functions.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 299, in test_rand_functions
rnd = df.select("key", functions.rand()).collect()
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2917, in select
jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2537, in _jcols
return self._jseq(cols, _to_java_column)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2524, in _jseq
return _to_seq(self.sparkSession._sc, cols, converter)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in _to_seq
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in 
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
65, in _to_java_column
raise TypeError(
TypeError: Invalid argument, not a string or column: Column<'rand()'> of type 
. For column literals, use 'lit', 
'array', 'struct' or 'create_map' function.
{code}

  was:
{code:java}
df = self.spark.createDataFrame(
[
(
[1, 2, 3],
2,
2,
),
(
[4, 5],
2,
2,
),
],
["x", "index", "len"],
)

expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
self.assertTrue(
all(
[
df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected,
df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == 
expected,
df.select(slice("x", "index", "len").alias("sliced")).collect() == 
expected,
]
)
)

self.assertEqual(
df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
[Row(sliced=[2]), Row(sliced=[4])],
)
self.assertEqual(
df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
[Row(sliced=[1, 2]), Row(sliced=[4])],
){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 596, in test_slice
df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
332, in wrapped
return getattr(functions, f.__name__)(*args, **kwargs)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1525, in slice
raise TypeError(f"start should be a Column or int, but got 
{type(start).__name__}")
TypeError: start should be a Column or int, but got str{code}


> Handle Function `rand() `
> -
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
> assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
> assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 29
