[jira] [Assigned] (SPARK-41455) Resolve dtypes inconsistencies of date/timestamp functions
[ https://issues.apache.org/jira/browse/SPARK-41455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41455:
------------------------------------

    Assignee: (was: Apache Spark)

> Resolve dtypes inconsistencies of date/timestamp functions
> ----------------------------------------------------------
>
> Key: SPARK-41455
> URL: https://issues.apache.org/jira/browse/SPARK-41455
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Priority: Major
>
> When implementing date/timestamp functions, we noticed dtypes inconsistent with PySpark, as shown below.
> {code:python}
> >>> sdf.select(SF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns]
> dtype: object
> >>> cdf.select(CF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns, America/Los_Angeles]
> {code}
> Affected functions include:
> {code:python}
> to_timestamp, from_utc_timestamp, to_utc_timestamp, timestamp_seconds, current_timestamp, date_trunc
> {code}
> We may have to implement `is_timestamp_ntz_preferred` for Connect.
> After the fix, tests of those date/timestamp functions which use `compare_by_show` should be switched to a `toPandas` comparison.
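A minimal pandas-only sketch of the reported mismatch and a client-side normalization (the values are illustrative, not from the issue; only the dtypes matter):

{code:python}
import pandas as pd

# PySpark's toPandas() yields a tz-naive dtype; Connect's yields a tz-aware one.
naive = pd.Series(pd.to_datetime(["2023-01-04 12:00:00"]))  # datetime64[ns]
aware = naive.dt.tz_localize("America/Los_Angeles")         # datetime64[ns, America/Los_Angeles]
assert str(naive.dtype) != str(aware.dtype)

# Dropping the timezone makes the two results comparable again.
assert (aware.dt.tz_localize(None) == naive).all()
{code}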
[jira] [Assigned] (SPARK-41455) Resolve dtypes inconsistencies of date/timestamp functions
[ https://issues.apache.org/jira/browse/SPARK-41455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41455:
------------------------------------

    Assignee: Apache Spark

> Resolve dtypes inconsistencies of date/timestamp functions
> ----------------------------------------------------------
>
> Key: SPARK-41455
> URL: https://issues.apache.org/jira/browse/SPARK-41455
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Apache Spark
> Priority: Major
>
> When implementing date/timestamp functions, we noticed dtypes inconsistent with PySpark, as shown below.
> {code:python}
> >>> sdf.select(SF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns]
> dtype: object
> >>> cdf.select(CF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns, America/Los_Angeles]
> {code}
> Affected functions include:
> {code:python}
> to_timestamp, from_utc_timestamp, to_utc_timestamp, timestamp_seconds, current_timestamp, date_trunc
> {code}
> We may have to implement `is_timestamp_ntz_preferred` for Connect.
> After the fix, tests of those date/timestamp functions which use `compare_by_show` should be switched to a `toPandas` comparison.
[jira] [Commented] (SPARK-41455) Resolve dtypes inconsistencies of date/timestamp functions
[ https://issues.apache.org/jira/browse/SPARK-41455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655277#comment-17655277 ]

Apache Spark commented on SPARK-41455:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39426

> Resolve dtypes inconsistencies of date/timestamp functions
> ----------------------------------------------------------
>
> Key: SPARK-41455
> URL: https://issues.apache.org/jira/browse/SPARK-41455
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Priority: Major
>
> When implementing date/timestamp functions, we noticed dtypes inconsistent with PySpark, as shown below.
> {code:python}
> >>> sdf.select(SF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns]
> dtype: object
> >>> cdf.select(CF.current_timestamp()).toPandas().dtypes
> current_timestamp()    datetime64[ns, America/Los_Angeles]
> {code}
> Affected functions include:
> {code:python}
> to_timestamp, from_utc_timestamp, to_utc_timestamp, timestamp_seconds, current_timestamp, date_trunc
> {code}
> We may have to implement `is_timestamp_ntz_preferred` for Connect.
> After the fix, tests of those date/timestamp functions which use `compare_by_show` should be switched to a `toPandas` comparison.
[jira] [Resolved] (SPARK-41905) Function `slice` should handle string in params
[ https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41905.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39420
[https://github.com/apache/spark/pull/39420]

> Function `slice` should handle string in params
> ------------------------------------------------
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
>
> {code:python}
> df = self.spark.createDataFrame(
>     [
>         ([1, 2, 3], 2, 2),
>         ([4, 5], 2, 2),
>     ],
>     ["x", "index", "len"],
> )
>
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
>     all(
>         [
>             df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected,
>             df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == expected,
>             df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
>         ]
>     )
> )
>
> self.assertEqual(
>     df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
>     [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
>     df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
>     [Row(sliced=[1, 2]), Row(sliced=[4])],
> )
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 596, in test_slice
>     df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 332, in wrapped
>     return getattr(functions, f.__name__)(*args, **kwargs)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1525, in slice
>     raise TypeError(f"start should be a Column or int, but got {type(start).__name__}")
> TypeError: start should be a Column or int, but got str
> {code}
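A hedged sketch of the coercion the fix calls for, using a hypothetical `_coerce_slice_arg` helper rather than the actual Connect internals: a column name should be resolved before non-Column types are rejected.

{code:python}
from typing import Union
from pyspark.sql import Column
from pyspark.sql.functions import col, lit

def _coerce_slice_arg(arg: Union[Column, str, int]) -> Column:
    # Resolve a column name first, mirroring what plain PySpark's `slice` tolerates,
    # so only genuinely invalid argument types raise TypeError.
    if isinstance(arg, str):
        return col(arg)
    if isinstance(arg, int):
        return lit(arg)
    if isinstance(arg, Column):
        return arg
    raise TypeError(f"should be a Column, str or int, but got {type(arg).__name__}")
{code}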
[jira] [Resolved] (SPARK-41921) Enable doctests in connect.column and connect.functions
[ https://issues.apache.org/jira/browse/SPARK-41921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41921.
----------------------------------
    Resolution: Fixed

Issue resolved by pull request 39423
[https://github.com/apache/spark/pull/39423]

> Enable doctests in connect.column and connect.functions
> --------------------------------------------------------
>
> Key: SPARK-41921
> URL: https://issues.apache.org/jira/browse/SPARK-41921
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-41905) Function `slice` should handle string in params
[ https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41905:
------------------------------------

    Assignee: Hyukjin Kwon

> Function `slice` should handle string in params
> ------------------------------------------------
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Hyukjin Kwon
> Priority: Major
>
> {code:python}
> df = self.spark.createDataFrame(
>     [
>         ([1, 2, 3], 2, 2),
>         ([4, 5], 2, 2),
>     ],
>     ["x", "index", "len"],
> )
>
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
>     all(
>         [
>             df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected,
>             df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == expected,
>             df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
>         ]
>     )
> )
>
> self.assertEqual(
>     df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
>     [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
>     df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
>     [Row(sliced=[1, 2]), Row(sliced=[4])],
> )
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 596, in test_slice
>     df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 332, in wrapped
>     return getattr(functions, f.__name__)(*args, **kwargs)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1525, in slice
>     raise TypeError(f"start should be a Column or int, but got {type(start).__name__}")
> TypeError: start should be a Column or int, but got str
> {code}
[jira] [Assigned] (SPARK-41906) Handle Function `rand()`
[ https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-41906:
-------------------------------------

    Assignee: Hyukjin Kwon

> Handle Function `rand()`
> -------------------------
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
>
> {code:python}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
>     assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
>     assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
>
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
>
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2))
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions
>     rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select
>     jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols
>     return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq
>     return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq
>     cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in <listcomp>
>     cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column
>     raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
> {code}
[jira] [Resolved] (SPARK-41906) Handle Function `rand()`
[ https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-41906.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39421
[https://github.com/apache/spark/pull/39421]

> Handle Function `rand()`
> -------------------------
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
>
> {code:python}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
>     assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
>     assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
>
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
>
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2))
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions
>     rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select
>     jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols
>     return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq
>     return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq
>     cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in <listcomp>
>     cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column
>     raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
> {code}
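Once `select()` accepts the Column that `rand()` returns, the plain-PySpark idiom from the test should pass unchanged; a condensed sketch, assuming an active `spark` session:

{code:python}
from pyspark.sql import functions as F

df = spark.range(10).withColumnRenamed("id", "key")
rows = df.select("key", F.rand(0)).collect()  # select() must accept the Column rand() returns
assert all(0.0 <= r[1] <= 1.0 for r in rows)

# A fixed seed must be honored, including seed 0 (see SPARK-9691):
assert sorted(df.select(F.rand(0)).collect()) == sorted(df.select(F.rand(0)).collect())
{code}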
[jira] [Resolved] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument
[ https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-41869.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39418
[https://github.com/apache/spark/pull/39418]

> DataFrame dropDuplicates should throw error on non list argument
> -----------------------------------------------------------------
>
> Key: SPARK-41869
> URL: https://issues.apache.org/jira/browse/SPARK-41869
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
>
> {code:python}
> df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", "age"])
>
> # shouldn't drop a non-null row
> self.assertEqual(df.dropDuplicates().count(), 2)
> self.assertEqual(df.dropDuplicates(["name"]).count(), 1)
> self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)
>
> type_error_msg = "Parameter 'subset' must be a list of columns"
> with self.assertRaisesRegex(TypeError, type_error_msg):
>     df.dropDuplicates("name")
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 128, in test_drop_duplicates
>     with self.assertRaisesRegex(TypeError, type_error_msg):
> AssertionError: TypeError not raised
> {code}
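A hedged sketch of the missing guard (a hypothetical helper, not the actual patch): a bare string is iterable, so without an explicit check `dropDuplicates("name")` is silently treated as a sequence of one-character column names instead of raising.

{code:python}
from typing import List, Optional

def _check_subset(subset: Optional[List[str]]) -> None:
    # Reject anything that is not a real list/tuple of column names;
    # str in particular must not slip through as an iterable.
    if subset is not None and not isinstance(subset, (list, tuple)):
        raise TypeError("Parameter 'subset' must be a list of columns")
{code}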
[jira] [Assigned] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument
[ https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-41869:
-------------------------------------

    Assignee: Hyukjin Kwon

> DataFrame dropDuplicates should throw error on non list argument
> -----------------------------------------------------------------
>
> Key: SPARK-41869
> URL: https://issues.apache.org/jira/browse/SPARK-41869
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.4.0
>
> {code:python}
> df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", "age"])
>
> # shouldn't drop a non-null row
> self.assertEqual(df.dropDuplicates().count(), 2)
> self.assertEqual(df.dropDuplicates(["name"]).count(), 1)
> self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)
>
> type_error_msg = "Parameter 'subset' must be a list of columns"
> with self.assertRaisesRegex(TypeError, type_error_msg):
>     df.dropDuplicates("name")
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 128, in test_drop_duplicates
>     with self.assertRaisesRegex(TypeError, type_error_msg):
> AssertionError: TypeError not raised
> {code}
[jira] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743 ]

ming95 deleted comment on SPARK-39743:
---------------------------------------

was (Author: zing): [~euigeun_chung] see this Jira: https://issues.apache.org/jira/browse/SPARK-33978

> Unable to set zstd compression level while writing parquet files
> -----------------------------------------------------------------
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Yeachan Park
> Assignee: ming95
> Priority: Minor
> Fix For: 3.4.0
>
> While writing zstd-compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the zstd compression level.
> All files seem to be written with the default zstd compression level, and the config option seems to be ignored.
> Using the zstd CLI tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file.
[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655244#comment-17655244 ]

ming95 commented on SPARK-39743:
--------------------------------

[~euigeun_chung] see this Jira: https://issues.apache.org/jira/browse/SPARK-33978

> Unable to set zstd compression level while writing parquet files
> -----------------------------------------------------------------
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Yeachan Park
> Assignee: ming95
> Priority: Minor
> Fix For: 3.4.0
>
> While writing zstd-compressed parquet files, the setting `spark.io.compression.zstd.level` does not have any effect on the zstd compression level.
> All files seem to be written with the default zstd compression level, and the config option seems to be ignored.
> Using the zstd CLI tool, we confirmed that setting a higher compression level for the same file tested in Spark resulted in a smaller file.
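For readers hitting the same problem, a hedged sketch of the distinction (the Parquet-level key below is an assumption based on parquet-hadoop's codec options, not something confirmed in this thread):

{code:python}
# `spark.io.compression.zstd.level` tunes Spark-internal I/O compression
# (e.g. shuffle and broadcast blocks), not Parquet data files. The zstd level
# for Parquet output is read from the Hadoop conf by parquet-hadoop
# (assumed key: parquet.compression.codec.zstd.level).
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.compression.codec.zstd.level", "9"  # assumed parquet-hadoop option
)
df = spark.range(10**6).selectExpr("id", "id % 7 AS k")
df.write.mode("overwrite").parquet("/tmp/zstd_level_9")
{code}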
[jira] [Commented] (SPARK-41538) Metadata column should be appended at the end of project list
[ https://issues.apache.org/jira/browse/SPARK-41538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655243#comment-17655243 ]

Apache Spark commented on SPARK-41538:
--------------------------------------

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39425

> Metadata column should be appended at the end of project list
> --------------------------------------------------------------
>
> Key: SPARK-41538
> URL: https://issues.apache.org/jira/browse/SPARK-41538
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Fix For: 3.3.2, 3.4.0
>
> For the following query:
> {code:sql}
> CREATE TABLE table_1 (
>   a ARRAY,
>   s STRUCT)
> USING parquet;
>
> CREATE VIEW view_1 (id)
> AS WITH source AS (
>     SELECT * FROM table_1
> ),
> renamed AS (
>     SELECT s.id FROM source
> )
> SELECT id FROM renamed;
>
> WITH foo AS (
>   SELECT 'a' AS id
> ),
> bar AS (
>   SELECT 'a' AS id
> )
> SELECT 1
> FROM foo
> FULL OUTER JOIN bar USING(id)
> FULL OUTER JOIN view_1 USING(id)
> WHERE foo.id IS NOT NULL
> {code}
> there will be the following error:
> {code}
> class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
> java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:114)
>   at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:193)
>   at org.apache.spark.sql.catalyst.expressions.AliasHelper$$anonfun$getAliasMap$1.applyOrElse(AliasHelper.scala:50)
>   at org.apache.spark.sql.catalyst.expressions.AliasHelper$$anonfun$getAliasMap$1.applyOrElse(AliasHelper.scala:50)
>   at scala.collection.immutable.List.collect(List.scala:315)
>   at org.apache.spark.sql.catalyst.expressions.AliasHelper.getAliasMap(AliasHelper.scala:50)
>   at org.apache.spark.sql.catalyst.expressions.AliasHelper.getAliasMap$(AliasHelper.scala:47)
>   at org.apache.spark.sql.catalyst.optimizer.CollapseProject$.getAliasMap(Optimizer.scala:992)
>   at org.apache.spark.sql.catalyst.optimizer.CollapseProject$.canCollapseExpressions(Optimizer.scala:1029)
> {code}
> This is caused by inconsistent metadata column positions in the following two nodes:
> * Table relation: at the ending position
> * Project list: at the beginning position
>
> When the InlineCTE rule executes, the metadata column in the project list is wrongly combined with the table output.
[jira] [Assigned] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-41708:
-----------------------------------

    Assignee: XiDuo You

> Pull v1write information to WriteFiles
> ---------------------------------------
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.4.0
>
> Make WriteFiles hold v1 write information
[jira] [Resolved] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-41708.
---------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39277
[https://github.com/apache/spark/pull/39277]

> Pull v1write information to WriteFiles
> ---------------------------------------
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: XiDuo You
> Priority: Major
> Fix For: 3.4.0
>
> Make WriteFiles hold v1 write information
[jira] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818 ]

Sandeep Singh deleted comment on SPARK-41818:
----------------------------------------------

was (Author: techaddict): Could be moved under https://issues.apache.org/jira/browse/SPARK-41279

> Support DataFrameWriter.saveAsTable
> ------------------------------------
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 369, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
>     df.write.saveAsTable("tblA")
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[2]>", line 1, in <module>
>         df.write.saveAsTable("tblA")
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 350, in saveAsTable
>         self._spark.client.execute_command(self._write.command(self._spark.client))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command
>         self._execute(req)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute
>         self._handle_error(rpc_error)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: (java.lang.ClassNotFoundException) .DefaultSource
> {code}
[jira] [Commented] (SPARK-41921) Enable doctests in connect.column and connect.functions
[ https://issues.apache.org/jira/browse/SPARK-41921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655234#comment-17655234 ]

Apache Spark commented on SPARK-41921:
--------------------------------------

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39423

> Enable doctests in connect.column and connect.functions
> --------------------------------------------------------
>
> Key: SPARK-41921
> URL: https://issues.apache.org/jira/browse/SPARK-41921
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-41921) Enable doctests in connect.column and connect.functions
[ https://issues.apache.org/jira/browse/SPARK-41921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41921:
------------------------------------

    Assignee: Sandeep Singh (was: Apache Spark)

> Enable doctests in connect.column and connect.functions
> --------------------------------------------------------
>
> Key: SPARK-41921
> URL: https://issues.apache.org/jira/browse/SPARK-41921
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-41921) Enable doctests in connect.column and connect.functions
[ https://issues.apache.org/jira/browse/SPARK-41921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41921:
------------------------------------

    Assignee: Apache Spark (was: Sandeep Singh)

> Enable doctests in connect.column and connect.functions
> --------------------------------------------------------
>
> Key: SPARK-41921
> URL: https://issues.apache.org/jira/browse/SPARK-41921
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Apache Spark
> Priority: Major
> Fix For: 3.4.0
[jira] [Created] (SPARK-41921) Enable doctests in connect.column and connect.functions
Sandeep Singh created SPARK-41921:
-------------------------------------

             Summary: Enable doctests in connect.column and connect.functions
                 Key: SPARK-41921
                 URL: https://issues.apache.org/jira/browse/SPARK-41921
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh
            Assignee: Sandeep Singh
             Fix For: 3.4.0
[jira] [Commented] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655233#comment-17655233 ]

Apache Spark commented on SPARK-41875:
--------------------------------------

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39422

> Throw proper errors in Dataset.to()
> ------------------------------------
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:python}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
>
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
>
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
>
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
>
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> )
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by <lambda>
> {code}
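A condensed sketch of the nullability check the test expects, assuming an active `spark` session and the 3.4 `DataFrame.to()` API:

{code:python}
from pyspark.sql.types import StructType, StructField, LongType
from pyspark.sql.utils import AnalysisException

df = spark.createDataFrame([("a", 1)], ["i", "j"])
target = StructType([StructField("j", LongType(), False)])  # non-nullable target field
try:
    df.to(target)  # nullable source column into non-nullable field
    raise AssertionError("expected AnalysisException")
except AnalysisException as e:
    assert "NULLABLE_COLUMN_OR_FIELD" in str(e)
{code}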
[jira] [Assigned] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41875:
------------------------------------

    Assignee: Apache Spark

> Throw proper errors in Dataset.to()
> ------------------------------------
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Apache Spark
> Priority: Major
>
> {code:python}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
>
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
>
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
>
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
>
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> )
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by <lambda>
> {code}
[jira] [Assigned] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41875:
------------------------------------

    Assignee: (was: Apache Spark)

> Throw proper errors in Dataset.to()
> ------------------------------------
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:python}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
>
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
>
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
>
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
>
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> )
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by <lambda>
> {code}
[jira] [Resolved] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates
[ https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-41162.
---------------------------------
    Fix Version/s: 3.2.4
                   3.3.2
                   3.4.0
       Resolution: Fixed

Issue resolved by pull request 39409
[https://github.com/apache/spark/pull/39409]

> Anti-join must not be pushed below aggregation with ambiguous predicates
> -------------------------------------------------------------------------
>
> Key: SPARK-41162
> URL: https://issues.apache.org/jira/browse/SPARK-41162
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.3.1, 3.2.3, 3.4.0
> Reporter: Enrico Minack
> Assignee: Enrico Minack
> Priority: Major
> Labels: correctness
> Fix For: 3.2.4, 3.3.2, 3.4.0
>
> The following query should return a single row, as all values for {{id}} except the largest will be eliminated by the anti-join:
> {code}
> val ids = Seq(1, 2, 3).toDF("id").distinct()
> val result = ids.withColumn("id", $"id" + 1).join(ids, "id", "left_anti").collect()
> assert(result.length == 1)
> {code}
> Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the assertion should still hold but is false.
> Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never be true.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
> !Join LeftAnti, (id#752 = id#750)                  'Aggregate [id#750], [(id#750 + 1) AS id#752]
> !:- Aggregate [id#750], [(id#750 + 1) AS id#752]   +- 'Join LeftAnti, ((id#750 + 1) = id#750)
> !:  +- LocalRelation [id#750]                         :- LocalRelation [id#750]
> !+- Aggregate [id#750], [id#750]                      +- Aggregate [id#750], [id#750]
> !   +- LocalRelation [id#750]                            +- LocalRelation [id#750]
> {code}
> The optimizer then rightly removes the left-anti join altogether, returning only the left child.
> Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that reference both the left *and* the right child.
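The same repro in PySpark terms (a sketch, assuming an active `spark` session):

{code:python}
from pyspark.sql.functions import col

ids = spark.createDataFrame([(1,), (2,), (3,)], ["id"]).distinct()
shifted = ids.withColumn("id", col("id") + 1)             # ids become 2, 3, 4
result = shifted.join(ids, "id", "left_anti").collect()   # only 4 has no match
assert len(result) == 1  # fails on affected versions because of the bad pushdown
{code}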
[jira] [Assigned] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates
[ https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-41162:
-----------------------------------

    Assignee: Enrico Minack

> Anti-join must not be pushed below aggregation with ambiguous predicates
> -------------------------------------------------------------------------
>
> Key: SPARK-41162
> URL: https://issues.apache.org/jira/browse/SPARK-41162
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.3, 3.3.1, 3.2.3, 3.4.0
> Reporter: Enrico Minack
> Assignee: Enrico Minack
> Priority: Major
> Labels: correctness
>
> The following query should return a single row, as all values for {{id}} except the largest will be eliminated by the anti-join:
> {code}
> val ids = Seq(1, 2, 3).toDF("id").distinct()
> val result = ids.withColumn("id", $"id" + 1).join(ids, "id", "left_anti").collect()
> assert(result.length == 1)
> {code}
> Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the assertion should still hold but is false.
> Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never be true.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
> !Join LeftAnti, (id#752 = id#750)                  'Aggregate [id#750], [(id#750 + 1) AS id#752]
> !:- Aggregate [id#750], [(id#750 + 1) AS id#752]   +- 'Join LeftAnti, ((id#750 + 1) = id#750)
> !:  +- LocalRelation [id#750]                         :- LocalRelation [id#750]
> !+- Aggregate [id#750], [id#750]                      +- Aggregate [id#750], [id#750]
> !   +- LocalRelation [id#750]                            +- LocalRelation [id#750]
> {code}
> The optimizer then rightly removes the left-anti join altogether, returning only the left child.
> Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that reference both the left *and* the right child.
[jira] [Created] (SPARK-41920) Task that throws an exception calls cleanUpAllAllocatedMemory, causing an NPE
Yi Zhu created SPARK-41920:
------------------------------

             Summary: Task that throws an exception calls cleanUpAllAllocatedMemory, causing an NPE
                 Key: SPARK-41920
                 URL: https://issues.apache.org/jira/browse/SPARK-41920
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1
            Reporter: Yi Zhu

{code}
23/01/03 21:41:18 INFO SortBasedPusher: Pushdata is not empty , do push.
Traceback (most recent call last):
  File "/mnt/ssd/0/yarn/nm-local-dir/usercache/rcmd_feature/appcache/application_1671694574014_2488441/container_e260_1671694574014_2488441_01_000107/pyspark.zip/pyspark/daemon.py", line 186, in manager
  File "/mnt/ssd/0/yarn/nm-local-dir/usercache/rcmd_feature/appcache/application_1671694574014_2488441/container_e260_1671694574014_2488441_01_000107/pyspark.zip/pyspark/daemon.py", line 74, in worker
  File "/mnt/ssd/0/yarn/nm-local-dir/usercache/rcmd_feature/appcache/application_1671694574014_2488441/container_e260_1671694574014_2488441_01_000107/pyspark.zip/pyspark/worker.py", line 643, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/mnt/ssd/0/yarn/nm-local-dir/usercache/rcmd_feature/appcache/application_1671694574014_2488441/container_e260_1671694574014_2488441_01_000107/pyspark.zip/pyspark/serializers.py", line 564, in read_int
    raise EOFError
EOFError
23/01/03 21:41:29 ERROR Executor: Exception in task 605.1 in stage 94.0 (TID 58026)
java.lang.NullPointerException
  at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:399)
  at org.apache.spark.shuffle.rss.SortBasedPusher.pushData(SortBasedPusher.java:155)
  at org.apache.spark.shuffle.rss.SortBasedPusher.spill(SortBasedPusher.java:317)
  at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:177)
  at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:289)
  at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:116)
  at org.apache.spark.sql.execution.python.HybridRowQueue.createNewQueue(RowQueue.scala:227)
  at org.apache.spark.sql.execution.python.HybridRowQueue.add(RowQueue.scala:250)
  at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:125)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
  at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1159)
  at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1174)
  at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1212)
  at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1215)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
  at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.writeIteratorToStream(PythonUDFRunner.scala:53)
  at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066)
  at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
{code}
[jira] [Resolved] (SPARK-41912) Subquery should not validate CTE
[ https://issues.apache.org/jira/browse/SPARK-41912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-41912.
---------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39414
[https://github.com/apache/spark/pull/39414]

> Subquery should not validate CTE
> ---------------------------------
>
> Key: SPARK-41912
> URL: https://issues.apache.org/jira/browse/SPARK-41912
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-41831) DataFrame.transform: Only Column or String can be used for projections
[ https://issues.apache.org/jira/browse/SPARK-41831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-41831:
-------------------------------------

    Assignee: Ruifeng Zheng

> DataFrame.transform: Only Column or String can be used for projections
> -----------------------------------------------------------------------
>
> Key: SPARK-41831
> URL: https://issues.apache.org/jira/browse/SPARK-41831
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.4.0
>
> {code}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1168, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(cast_all_to_int).transform(sort_columns_asc).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in <module>
>         df.transform(cast_all_to_int).transform(sort_columns_asc).show()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", line 2, in cast_all_to_int
>         return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), session=self._session)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 344, in __init__
>         self._verify_expressions()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can be used for projections: '[Column<'(ColumnReference(int) (int))'>, Column<'(ColumnReference(float) (int))'>]'.
> **********************************************************************
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1179, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(add_n, 1).transform(add_n, n=10).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in <module>
>         df.transform(add_n, 1).transform(add_n, n=10).show()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", line 2, in add_n
>         return input_df.select([(col(col_name) + n).alias(col_name)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), session=self._session)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 344, in __init__
>         self._verify_expressions()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can be used for projections: '[Column<'Alias(+(ColumnReference(int), Literal(1)), (int))'>, Column<'Alias(+(ColumnReference(float), Literal(1)), (float))'>]'.
> {code}
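A sketch of the failing pattern and a workaround (assuming an active `spark` session): plain PySpark's `select()` accepts both a list and unpacked columns, which is what the fix brings to Connect.

{code:python}
from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 1.0)], ["int", "float"])
cols = [col(c).cast("int") for c in df.columns]
df.select(cols).show()    # list form: rejected by Connect before the fix
df.select(*cols).show()   # unpacked form: worked on both
{code}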
[jira] [Resolved] (SPARK-41831) DataFrame.transform: Only Column or String can be used for projections
[ https://issues.apache.org/jira/browse/SPARK-41831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-41831.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39405
[https://github.com/apache/spark/pull/39405]

> DataFrame.transform: Only Column or String can be used for projections
> -----------------------------------------------------------------------
>
> Key: SPARK-41831
> URL: https://issues.apache.org/jira/browse/SPARK-41831
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
>
> {code}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1168, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(cast_all_to_int).transform(sort_columns_asc).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in <module>
>         df.transform(cast_all_to_int).transform(sort_columns_asc).show()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", line 2, in cast_all_to_int
>         return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), session=self._session)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 344, in __init__
>         self._verify_expressions()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can be used for projections: '[Column<'(ColumnReference(int) (int))'>, Column<'(ColumnReference(float) (int))'>]'.
> **********************************************************************
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1179, in pyspark.sql.connect.dataframe.DataFrame.transform
> Failed example:
>     df.transform(add_n, 1).transform(add_n, n=10).show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in <module>
>         df.transform(add_n, 1).transform(add_n, n=10).show()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1102, in transform
>         result = func(self, *args, **kwargs)
>       File "", line 2, in add_n
>         return input_df.select([(col(col_name) + n).alias(col_name)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 86, in select
>         return DataFrame.withPlan(plan.Project(self._plan, *cols), session=self._session)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 344, in __init__
>         self._verify_expressions()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 350, in _verify_expressions
>         raise InputValidationError(
>     pyspark.sql.connect.plan.InputValidationError: Only Column or String can be used for projections: '[Column<'Alias(+(ColumnReference(int), Literal(1)), (int))'>, Column<'Alias(+(ColumnReference(float), Literal(1)), (float))'>]'.
> {code}
[jira] [Assigned] (SPARK-41906) Handle Function `rand()`
[ https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41906:
------------------------------------

    Assignee: Apache Spark

> Handle Function `rand()`
> -------------------------
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Apache Spark
> Priority: Major
>
> {code:python}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
>     assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
>     assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
>
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
>
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2))
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions
>     rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select
>     jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols
>     return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq
>     return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq
>     cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in <listcomp>
>     cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column
>     raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
> {code}
[jira] [Commented] (SPARK-41906) Handle Function `rand()`
[ https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655225#comment-17655225 ]

Apache Spark commented on SPARK-41906:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39421

> Handle Function `rand()`
> -------------------------
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:python}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
>     assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
>     assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
>
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
>
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2))
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions
>     rnd = df.select("key", functions.rand()).collect()
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select
>     jdf = self._jdf.select(self._jcols(*cols))
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols
>     return self._jseq(cols, _to_java_column)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq
>     return _to_seq(self.sparkSession._sc, cols, converter)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq
>     cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in <listcomp>
>     cols = [converter(c) for c in cols]
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column
>     raise TypeError(
> TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
> {code}
[jira] [Assigned] (SPARK-41906) Handle Function `rand() `
[ https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41906: Assignee: (was: Apache Spark) > Handle Function `rand() ` > - > > Key: SPARK-41906 > URL: https://issues.apache.org/jira/browse/SPARK-41906 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.df > from pyspark.sql import functions > rnd = df.select("key", functions.rand()).collect() > for row in rnd: > assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1] > rndn = df.select("key", functions.randn(5)).collect() > for row in rndn: > assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1] > # If the specified seed is 0, we should use it. > # https://issues.apache.org/jira/browse/SPARK-9691 > rnd1 = df.select("key", functions.rand(0)).collect() > rnd2 = df.select("key", functions.rand(0)).collect() > self.assertEqual(sorted(rnd1), sorted(rnd2)) > rndn1 = df.select("key", functions.randn(0)).collect() > rndn2 = df.select("key", functions.randn(0)).collect() > self.assertEqual(sorted(rndn1), sorted(rndn2)){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 299, in test_rand_functions > rnd = df.select("key", functions.rand()).collect() > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", > line 2917, in select > jdf = self._jdf.select(self._jcols(*cols)) > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", > line 2537, in _jcols > return self._jseq(cols, _to_java_column) > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", > line 2524, in _jseq > return _to_seq(self.sparkSession._sc, cols, converter) > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line > 86, in _to_seq > cols = [converter(c) for c in cols] > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line > 86, in <listcomp> > cols = [converter(c) for c in cols] > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line > 65, in _to_java_column > raise TypeError( > TypeError: Invalid argument, not a string or column: Column<'rand()'> of type > <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', > 'array', 'struct' or 'create_map' function. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
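The seed contract asserted in the test above can be checked outside the test harness. A minimal standalone sketch, assuming only a local SparkSession (the `key` column stands in for the test fixture's DataFrame): an explicit seed, including 0, must make `rand`/`randn` deterministic across invocations, while unseeded `rand()` stays within [0.0, 1.0].

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(10).withColumnRenamed("id", "key")

# Unseeded rand() draws uniformly from [0.0, 1.0).
for row in df.select("key", F.rand()).collect():
    assert 0.0 <= row[1] <= 1.0, "got: %s" % row[1]

# An explicit seed -- including seed 0 -- must be honored, so repeated
# invocations produce identical values (the SPARK-9691 regression check).
rnd1 = df.select("key", F.rand(0)).collect()
rnd2 = df.select("key", F.rand(0)).collect()
assert sorted(rnd1) == sorted(rnd2)
{code}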
[jira] [Assigned] (SPARK-41905) Function `slice` should handle string in params
[ https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41905: Assignee: Apache Spark > Function `slice` should handle string in params > --- > > Key: SPARK-41905 > URL: https://issues.apache.org/jira/browse/SPARK-41905 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > ( > [1, 2, 3], > 2, > 2, > ), > ( > [4, 5], > 2, > 2, > ), > ], > ["x", "index", "len"], > ) > expected = [Row(sliced=[2, 3]), Row(sliced=[5])] > self.assertTrue( > all( > [ > df.select(slice(df.x, 2, 2).alias("sliced")).collect() == > expected, > df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() > == expected, > df.select(slice("x", "index", "len").alias("sliced")).collect() > == expected, > ] > ) > ) > self.assertEqual( > df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(), > [Row(sliced=[2]), Row(sliced=[4])], > ) > self.assertEqual( > df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(), > [Row(sliced=[1, 2]), Row(sliced=[4])], > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 596, in test_slice > df.select(slice("x", "index", "len").alias("sliced")).collect() == > expected, > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line > 332, in wrapped > return getattr(functions, f.__name__)(*args, **kwargs) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1525, in slice > raise TypeError(f"start should be a Column or int, but got > {type(start).__name__}") > TypeError: start should be a Column or int, but got str{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41905) Function `slice` should handle string in params
[ https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41905: Assignee: (was: Apache Spark) > Function `slice` should handle string in params > --- > > Key: SPARK-41905 > URL: https://issues.apache.org/jira/browse/SPARK-41905 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > ( > [1, 2, 3], > 2, > 2, > ), > ( > [4, 5], > 2, > 2, > ), > ], > ["x", "index", "len"], > ) > expected = [Row(sliced=[2, 3]), Row(sliced=[5])] > self.assertTrue( > all( > [ > df.select(slice(df.x, 2, 2).alias("sliced")).collect() == > expected, > df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() > == expected, > df.select(slice("x", "index", "len").alias("sliced")).collect() > == expected, > ] > ) > ) > self.assertEqual( > df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(), > [Row(sliced=[2]), Row(sliced=[4])], > ) > self.assertEqual( > df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(), > [Row(sliced=[1, 2]), Row(sliced=[4])], > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 596, in test_slice > df.select(slice("x", "index", "len").alias("sliced")).collect() == > expected, > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line > 332, in wrapped > return getattr(functions, f.__name__)(*args, **kwargs) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1525, in slice > raise TypeError(f"start should be a Column or int, but got > {type(start).__name__}") > TypeError: start should be a Column or int, but got str{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41905) Function `slice` should handle string in params
[ https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655223#comment-17655223 ] Apache Spark commented on SPARK-41905: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39420 > Function `slice` should handle string in params > --- > > Key: SPARK-41905 > URL: https://issues.apache.org/jira/browse/SPARK-41905 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > ( > [1, 2, 3], > 2, > 2, > ), > ( > [4, 5], > 2, > 2, > ), > ], > ["x", "index", "len"], > ) > expected = [Row(sliced=[2, 3]), Row(sliced=[5])] > self.assertTrue( > all( > [ > df.select(slice(df.x, 2, 2).alias("sliced")).collect() == > expected, > df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() > == expected, > df.select(slice("x", "index", "len").alias("sliced")).collect() > == expected, > ] > ) > ) > self.assertEqual( > df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(), > [Row(sliced=[2]), Row(sliced=[4])], > ) > self.assertEqual( > df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(), > [Row(sliced=[1, 2]), Row(sliced=[4])], > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 596, in test_slice > df.select(slice("x", "index", "len").alias("sliced")).collect() == > expected, > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line > 332, in wrapped > return getattr(functions, f.__name__)(*args, **kwargs) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1525, in slice > raise TypeError(f"start should be a Column or int, but got > {type(start).__name__}") > TypeError: start should be a Column or int, but got str{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41905) Function `slice` should handle string in params
[ https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655222#comment-17655222 ] Apache Spark commented on SPARK-41905: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39420 > Function `slice` should handle string in params > --- > > Key: SPARK-41905 > URL: https://issues.apache.org/jira/browse/SPARK-41905 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > ( > [1, 2, 3], > 2, > 2, > ), > ( > [4, 5], > 2, > 2, > ), > ], > ["x", "index", "len"], > ) > expected = [Row(sliced=[2, 3]), Row(sliced=[5])] > self.assertTrue( > all( > [ > df.select(slice(df.x, 2, 2).alias("sliced")).collect() == > expected, > df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() > == expected, > df.select(slice("x", "index", "len").alias("sliced")).collect() > == expected, > ] > ) > ) > self.assertEqual( > df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(), > [Row(sliced=[2]), Row(sliced=[4])], > ) > self.assertEqual( > df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(), > [Row(sliced=[1, 2]), Row(sliced=[4])], > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 596, in test_slice > df.select(slice("x", "index", "len").alias("sliced")).collect() == > expected, > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line > 332, in wrapped > return getattr(functions, f.__name__)(*args, **kwargs) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1525, in slice > raise TypeError(f"start should be a Column or int, but got > {type(start).__name__}") > TypeError: start should be a Column or int, but got str{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
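A minimal sketch of the coercion the ticket asks for, written against the classic PySpark API rather than the actual Connect code (the helper name `slice_compat` is hypothetical): a string passed for `start` or `length` should be interpreted as a column name instead of being rejected.

{code:python}
from pyspark.sql import Column, functions as F

def slice_compat(x, start, length):
    """Accept a Column, an int, or a column-name string for each parameter."""
    def as_col(v):
        if isinstance(v, Column):
            return v
        if isinstance(v, int):
            return F.lit(v)
        if isinstance(v, str):
            return F.col(v)  # treat a bare string as a column name
        raise TypeError("start/length should be a Column, int or str, but got %s" % type(v).__name__)
    return F.slice(F.col(x) if isinstance(x, str) else x, as_col(start), as_col(length))
{code}

With this, `df.select(slice_compat("x", "index", "len").alias("sliced"))` would return the same rows the test expects from `slice(df.x, 2, 2)`.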
[jira] [Commented] (SPARK-41840) DataFrame.show(): 'Column' object is not callable
[ https://issues.apache.org/jira/browse/SPARK-41840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655219#comment-17655219 ] Apache Spark commented on SPARK-41840: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39419 > DataFrame.show(): 'Column' object is not callable > - > > Key: SPARK-41840 > URL: https://issues.apache.org/jira/browse/SPARK-41840 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 855, in pyspark.sql.connect.functions.first > Failed example: > df.groupby("name").agg(first("age", > ignorenulls=True)).orderBy("name").show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in <module> > df.groupby("name").agg(first("age", > ignorenulls=True)).orderBy("name").show() > TypeError: 'Column' object is not callable{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41840) DataFrame.show(): 'Column' object is not callable
[ https://issues.apache.org/jira/browse/SPARK-41840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655217#comment-17655217 ] Apache Spark commented on SPARK-41840: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39419 > DataFrame.show(): 'Column' object is not callable > - > > Key: SPARK-41840 > URL: https://issues.apache.org/jira/browse/SPARK-41840 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 855, in pyspark.sql.connect.functions.first > Failed example: > df.groupby("name").agg(first("age", > ignorenulls=True)).orderBy("name").show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in <module> > df.groupby("name").agg(first("age", > ignorenulls=True)).orderBy("name").show() > TypeError: 'Column' object is not callable{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
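For reference, the failing doctest reconstructed as a plain script, assuming a local SparkSession (the sample data mirrors the `first` docstring); on classic PySpark this runs and shows the first non-null age per name.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Alice", None)], ["name", "age"])

# ignorenulls=True makes first() skip nulls within each group.
df.groupby("name").agg(first("age", ignorenulls=True)).orderBy("name").show()
{code}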
[jira] [Created] (SPARK-41919) Unify the schema or datatype in protos
Ruifeng Zheng created SPARK-41919: - Summary: Unify the schema or datatype in protos Key: SPARK-41919 URL: https://issues.apache.org/jira/browse/SPARK-41919 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Ruifeng Zheng This ticket only focuses on the protos sent from the client to the server. We normally use {code:java} oneof schema { DataType datatype = 2; // Server will use Catalyst parser to parse this string to DataType. string datatype_str = 3; } {code} to represent a schema or datatype. Actually, we can simplify this to just a string: on the server, we can easily parse either a DDL-formatted schema or a JSON-formatted one. {code:java} // (Optional) The schema of local data. // It should be either a DDL-formatted type string or a JSON string. // // The server side will update the column names and data types according to this schema. // If the 'data' is not provided, then this schema will be required. optional string schema = 2; {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
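The simplification works because both encodings already round-trip into the same StructType. A small illustration of the idea using public classic PySpark APIs (a sketch, not the Connect server code):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.master("local[2]").getOrCreate()

# A DDL-formatted schema string, parsed by the Catalyst parser.
from_ddl = spark.createDataFrame([], "name STRING, age INT").schema

# The equivalent JSON form, parsed back into the same StructType.
from_json = StructType.fromJson(from_ddl.jsonValue())

assert from_ddl == from_json
{code}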
[jira] [Created] (SPARK-41918) Refine the naming in proto messages
Ruifeng Zheng created SPARK-41918: - Summary: Refine the naming in proto messages Key: SPARK-41918 URL: https://issues.apache.org/jira/browse/SPARK-41918 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Ruifeng Zheng Normally, we name the fields after the corresponding LogicalPlan or DataFrame API, but the names are not consistent across the protos. For example, for the column name: {code:java} message UnresolvedRegex { // (Required) The column name used to extract column with regex. string col_name = 1; } {code} {code:java} message Alias { // (Required) The expression that alias will be added on. Expression expr = 1; // (Required) a list of name parts for the alias. // // Scalar columns only has one name that presents. repeated string name = 2; // (Optional) Alias metadata expressed as a JSON map. optional string metadata = 3; } {code} {code:java} // Relation of type [[Deduplicate]] which have duplicate rows removed, could consider either only // the subset of columns or all the columns. message Deduplicate { // (Required) Input relation for a Deduplicate. Relation input = 1; // (Optional) Deduplicate based on a list of column names. // // This field does not co-use with `all_columns_as_keys`. repeated string column_names = 2; // (Optional) Deduplicate based on all the columns of the input relation. // // This field does not co-use with `column_names`. optional bool all_columns_as_keys = 3; } {code} {code:java} // Computes basic statistics for numeric and string columns, including count, mean, stddev, min, // and max. If no columns are given, this function computes statistics for all numerical or // string columns. message StatDescribe { // (Required) The input relation. Relation input = 1; // (Optional) Columns to compute statistics on. repeated string cols = 2; } {code} We should probably unify the naming: a single column -> `column`, multiple columns -> `columns`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument
[ https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41869: Assignee: (was: Apache Spark) > DataFrame dropDuplicates should throw error on non list argument > > > Key: SPARK-41869 > URL: https://issues.apache.org/jira/browse/SPARK-41869 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", > "age"]) > # shouldn't drop a non-null row > self.assertEqual(df.dropDuplicates().count(), 2) > self.assertEqual(df.dropDuplicates(["name"]).count(), 1) > self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2) > type_error_msg = "Parameter 'subset' must be a list of columns" > with self.assertRaisesRegex(TypeError, type_error_msg): > df.dropDuplicates("name"){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 128, in test_drop_duplicates > with self.assertRaisesRegex(TypeError, type_error_msg): > AssertionError: TypeError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument
[ https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655216#comment-17655216 ] Apache Spark commented on SPARK-41869: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39418 > DataFrame dropDuplicates should throw error on non list argument > > > Key: SPARK-41869 > URL: https://issues.apache.org/jira/browse/SPARK-41869 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", > "age"]) > # shouldn't drop a non-null row > self.assertEqual(df.dropDuplicates().count(), 2) > self.assertEqual(df.dropDuplicates(["name"]).count(), 1) > self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2) > type_error_msg = "Parameter 'subset' must be a list of columns" > with self.assertRaisesRegex(TypeError, type_error_msg): > df.dropDuplicates("name"){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 128, in test_drop_duplicates > with self.assertRaisesRegex(TypeError, type_error_msg): > AssertionError: TypeError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument
[ https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41869: Assignee: Apache Spark > DataFrame dropDuplicates should throw error on non list argument > > > Key: SPARK-41869 > URL: https://issues.apache.org/jira/browse/SPARK-41869 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", > "age"]) > # shouldn't drop a non-null row > self.assertEqual(df.dropDuplicates().count(), 2) > self.assertEqual(df.dropDuplicates(["name"]).count(), 1) > self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2) > type_error_msg = "Parameter 'subset' must be a list of columns" > with self.assertRaisesRegex(TypeError, type_error_msg): > df.dropDuplicates("name"){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 128, in test_drop_duplicates > with self.assertRaisesRegex(TypeError, type_error_msg): > AssertionError: TypeError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
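A minimal sketch of the missing validation (not the shipped Connect code): a bare string must be rejected up front, because `dropDuplicates("name")` would otherwise be accepted as an iterable of single-character column names.

{code:python}
def drop_duplicates(df, subset=None):
    # Fail fast on a bare string, matching the error message the test expects.
    if subset is not None and not isinstance(subset, (list, tuple)):
        raise TypeError("Parameter 'subset' must be a list of columns")
    return df.dropDuplicates(subset)
{code}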
[jira] [Resolved] (SPARK-41861) Make v2 ScanBuilders' build() return typed scan
[ https://issues.apache.org/jira/browse/SPARK-41861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41861. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39367 [https://github.com/apache/spark/pull/39367] > Make v2 ScanBuilders' build() return typed scan > --- > > Key: SPARK-41861 > URL: https://issues.apache.org/jira/browse/SPARK-41861 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Lorenzo Martini >Assignee: Lorenzo Martini >Priority: Trivial > Fix For: 3.4.0 > > > The `ScanBuilder` interface has the `build()` method to return `Scan` > objects. All the different implementations of it return scans that are of the > type of the builder itself, e.g. `ParquetScanBuilder` will return a > `ParquetScan`, `TextScanBuilder` a `TextScan`, etc. However, in the method > override declarations, the return type is not narrowed to the more specific > type but left as a generic `Scan`. This is a bit problematic when one has to > work with scan objects, because a manual cast is required even though we know for > a fact that a `ParquetScanBuilder`'s `build()` method will return a > `ParquetScan`. For ease of development (and for stricter correctness checks, > as we wouldn't want a `ParquetScanBuilder` to return a non-parquet scan, I > would assume), it would be very nice if the `build()` method of each > implementation of `ScanBuilder` returned a more strictly typed object > instead of the generic `Scan` object. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41861) Make v2 ScanBuilders' build() return typed scan
[ https://issues.apache.org/jira/browse/SPARK-41861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-41861: --- Assignee: Lorenzo Martini > Make v2 ScanBuilders' build() return typed scan > --- > > Key: SPARK-41861 > URL: https://issues.apache.org/jira/browse/SPARK-41861 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Lorenzo Martini >Assignee: Lorenzo Martini >Priority: Trivial > > The `ScanBuilder` interface has the `build()` method to return `Scan` > objects. All the different implementations of it return scans that are of the > type of the builder itself, e.g. `ParquetScanBuilder` will return a > `ParquetScan`, `TextScanBuilder` a `TextScan`, etc. However, in the method > override declarations, the return type is not narrowed to the more specific > type but left as a generic `Scan`. This is a bit problematic when one has to > work with scan objects, because a manual cast is required even though we know for > a fact that a `ParquetScanBuilder`'s `build()` method will return a > `ParquetScan`. For ease of development (and for stricter correctness checks, > as we wouldn't want a `ParquetScanBuilder` to return a non-parquet scan, I > would assume), it would be very nice if the `build()` method of each > implementation of `ScanBuilder` returned a more strictly typed object > instead of the generic `Scan` object. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41806) Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block ambiguous queries with static partitions columns
[ https://issues.apache.org/jira/browse/SPARK-41806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-41806: --- Assignee: Allison Portis > Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block > ambiguous queries with static partitions columns > - > > Key: SPARK-41806 > URL: https://issues.apache.org/jira/browse/SPARK-41806 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Portis >Assignee: Allison Portis >Priority: Major > Fix For: 3.4.0 > > > Currently for INSERT INTO by name we reorder the value list and convert it to > INSERT INTO by ordinal. Since DSv2 logical nodes have the isByName flag we > don't need to do this. The current approach is limiting in that > # Users must provide the full list of table columns (this limits the > functionality for features like generated columns see SPARK-41290) > # It allows ambiguous queries such as INSERT OVERWRITE t PARTITION (c='1') > (c) VALUES ('2') where the user provides both the static partition column 'c' > and the column 'c' in the column list. We should check that the static > partition column is not in the column list. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41806) Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block ambiguous queries with static partitions columns
[ https://issues.apache.org/jira/browse/SPARK-41806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41806. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39334 [https://github.com/apache/spark/pull/39334] > Use AppendData.byName for SQL INSERT INTO by name for DSV2 and block > ambiguous queries with static partitions columns > - > > Key: SPARK-41806 > URL: https://issues.apache.org/jira/browse/SPARK-41806 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Portis >Priority: Major > Fix For: 3.4.0 > > > Currently for INSERT INTO by name we reorder the value list and convert it to > INSERT INTO by ordinal. Since DSv2 logical nodes have the isByName flag we > don't need to do this. The current approach is limiting in that > # Users must provide the full list of table columns (this limits the > functionality for features like generated columns see SPARK-41290) > # It allows ambiguous queries such as INSERT OVERWRITE t PARTITION (c='1') > (c) VALUES ('2') where the user provides both the static partition column 'c' > and the column 'c' in the column list. We should check that the static > partition column is not in the column list. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
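To make the second limitation concrete, here is the ambiguous pattern as a repro sketch (the table `t` and its schema are hypothetical, and the session setup is assumed): after this change, the INSERT should fail analysis because `c` is bound both as a static partition value and as a by-name column.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.sql("CREATE TABLE t (a INT, c STRING) USING parquet PARTITIONED BY (c)")

# Ambiguous: PARTITION (c='1') and the column list (a, c) both claim column `c`;
# with this change it should be rejected instead of silently resolved.
spark.sql("INSERT OVERWRITE t PARTITION (c='1') (a, c) VALUES (0, '2')")
{code}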
[jira] [Commented] (SPARK-41831) DataFrame.transform: Only Column or String can be used for projections
[ https://issues.apache.org/jira/browse/SPARK-41831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655212#comment-17655212 ] Apache Spark commented on SPARK-41831: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39417 > DataFrame.transform: Only Column or String can be used for projections > -- > > Key: SPARK-41831 > URL: https://issues.apache.org/jira/browse/SPARK-41831 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1168, in pyspark.sql.connect.dataframe.DataFrame.transform > Failed example: > df.transform(cast_all_to_int).transform(sort_columns_asc).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", > line 1, in <module> > df.transform(cast_all_to_int).transform(sort_columns_asc).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1102, in transform > result = func(self, *args, **kwargs) > File "", > line 2, in cast_all_to_int > return input_df.select([col(col_name).cast("int") for col_name in > input_df.columns]) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 86, in select > return DataFrame.withPlan(plan.Project(self._plan, *cols), > session=self._session) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line > 344, in __init__ > self._verify_expressions() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line > 350, in _verify_expressions > raise InputValidationError( > pyspark.sql.connect.plan.InputValidationError: Only Column or String can > be used for projections: '[Column<'(ColumnReference(int) (int))'>, > Column<'(ColumnReference(float) (int))'>]'. 
> ** > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1179, in pyspark.sql.connect.dataframe.DataFrame.transform > Failed example: > df.transform(add_n, 1).transform(add_n, n=10).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", > line 1, in <module> > df.transform(add_n, 1).transform(add_n, n=10).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1102, in transform > result = func(self, *args, **kwargs) > File "", > line 2, in add_n > return input_df.select([(col(col_name) + n).alias(col_name) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 86, in select > return DataFrame.withPlan(plan.Project(self._plan, *cols), > session=self._session) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line > 344, in __init__ > self._verify_expressions() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line > 350, in _verify_expressions > raise InputValidationError( > pyspark.sql.connect.plan.InputValidationError: Only Column or String can > be used for projections: '[Column<'Alias(+(ColumnReference(int), Literal(1)), > (int))'>, Column<'Alias(+(ColumnReference(float), Literal(1)), > (float))'>]'.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
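For reference, the doctest scenario as a standalone script against the classic API, assuming a local SparkSession; this is the behavior Connect's `Project` needs to accept, since `transform` chains functions whose projections are built from a Python list of Columns.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["int", "float"])

def cast_all_to_int(input_df):
    # select() receives a list of Columns -- the case Connect rejected.
    return input_df.select([col(c).cast("int") for c in input_df.columns])

def sort_columns_asc(input_df):
    return input_df.select(*sorted(input_df.columns))

df.transform(cast_all_to_int).transform(sort_columns_asc).show()
{code}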
[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files
[ https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655211#comment-17655211 ] Eugene Chung commented on SPARK-39743: -- I'm sorry to ask a question here, but I couldn't find the answer on the Internet. Can I set the zstd compression level for ORC? > Unable to set zstd compression level while writing parquet files > > > Key: SPARK-39743 > URL: https://issues.apache.org/jira/browse/SPARK-39743 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Assignee: ming95 >Priority: Minor > Fix For: 3.4.0 > > > While writing zstd-compressed parquet files, the setting > `spark.io.compression.zstd.level` does not have any effect on the zstd > compression level. > All files seem to be written with the default zstd compression level, and the > config option seems to be ignored. > Using the zstd CLI tool, we confirmed that setting a higher compression level > for the same file tested in Spark resulted in a smaller file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
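On the question above: `spark.io.compression.zstd.level` only tunes Spark's internal I/O codec (shuffle blocks, event logs and the like), not columnar file output; Parquet's writer reads its own Hadoop configuration. A hedged sketch follows; the key `parquet.compression.codec.zstd.level` is parquet-mr's and should be verified against the parquet-hadoop version in use, and whether ORC exposes an analogous knob depends on the ORC version.

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Forwarded into the Hadoop conf; read by parquet-mr's zstd codec (assumed key).
    .config("spark.hadoop.parquet.compression.codec.zstd.level", "9")
    .getOrCreate()
)

spark.range(1000).write.mode("overwrite").option("compression", "zstd").parquet("/tmp/zstd_level_demo")
{code}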
[jira] [Commented] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655206#comment-17655206 ] Hyukjin Kwon commented on SPARK-41875: -- Thanks [~beliefer] > Throw proper errors in Dataset.to() > --- > > Key: SPARK-41875 > URL: https://issues.apache.org/jira/browse/SPARK-41875 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > schema = StructType( > [StructField("i", StringType(), True), StructField("j", IntegerType(), > True)] > ) > df = self.spark.createDataFrame([("a", 1)], schema) > schema1 = StructType([StructField("j", StringType()), StructField("i", > StringType())]) > df1 = df.to(schema1) > self.assertEqual(schema1, df1.schema) > self.assertEqual(df.count(), df1.count()) > schema2 = StructType([StructField("j", LongType())]) > df2 = df.to(schema2) > self.assertEqual(schema2, df2.schema) > self.assertEqual(df.count(), df2.count()) > schema3 = StructType([StructField("struct", schema1, False)]) > df3 = df.select(struct("i", "j").alias("struct")).to(schema3) > self.assertEqual(schema3, df3.schema) > self.assertEqual(df.count(), df3.count()) > # incompatible field nullability > schema4 = StructType([StructField("j", LongType(), False)]) > self.assertRaisesRegex( > AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1486, in test_to > self.assertRaisesRegex( > AssertionError: AnalysisException not raised by <lambda>{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655203#comment-17655203 ] jiaan.geng commented on SPARK-41875: I will take a look! > Throw proper errors in Dataset.to() > --- > > Key: SPARK-41875 > URL: https://issues.apache.org/jira/browse/SPARK-41875 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > schema = StructType( > [StructField("i", StringType(), True), StructField("j", IntegerType(), > True)] > ) > df = self.spark.createDataFrame([("a", 1)], schema) > schema1 = StructType([StructField("j", StringType()), StructField("i", > StringType())]) > df1 = df.to(schema1) > self.assertEqual(schema1, df1.schema) > self.assertEqual(df.count(), df1.count()) > schema2 = StructType([StructField("j", LongType())]) > df2 = df.to(schema2) > self.assertEqual(schema2, df2.schema) > self.assertEqual(df.count(), df2.count()) > schema3 = StructType([StructField("struct", schema1, False)]) > df3 = df.select(struct("i", "j").alias("struct")).to(schema3) > self.assertEqual(schema3, df3.schema) > self.assertEqual(df.count(), df3.count()) > # incompatible field nullability > schema4 = StructType([StructField("j", LongType(), False)]) > self.assertRaisesRegex( > AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1486, in test_to > self.assertRaisesRegex( > AssertionError: AnalysisException not raised by <lambda>{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41567) Move configuration of `versions-maven-plugin` to parent pom
[ https://issues.apache.org/jira/browse/SPARK-41567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41567: - Assignee: Yang Jie > Move configuration of `versions-maven-plugin` to parent pom > --- > > Key: SPARK-41567 > URL: https://issues.apache.org/jira/browse/SPARK-41567 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > In addition to `test-dependencies.sh`, `release-build.sh` and > `release-tag.sh` also use the `build/mvn versions:set` command, so moving the > configuration of `versions-maven-plugin` to the parent pom unifies the > `versions-maven-plugin` version used across the whole Spark project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41567) Move configuration of `versions-maven-plugin` to parent pom
[ https://issues.apache.org/jira/browse/SPARK-41567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41567. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39118 [https://github.com/apache/spark/pull/39118] > Move configuration of `versions-maven-plugin` to parent pom > --- > > Key: SPARK-41567 > URL: https://issues.apache.org/jira/browse/SPARK-41567 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > In addition to `test-dependencies.sh`, `release-build.sh` and > `release-tag.sh` also use the `build/mvn versions:set` command, so moving the > configuration of `versions-maven-plugin` to the parent pom unifies the > `versions-maven-plugin` version used across the whole Spark project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41827) DataFrame.groupBy requires all cols be Column or str
[ https://issues.apache.org/jira/browse/SPARK-41827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41827. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39404 [https://github.com/apache/spark/pull/39404] > DataFrame.groupBy requires all cols be Column or str > > > Key: SPARK-41827 > URL: https://issues.apache.org/jira/browse/SPARK-41827 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 302, in pyspark.sql.connect.dataframe.DataFrame.groupBy > Failed example: > df.groupBy(["name", df.age]).count().sort("name", "age").show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", > line 1, in <module> > df.groupBy(["name", df.age]).count().sort("name", "age").show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 251, in groupBy > raise TypeError( > TypeError: groupBy requires all cols be Column or str, but got list > ['name', Column<'ColumnReference(age)'>]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41827) DataFrame.groupBy requires all cols be Column or str
[ https://issues.apache.org/jira/browse/SPARK-41827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41827: Assignee: Ruifeng Zheng > DataFrame.groupBy requires all cols be Column or str > > > Key: SPARK-41827 > URL: https://issues.apache.org/jira/browse/SPARK-41827 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 302, in pyspark.sql.connect.dataframe.DataFrame.groupBy > Failed example: > df.groupBy(["name", df.age]).count().sort("name", "age").show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", > line 1, in <module> > df.groupBy(["name", df.age]).count().sort("name", "age").show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 251, in groupBy > raise TypeError( > TypeError: groupBy requires all cols be Column or str, but got list > ['name', Column<'ColumnReference(age)'>]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
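A minimal sketch of the argument normalization the fix needs (the helper name is hypothetical, not the merged code): flatten a single list or tuple before validating element types, so `df.groupBy(["name", df.age])` behaves like `df.groupBy("name", df.age)`.

{code:python}
from pyspark.sql import Column

def normalize_group_cols(*cols):
    # Accept either varargs or a single list/tuple of columns.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = cols[0]
    for c in cols:
        if not isinstance(c, (Column, str)):
            raise TypeError(
                "groupBy requires all cols be Column or str, but got %s %r"
                % (type(c).__name__, c)
            )
    return list(cols)
{code}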
[jira] [Assigned] (SPARK-41652) Test parity: pyspark.sql.tests.test_functions
[ https://issues.apache.org/jira/browse/SPARK-41652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41652: Assignee: Sandeep Singh (was: Hyukjin Kwon) > Test parity: pyspark.sql.tests.test_functions > - > > Key: SPARK-41652 > URL: https://issues.apache.org/jira/browse/SPARK-41652 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > > After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse > the same test cases, see > {{python/pyspark/sql/tests/connect/test_parity_functions.py}}. > We should remove all the test cases defined there, and fix Spark Connect > behaviours accordingly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41651) Test parity: pyspark.sql.tests.test_dataframe
[ https://issues.apache.org/jira/browse/SPARK-41651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41651: Assignee: Sandeep Singh (was: Hyukjin Kwon) > Test parity: pyspark.sql.tests.test_dataframe > - > > Key: SPARK-41651 > URL: https://issues.apache.org/jira/browse/SPARK-41651 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > > After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse > the same test cases, see > {{python/pyspark/sql/tests/connect/test_parity_dataframe.py}}. > We should remove all the test cases defined there, and fix Spark Connect > behaviours accordingly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41849: Assignee: Sandeep Singh > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in <module> > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41849. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39413 [https://github.com/apache/spark/pull/39413] > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in <module> > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
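What the doctest expects once `DataFrameReader.text` exists, shown against the classic API (assumptions: a local session and write access to a throwaway /tmp path): each line of the input files becomes one row in a single `value` column.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.master("local[2]").getOrCreate()
path = "/tmp/read_text_demo"

# Write a few lines of text, then read them back.
spark.range(3).selectExpr("CAST(id AS STRING) AS value").write.mode("overwrite").text(path)

df = spark.read.text(path)  # one row per line, in a column named `value`
df.select(input_file_name()).show(truncate=False)
{code}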
[jira] [Commented] (SPARK-41677) Protobuf serializer for StreamingQueryProgressWrapper
[ https://issues.apache.org/jira/browse/SPARK-41677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655194#comment-17655194 ] Apache Spark commented on SPARK-41677: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39416 > Protobuf serializer for StreamingQueryProgressWrapper > - > > Key: SPARK-41677 > URL: https://issues.apache.org/jira/browse/SPARK-41677 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41895) Add tests for streaming UI with RocksDB backend
[ https://issues.apache.org/jira/browse/SPARK-41895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655191#comment-17655191 ] Apache Spark commented on SPARK-41895: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39415 > Add tests for streaming UI with RocksDB backend > --- > > Key: SPARK-41895 > URL: https://issues.apache.org/jira/browse/SPARK-41895 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41895) Add tests for streaming UI with RocksDB backend
[ https://issues.apache.org/jira/browse/SPARK-41895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41895: Assignee: Apache Spark (was: Gengliang Wang) > Add tests for streaming UI with RocksDB backend > --- > > Key: SPARK-41895 > URL: https://issues.apache.org/jira/browse/SPARK-41895 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41895) Add tests for streaming UI with RocksDB backend
[ https://issues.apache.org/jira/browse/SPARK-41895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41895: Assignee: Gengliang Wang (was: Apache Spark) > Add tests for streaming UI with RocksDB backend > --- > > Key: SPARK-41895 > URL: https://issues.apache.org/jira/browse/SPARK-41895 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41892) Add JIRAs or messages for skipped messages
[ https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41892. -- Resolution: Fixed Issue resolved by pull request 39412 [https://github.com/apache/spark/pull/39412] > Add JIRAs or messages for skipped messages > -- > > Key: SPARK-41892 > URL: https://issues.apache.org/jira/browse/SPARK-41892 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41305) Connect Proto Completeness
[ https://issues.apache.org/jira/browse/SPARK-41305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41305: - Fix Version/s: (was: 3.4.0) > Connect Proto Completeness > -- > > Key: SPARK-41305 > URL: https://issues.apache.org/jira/browse/SPARK-41305 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41893) Publish SBOM artifacts
[ https://issues.apache.org/jira/browse/SPARK-41893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41893. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39401 [https://github.com/apache/spark/pull/39401] > Publish SBOM artifacts > -- > > Key: SPARK-41893 > URL: https://issues.apache.org/jira/browse/SPARK-41893 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41893) Publish SBOM artifacts
[ https://issues.apache.org/jira/browse/SPARK-41893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41893: - Assignee: Dongjoon Hyun > Publish SBOM artifacts > -- > > Key: SPARK-41893 > URL: https://issues.apache.org/jira/browse/SPARK-41893 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41917) Support SSL and Auth token in connection channel for JVM/Scala Client
Venkata Sai Akhil Gudesa created SPARK-41917: Summary: Support SSL and Auth token in connection channel for JVM/Scala Client Key: SPARK-41917 URL: https://issues.apache.org/jira/browse/SPARK-41917 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Venkata Sai Akhil Gudesa -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41916) Address `spark.task.resource.gpu.amount > 1`
Rithwik Ediga Lakhamsani created SPARK-41916: Summary: Address `spark.task.resource.gpu.amount > 1` Key: SPARK-41916 URL: https://issues.apache.org/jira/browse/SPARK-41916 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.4.0 Reporter: Rithwik Ediga Lakhamsani We want the distributor to be able to run multiple torchrun processes per task when `spark.task.resource.gpu.amount` > 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41915) Change API so that the user doesn't have to explicitly set pytorch-lightning
Rithwik Ediga Lakhamsani created SPARK-41915: Summary: Change API so that the user doesn't have to explicitly set pytorch-lightning Key: SPARK-41915 URL: https://issues.apache.org/jira/browse/SPARK-41915 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.4.0 Reporter: Rithwik Ediga Lakhamsani Remove the `framework` parameter from the API and have cloudpickle automatically determine whether the user code depends on PyTorch Lightning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on
[ https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655135#comment-17655135 ] Enrico Minack commented on SPARK-40588: --- Unfortunately, this issue persists in Spark 3.4.0; I have created SPARK-41914 to track it. > Sorting issue with partitioned-writing and AQE turned on > > > Key: SPARK-40588 > URL: https://issues.apache.org/jira/browse/SPARK-40588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.3 > Environment: Spark v3.1.3 > Scala v2.12.13 >Reporter: Swetha Baskaran >Assignee: Enrico Minack >Priority: Major > Fix For: 3.2.3, 3.3.2 > > Attachments: image-2022-10-16-22-05-47-159.png > > > We are attempting to partition data by a few columns, sort by a particular > _sortCol_ and write out one file per partition. > {code:java} > df > .repartition(col("day"), col("month"), col("year")) > .withColumn("partitionId",spark_partition_id) > .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId) > .sortWithinPartitions("year", "month", "day", "sortCol") > .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId) > .write > .partitionBy("year", "month", "day") > .parquet(path){code} > When inspecting the results, we observe one file per partition; however, we > see an _alternating_ pattern of unsorted rows in some files. > {code:java} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348} > {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590} > {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code} > Here is a > [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to > reproduce the issue. > Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) > fixes the issue. > I'm working on identifying why AQE affects the sort order. Any leads or > thoughts would be appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41914) Sorting issue with partitioned-writing and planned write optimization disabled
Enrico Minack created SPARK-41914: - Summary: Sorting issue with partitioned-writing and planned write optimization disabled Key: SPARK-41914 URL: https://issues.apache.org/jira/browse/SPARK-41914 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Enrico Minack Spark 3.4.0 introduced option {{{}spark.sql.optimizer.plannedWrite.enabled{}}}, which is enabled by default. When disabled, partitioned writing loses in-partition order when spilling occurs. This is related to SPARK-40885 where setting option {{spark.sql.optimizer.plannedWrite.enabled}} to {{true}} will remove the existing sort (for {{day}} and {{{}id{}}}) entirely. Run this with 512m memory and one executor, e.g.: {code} spark-shell --driver-memory 512m --master "local[1]" {code} {code:scala} import org.apache.spark.sql.SaveMode spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", false) val ids = 200 val days = 2 val parts = 2 val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", "day").join(spark.range(0, ids, 1, parts)) ds.repartition($"day") .sortWithinPartitions($"day", $"id") .write .partitionBy("day") .mode(SaveMode.Overwrite) .csv("interleaved.csv") {code} Check that the written files are sorted (prints OK for each file that is sorted): {code:bash} for file in interleaved.csv/day\=*/part-* do echo "$(sort -n "$file" | md5sum | cut -d " " -f 1) $file" done | md5sum -c {code} Files should look like this: {code} 0 1 2 ... 1048576 1048577 1048578 ... {code} But they look like: {code} 0 1048576 1 1048577 2 1048578 ... {code} The root cause is the same as in SPARK-40588. A sort (for {{{}day{}}}) is added on top of the existing sort (for {{day}} and {{{}id{}}}). Spilling interleaves the sorted spill files. {code} Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0 +- AdaptiveSparkPlan isFinalPlan=false +- Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(day#2L, 200), REPARTITION_BY_COL, [plan_id=30] +- BroadcastNestedLoopJoin BuildLeft, Inner :- BroadcastExchange IdentityBroadcastMode, [plan_id=28] : +- Project [id#0L AS day#2L] : +- Range (0, 2, step=1, splits=2) +- Range (0, 200, step=1, splits=2) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
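For anyone replaying the repro above from PySpark rather than spark-shell, the flag is flipped the same way; note that, per SPARK-40885, enabling it removes the existing sort entirely, so neither setting preserves the intended in-partition order here.
{code:python}
# Toggle the planned write optimization (default true in 3.4.0) to compare
# both code paths of the repro above.
spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", False)
{code}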
[jira] [Created] (SPARK-41913) Add a linter rule to enforce transforming with pruning
Gengliang Wang created SPARK-41913: -- Summary: Add a linter rule to enforce transforming with pruning Key: SPARK-41913 URL: https://issues.apache.org/jira/browse/SPARK-41913 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Add a linter rule to enforce transforming Catalyst tree nodes with pruning. This is to ensure catalyst rules are implemented efficiently. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41633) Identify aggregation expression in the nodePatterns of PythonUDF
[ https://issues.apache.org/jira/browse/SPARK-41633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-41633. Resolution: Won't Do > Identify aggregation expression in the nodePatterns of PythonUDF > > > Key: SPARK-41633 > URL: https://issues.apache.org/jira/browse/SPARK-41633 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > When a PythonUDF is evaluated as SQL_GROUPED_AGG_PANDAS_UDF, we can mark it > as AGGREGATE_EXPRESSION in the `nodePatterns`, so that we can check whether > an expression contains aggregation in a handy way: > `expr.containsPattern(AGGREGATE_EXPRESSION)` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41912) Subquery should not validate CTE
[ https://issues.apache.org/jira/browse/SPARK-41912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655103#comment-17655103 ] Apache Spark commented on SPARK-41912: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/39414 > Subquery should not validate CTE > > > Key: SPARK-41912 > URL: https://issues.apache.org/jira/browse/SPARK-41912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41912) Subquery should not validate CTE
[ https://issues.apache.org/jira/browse/SPARK-41912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41912: Assignee: Apache Spark (was: Rui Wang) > Subquery should not validate CTE > > > Key: SPARK-41912 > URL: https://issues.apache.org/jira/browse/SPARK-41912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41912) Subquery should not validate CTE
[ https://issues.apache.org/jira/browse/SPARK-41912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41912: Assignee: Rui Wang (was: Apache Spark) > Subquery should not validate CTE > > > Key: SPARK-41912 > URL: https://issues.apache.org/jira/browse/SPARK-41912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41912) Subquery should not validate CTE
Rui Wang created SPARK-41912: Summary: Subquery should not validate CTE Key: SPARK-41912 URL: https://issues.apache.org/jira/browse/SPARK-41912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41755) Reorder fields to use consecutive field numbers
[ https://issues.apache.org/jira/browse/SPARK-41755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41755: - Summary: Reorder fields to use consecutive field numbers (was: Reorder the relation IDs) > Reorder fields to use consecutive field numbers > --- > > Key: SPARK-41755 > URL: https://issues.apache.org/jira/browse/SPARK-41755 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > make IDs consecutive > ``` > RepartitionByExpression repartition_by_expression = 27; > // NA functions > NAFill fill_na = 90; > NADrop drop_na = 91; > NAReplace replace = 92; > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41909) Update proto fields to use increasing field numbers and avoid holes
[ https://issues.apache.org/jira/browse/SPARK-41909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang resolved SPARK-41909. -- Resolution: Duplicate https://issues.apache.org/jira/browse/SPARK-41755 > Update proto fields to use increasing field numbers and avoid holes > --- > > Key: SPARK-41909 > URL: https://issues.apache.org/jira/browse/SPARK-41909 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41908) Catalog API refactoring
[ https://issues.apache.org/jira/browse/SPARK-41908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41908: - Description: We may revisit the Catalog proto design and refactor it; such a refactoring would be a breaking change. > Catalog API refactoring > --- > > Key: SPARK-41908 > URL: https://issues.apache.org/jira/browse/SPARK-41908 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > > We may revisit the Catalog proto design and refactor it; such a refactoring > would be a breaking change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41910) Remove `optional` notation in proto
[ https://issues.apache.org/jira/browse/SPARK-41910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41910: - Description: Every field in proto3 has a default value. We should revisit each existing proto field to determine whether the default value can be used without needing to tell whether the field is set or unset, and remove `optional` as much as possible from the Spark Connect proto surface. > Remove `optional` notation in proto > --- > > Key: SPARK-41910 > URL: https://issues.apache.org/jira/browse/SPARK-41910 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > > Every field in proto3 has a default value. We should revisit each existing > proto field to determine whether the default value can be used without > needing to tell whether the field is set or unset, and remove `optional` as > much as possible from the Spark Connect proto surface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
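A small runnable illustration of the presence semantics at stake, using the protobuf well-known Value type (which models presence with a oneof): an unset proto3 scalar reads as its default, which is indistinguishable from a field explicitly set to that default unless presence is modeled via `optional`, a oneof, or a wrapper type.
{code:python}
from google.protobuf import struct_pb2

v = struct_pb2.Value()
print(v.WhichOneof("kind"))  # None: nothing has been set yet
print(v.number_value)        # 0.0: reading an unset field yields the default

v.number_value = 0.0         # explicitly set the field to its default value
print(v.WhichOneof("kind"))  # 'number_value': the oneof tracks presence
{code}
Dropping `optional` from a field erases exactly this set-versus-default distinction, which is why the review has to be field by field.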
[jira] [Updated] (SPARK-41911) Add version fields to Connect proto
[ https://issues.apache.org/jira/browse/SPARK-41911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41911: - Description: We may need this to help maintain compatibility. Depending on the concrete protocol design, we may use field number 1 for version fields, which may cause breaking changes to existing proto messages. > Add version fields to Connect proto > --- > > Key: SPARK-41911 > URL: https://issues.apache.org/jira/browse/SPARK-41911 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > > We may need this to help maintain compatibility. Depending on the concrete > protocol design, we may use field number 1 for version fields, which may > cause breaking changes to existing proto messages. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41911) Add version fields to Connect proto
Rui Wang created SPARK-41911: Summary: Add version fields to Connect proto Key: SPARK-41911 URL: https://issues.apache.org/jira/browse/SPARK-41911 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41910) Remove `optional` notation in proto
Rui Wang created SPARK-41910: Summary: Remove `optional` notation in proto Key: SPARK-41910 URL: https://issues.apache.org/jira/browse/SPARK-41910 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41909) Update proto fields to use increasing field numbers and avoid holes
Rui Wang created SPARK-41909: Summary: Update proto fields to use increasing field numbers and avoid holes Key: SPARK-41909 URL: https://issues.apache.org/jira/browse/SPARK-41909 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41908) Catalog API refactoring
Rui Wang created SPARK-41908: Summary: Catalog API refactoring Key: SPARK-41908 URL: https://issues.apache.org/jira/browse/SPARK-41908 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41849: Assignee: Apache Spark > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655094#comment-17655094 ] Apache Spark commented on SPARK-41849: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39413 > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41849: Assignee: (was: Apache Spark) > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655095#comment-17655095 ] Apache Spark commented on SPARK-41849: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39413 > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
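For reference, the parity target is the classic PySpark reader behavior; a minimal sketch (the path is a placeholder, and `spark` is an existing session):
{code:python}
# Each line of the text file becomes one row in a single string column
# named "value"; the Connect client should match this.
df = spark.read.text("/tmp/example.txt")
df.printSchema()  # root |-- value: string (nullable = true)
{code}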
[jira] [Resolved] (SPARK-41882) Add tests for SQLAppStatusStore with RocksDB Backend
[ https://issues.apache.org/jira/browse/SPARK-41882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-41882. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39385 [https://github.com/apache/spark/pull/39385] > Add tests for SQLAppStatusStore with RocksDB Backend > > > Key: SPARK-41882 > URL: https://issues.apache.org/jira/browse/SPARK-41882 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41882) Add tests for SQLAppStatusStore with RocksDB Backend
[ https://issues.apache.org/jira/browse/SPARK-41882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-41882: -- Assignee: Yang Jie > Add tests for SQLAppStatusStore with RocksDB Backend > > > Key: SPARK-41882 > URL: https://issues.apache.org/jira/browse/SPARK-41882 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41892) Add JIRAs or messages for skipped messages
[ https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655081#comment-17655081 ] Apache Spark commented on SPARK-41892: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39412 > Add JIRAs or messages for skipped messages > -- > > Key: SPARK-41892 > URL: https://issues.apache.org/jira/browse/SPARK-41892 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41892) Add JIRAs or messages for skipped messages
[ https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41892: Assignee: Sandeep Singh (was: Apache Spark) > Add JIRAs or messages for skipped messages > -- > > Key: SPARK-41892 > URL: https://issues.apache.org/jira/browse/SPARK-41892 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41892) Add JIRAs or messages for skipped messages
[ https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41892: Assignee: Apache Spark (was: Sandeep Singh) > Add JIRAs or messages for skipped messages > -- > > Key: SPARK-41892 > URL: https://issues.apache.org/jira/browse/SPARK-41892 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41892) Add JIRAs or messages for skipped messages
[ https://issues.apache.org/jira/browse/SPARK-41892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655082#comment-17655082 ] Apache Spark commented on SPARK-41892: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39412 > Add JIRAs or messages for skipped messages > -- > > Key: SPARK-41892 > URL: https://issues.apache.org/jira/browse/SPARK-41892 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41907) Function `sampleby` return parity
Sandeep Singh created SPARK-41907: - Summary: Function `sampleby` return parity Key: SPARK-41907 URL: https://issues.apache.org/jira/browse/SPARK-41907 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} df = self.df from pyspark.sql import functions rnd = df.select("key", functions.rand()).collect() for row in rnd: assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1] rndn = df.select("key", functions.randn(5)).collect() for row in rndn: assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1] # If the specified seed is 0, we should use it. # https://issues.apache.org/jira/browse/SPARK-9691 rnd1 = df.select("key", functions.rand(0)).collect() rnd2 = df.select("key", functions.rand(0)).collect() self.assertEqual(sorted(rnd1), sorted(rnd2)) rndn1 = df.select("key", functions.randn(0)).collect() rndn2 = df.select("key", functions.randn(0)).collect() self.assertEqual(sorted(rndn1), sorted(rndn2)){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions rnd = df.select("key", functions.rand()).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select jdf = self._jdf.select(self._jcols(*cols)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols return self._jseq(cols, _to_java_column) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq return _to_seq(self.sparkSession._sc, cols, converter) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq cols = [converter(c) for c in cols] File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in cols = [converter(c) for c in cols] File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column raise TypeError( TypeError: Invalid argument, not a string or column: Column<'rand()'> of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' function. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41907: -- Description: {code:java} df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) self.assertTrue(sampled.count() == 35){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 202, in test_sampleby self.assertTrue(sampled.count() == 35) AssertionError: False is not true {code} was: {code:java} df = self.df from pyspark.sql import functions rnd = df.select("key", functions.rand()).collect() for row in rnd: assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1] rndn = df.select("key", functions.randn(5)).collect() for row in rndn: assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1] # If the specified seed is 0, we should use it. # https://issues.apache.org/jira/browse/SPARK-9691 rnd1 = df.select("key", functions.rand(0)).collect() rnd2 = df.select("key", functions.rand(0)).collect() self.assertEqual(sorted(rnd1), sorted(rnd2)) rndn1 = df.select("key", functions.randn(0)).collect() rndn2 = df.select("key", functions.randn(0)).collect() self.assertEqual(sorted(rndn1), sorted(rndn2)){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions rnd = df.select("key", functions.rand()).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select jdf = self._jdf.select(self._jcols(*cols)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols return self._jseq(cols, _to_java_column) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq return _to_seq(self.sparkSession._sc, cols, converter) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq cols = [converter(c) for c in cols] File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in cols = [converter(c) for c in cols] File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column raise TypeError( TypeError: Invalid argument, not a string or column: Column<'rand()'> of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' function. {code} > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
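A minimal standalone version of the check above; the expected count of 35 comes from the existing test, which classic PySpark satisfies with seed 0, while the Connect client returned a different sample.
{code:python}
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(a=i, b=i % 3) for i in range(100)])
# Stratified sample: keep ~50% of strata b=0 and b=1, drop stratum b=2.
sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
print(sampled.count())  # classic PySpark: 35 with this seed
{code}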
[jira] [Updated] (SPARK-41906) Handle Function `rand() `
[ https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41906: -- Description: {code:java} df = self.df from pyspark.sql import functions rnd = df.select("key", functions.rand()).collect() for row in rnd: assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1] rndn = df.select("key", functions.randn(5)).collect() for row in rndn: assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1] # If the specified seed is 0, we should use it. # https://issues.apache.org/jira/browse/SPARK-9691 rnd1 = df.select("key", functions.rand(0)).collect() rnd2 = df.select("key", functions.rand(0)).collect() self.assertEqual(sorted(rnd1), sorted(rnd2)) rndn1 = df.select("key", functions.randn(0)).collect() rndn2 = df.select("key", functions.randn(0)).collect() self.assertEqual(sorted(rndn1), sorted(rndn2)){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions rnd = df.select("key", functions.rand()).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select jdf = self._jdf.select(self._jcols(*cols)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols return self._jseq(cols, _to_java_column) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq return _to_seq(self.sparkSession._sc, cols, converter) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq cols = [converter(c) for c in cols] File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in cols = [converter(c) for c in cols] File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column raise TypeError( TypeError: Invalid argument, not a string or column: Column<'rand()'> of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' function. 
{code} was: {code:java} df = self.spark.createDataFrame( [ ( [1, 2, 3], 2, 2, ), ( [4, 5], 2, 2, ), ], ["x", "index", "len"], ) expected = [Row(sliced=[2, 3]), Row(sliced=[5])] self.assertTrue( all( [ df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected, df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == expected, df.select(slice("x", "index", "len").alias("sliced")).collect() == expected, ] ) ) self.assertEqual( df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(), [Row(sliced=[2]), Row(sliced=[4])], ) self.assertEqual( df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(), [Row(sliced=[1, 2]), Row(sliced=[4])], ){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 596, in test_slice df.select(slice("x", "index", "len").alias("sliced")).collect() == expected, File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 332, in wrapped return getattr(functions, f.__name__)(*args, **kwargs) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1525, in slice raise TypeError(f"start should be a Column or int, but got {type(start).__name__}") TypeError: start should be a Column or int, but got str{code} > Handle Function `rand() ` > - > > Key: SPARK-41906 > URL: https://issues.apache.org/jira/browse/SPARK-41906 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.df > from pyspark.sql import functions > rnd = df.select("key", functions.rand()).collect() > for row in rnd: > assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1] > rndn = df.select("key", functions.randn(5)).collect() > for row in rndn: > assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1] > # If the specified seed is 0, we should use it. > # https://issues.apache.org/jira/browse/SPARK-9691 > rnd1 = df.select("key", functions.rand(0)).collect() > rnd2 = df.select("key", functions.rand(0)).collect() > self.assertEqual(sorted(rnd1), sorted(rnd2)) > rndn1 = df.select("key", functions.randn(0)).collect() > rndn2 = df.select("key", functions.randn(0)).collect() > self.assertEqual(sorted(rndn1), sorted(rndn2)){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 29