[jira] [Commented] (SPARK-42002) Implement DataFrameWriterV2 (ReadwriterV2Tests)
[ https://issues.apache.org/jira/browse/SPARK-42002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677094#comment-17677094 ]

Sandeep Singh commented on SPARK-42002:
---------------------------------------

I'm working on this.

> Implement DataFrameWriterV2 (ReadwriterV2Tests)
> -----------------------------------------------
>
>                 Key: SPARK-42002
>                 URL: https://issues.apache.org/jira/browse/SPARK-42002
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> {code}
> pyspark/sql/tests/test_readwriter.py:182 (ReadwriterV2ParityTests.test_api)
> self = <ReadwriterV2ParityTests testMethod=test_api>
>
>     def test_api(self):
>         df = self.df
> >       writer = df.writeTo("testcat.t")
>
> ../test_readwriter.py:185:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = DataFrame[key: bigint, value: string], args = ('testcat.t',), kwargs = {}
>
>     def writeTo(self, *args: Any, **kwargs: Any) -> None:
> >       raise NotImplementedError("writeTo() is not implemented.")
> E       NotImplementedError: writeTo() is not implemented.
>
> ../../connect/dataframe.py:1529: NotImplementedError
> {code}
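For context, a minimal sketch of the DataFrameWriterV2 surface this parity test exercises, written against classic PySpark where writeTo() already works; the catalog name "testcat" is the test suite's own and has to be configured for the sketch to actually run.

{code:python}
# Sketch (assumes a catalog named "testcat" is configured): the
# DataFrameWriterV2 calls the Connect client needs to support.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "value"])

writer = df.writeTo("testcat.t")       # returns a DataFrameWriterV2
(writer
    .using("parquet")                  # table provider
    .tableProperty("prop", "value")    # free-form table property
    .partitionedBy(col("key"))         # partition by an expression
    .createOrReplace())                # create the v2 table, replacing any old one
{code}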
[jira] [Created] (SPARK-42073) Enable pyspark.sql.tests.test_types 2 test cases
Sandeep Singh created SPARK-42073:
----------------------------------

             Summary: Enable pyspark.sql.tests.test_types 2 test cases
                 Key: SPARK-42073
                 URL: https://issues.apache.org/jira/browse/SPARK-42073
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, Tests
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh
            Assignee: Hyukjin Kwon
             Fix For: 3.4.0
[jira] [Commented] (SPARK-42012) Implement DataFrameReader.orc
[ https://issues.apache.org/jira/browse/SPARK-42012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676751#comment-17676751 ]

Sandeep Singh commented on SPARK-42012:
---------------------------------------

Working on this.

> Implement DataFrameReader.orc
> -----------------------------
>
>                 Key: SPARK-42012
>                 URL: https://issues.apache.org/jira/browse/SPARK-42012
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> {code}
> pyspark/sql/tests/test_datasources.py:114 (DataSourcesParityTests.test_read_multiple_orc_file)
> self = <DataSourcesParityTests testMethod=test_read_multiple_orc_file>
>
>     def test_read_multiple_orc_file(self):
> >       df = self.spark.read.orc(
>             [
>                 "python/test_support/sql/orc_partitioned/b=0/c=0",
>                 "python/test_support/sql/orc_partitioned/b=1/c=1",
>             ]
>         )
>
> ../test_datasources.py:116:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = <pyspark.sql.connect.readwriter.DataFrameReader object at 0x7fb170946b50>
> args = (['python/test_support/sql/orc_partitioned/b=0/c=0', 'python/test_support/sql/orc_partitioned/b=1/c=1'],)
> kwargs = {}
>
>     def orc(self, *args: Any, **kwargs: Any) -> None:
> >       raise NotImplementedError("orc() is not implemented.")
> E       NotImplementedError: orc() is not implemented.
>
> ../../connect/readwriter.py:228: NotImplementedError
> {code}
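For reference, the classic-PySpark behavior the Connect client needs to match: DataFrameReader.orc() accepts either a single path or a list of paths. A sketch using the same partition directories as the test:

{code:python}
# Classic PySpark: read.orc() takes a str path or a list of paths.
# The two directories below are the leaf partitions used by the test suite.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.orc(
    [
        "python/test_support/sql/orc_partitioned/b=0/c=0",
        "python/test_support/sql/orc_partitioned/b=1/c=1",
    ]
)
df.show()
{code}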
[jira] [Updated] (SPARK-41820) DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed
[ https://issues.apache.org/jira/browse/SPARK-41820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41820:
----------------------------------

Description (a reproduction snippet was added; the traceback is unchanged):

{code:java}
>>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
>>> df.createOrReplaceGlobalTempView("people")
{code}
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1292, in pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView
Failed example:
    df2.createOrReplaceGlobalTempView("people")
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView[3]>", line 1, in <module>
        df2.createOrReplaceGlobalTempView("people")
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1192, in createOrReplaceGlobalTempView
        self._session.client.execute_command(command)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command
        self._execute(req)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute
        self._handle_error(rpc_error)
      File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error
        raise SparkConnectException(status.message) from None
    pyspark.sql.connect.client.SparkConnectException: requirement failed
{code}

> DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-41820
>                 URL: https://issues.apache.org/jira/browse/SPARK-41820
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] [Created] (SPARK-41922) Implement DataFrame `semanticHash`
Sandeep Singh created SPARK-41922:
----------------------------------

             Summary: Implement DataFrame `semanticHash`
                 Key: SPARK-41922
                 URL: https://issues.apache.org/jira/browse/SPARK-41922
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh
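The issue carries no reproduction; for reference, a minimal sketch of the classic-PySpark contract the Connect client would have to match:

{code:python}
# Classic PySpark: DataFrame.semanticHash() returns an int computed from the
# canonicalized logical plan, so semantically equal plans share a hash.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.range(10)
df2 = spark.range(10)

h1, h2 = df1.semanticHash(), df2.semanticHash()
assert isinstance(h1, int)
assert h1 == h2  # identical plans hash identically
{code}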
[jira] [Commented] (SPARK-41874) Implement DataFrame `sameSemantics`
[ https://issues.apache.org/jira/browse/SPARK-41874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655329#comment-17655329 ]

Sandeep Singh commented on SPARK-41874:
---------------------------------------

Working on this.

> Implement DataFrame `sameSemantics`
> -----------------------------------
>
>                 Key: SPARK-41874
>                 URL: https://issues.apache.org/jira/browse/SPARK-41874
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
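As with `semanticHash` above, a minimal sketch of the classic behavior:

{code:python}
# Classic PySpark: sameSemantics(other) is True iff the two analyzed plans
# are semantically equal, meaning cached results are interchangeable.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.range(10)
df2 = spark.range(10)
df3 = spark.range(10).filter("id > 3")

assert df1.sameSemantics(df2) is True
assert df1.sameSemantics(df3) is False
{code}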
[jira] [Updated] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41824:
----------------------------------

Description (a reproduction snippet was added; the doctest failures are unchanged):

{code:java}
df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
df.explain()
df.explain(True)
df.explain(mode="formatted")
df.explain("cost")
{code}
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
Failed example:
    df.explain()
Expected:
    == Physical Plan ==
    *(1) Scan ExistingRDD[age...,name...]
Got:
    == Physical Plan ==
    LocalTableScan [age#1148L, name#1149]

**********************************************************************
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
Failed example:
    df.explain(mode="formatted")
Expected:
    == Physical Plan ==
    * Scan ExistingRDD (...)
    (1) Scan ExistingRDD [codegen id : ...]
    Output [2]: [age..., name...]
    ...
Got:
    == Physical Plan ==
    LocalTableScan (1)

    (1) LocalTableScan
    Output [2]: [age#1170L, name#1171]
    Arguments: [age#1170L, name#1171]
{code}

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
>                 Key: SPARK-41824
>                 URL: https://issues.apache.org/jira/browse/SPARK-41824
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] [Commented] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark
[ https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655293#comment-17655293 ]

Sandeep Singh commented on SPARK-41824:
---------------------------------------

This is from the doctests: `./python/run-tests --testnames 'pyspark.sql.connect.dataframe'`

> Implement DataFrame.explain format to be similar to PySpark
> -----------------------------------------------------------
>
>                 Key: SPARK-41824
>                 URL: https://issues.apache.org/jira/browse/SPARK-41824
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818 ]

Sandeep Singh deleted comment on SPARK-41818:
---------------------------------------------

was (Author: techaddict): Could be moved under https://issues.apache.org/jira/browse/SPARK-41279

> Support DataFrameWriter.saveAsTable
> -----------------------------------
>
>                 Key: SPARK-41818
>                 URL: https://issues.apache.org/jira/browse/SPARK-41818
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 369, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
>     df.write.saveAsTable("tblA")
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[2]>", line 1, in <module>
>         df.write.saveAsTable("tblA")
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 350, in saveAsTable
>         self._spark.client.execute_command(self._write.command(self._spark.client))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command
>         self._execute(req)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute
>         self._handle_error(rpc_error)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: (java.lang.ClassNotFoundException) .DefaultSource
> {code}
[jira] [Created] (SPARK-41921) Enable doctests in connect.column and connect.functions
Sandeep Singh created SPARK-41921:
----------------------------------

             Summary: Enable doctests in connect.column and connect.functions
                 Key: SPARK-41921
                 URL: https://issues.apache.org/jira/browse/SPARK-41921
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh
            Assignee: Sandeep Singh
             Fix For: 3.4.0
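For orientation, pyspark.sql modules expose their docstring examples through a `_test()` entry point that `python/run-tests` invokes. Below is a rough sketch of that harness pattern in its generic classic-session form; the actual connect change also has to build a Spark Connect session, which this sketch does not show.

{code:python}
# Sketch of the doctest harness pattern used across pyspark.sql modules.
# Enabling doctests for a module means giving it a _test() like this and
# registering the module name with python/run-tests.
import doctest
import sys

import pyspark.sql.functions  # stand-in; the ticket targets the connect modules
from pyspark.sql import SparkSession


def _test() -> None:
    globs = pyspark.sql.functions.__dict__.copy()
    globs["spark"] = SparkSession.builder.master("local[4]").getOrCreate()
    failure_count, _ = doctest.testmod(
        pyspark.sql.functions,
        globs=globs,
        optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE,
    )
    globs["spark"].stop()
    if failure_count:
        sys.exit(-1)


if __name__ == "__main__":
    _test()
{code}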
[jira] [Created] (SPARK-41907) Function `sampleby` return parity
Sandeep Singh created SPARK-41907:
----------------------------------

             Summary: Function `sampleby` return parity
                 Key: SPARK-41907
                 URL: https://issues.apache.org/jira/browse/SPARK-41907
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh

{code:java}
df = self.df
from pyspark.sql import functions

rnd = df.select("key", functions.rand()).collect()
for row in rnd:
    assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
rndn = df.select("key", functions.randn(5)).collect()
for row in rndn:
    assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]

# If the specified seed is 0, we should use it.
# https://issues.apache.org/jira/browse/SPARK-9691
rnd1 = df.select("key", functions.rand(0)).collect()
rnd2 = df.select("key", functions.rand(0)).collect()
self.assertEqual(sorted(rnd1), sorted(rnd2))

rndn1 = df.select("key", functions.randn(0)).collect()
rndn2 = df.select("key", functions.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2))
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions
    rnd = df.select("key", functions.rand()).collect()
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols
    return self._jseq(cols, _to_java_column)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq
    return _to_seq(self.sparkSession._sc, cols, converter)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq
    cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in <listcomp>
    cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column
    raise TypeError(
TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
{code}
[jira] [Updated] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41907:
----------------------------------

Description (was: the rand()/randn() repro, which now lives under SPARK-41906):

{code:java}
df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
self.assertTrue(sampled.count() == 35)
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 202, in test_sampleby
    self.assertTrue(sampled.count() == 35)
AssertionError: False is not true
{code}

> Function `sampleby` return parity
> ---------------------------------
>
>                 Key: SPARK-41907
>                 URL: https://issues.apache.org/jira/browse/SPARK-41907
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
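A note on why an exact count is assertable at all: sampleBy draws a stratified sample per value of the column, strata without a fraction entry are dropped entirely, and with a fixed seed the classic sampler is deterministic. A sketch:

{code:python}
# sampleBy semantics (classic PySpark): fractions maps stratum -> sampling
# rate; b == 2 has no entry, so those rows never appear. With seed=0 the
# classic sampler is deterministic, hence the exact-count assertion (35).
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])

sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
assert sampled.filter("b = 2").count() == 0
print(sampled.count())  # 35 on classic PySpark with this seed
{code}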
[jira] [Updated] (SPARK-41906) Handle Function `rand()`
[ https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41906:
----------------------------------

Description (was: the slice() repro, which now lives under SPARK-41905):

{code:java}
df = self.df
from pyspark.sql import functions

rnd = df.select("key", functions.rand()).collect()
for row in rnd:
    assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
rndn = df.select("key", functions.randn(5)).collect()
for row in rndn:
    assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]

# If the specified seed is 0, we should use it.
# https://issues.apache.org/jira/browse/SPARK-9691
rnd1 = df.select("key", functions.rand(0)).collect()
rnd2 = df.select("key", functions.rand(0)).collect()
self.assertEqual(sorted(rnd1), sorted(rnd2))

rndn1 = df.select("key", functions.randn(0)).collect()
rndn2 = df.select("key", functions.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2))
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 299, in test_rand_functions
    rnd = df.select("key", functions.rand()).collect()
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2917, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2537, in _jcols
    return self._jseq(cols, _to_java_column)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", line 2524, in _jseq
    return _to_seq(self.sparkSession._sc, cols, converter)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in _to_seq
    cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 86, in <listcomp>
    cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 65, in _to_java_column
    raise TypeError(
TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
{code}

> Handle Function `rand()`
> ------------------------
>
>                 Key: SPARK-41906
>                 URL: https://issues.apache.org/jira/browse/SPARK-41906
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] [Created] (SPARK-41906) Handle Function `rand()`
Sandeep Singh created SPARK-41906:
----------------------------------

             Summary: Handle Function `rand()`
                 Key: SPARK-41906
                 URL: https://issues.apache.org/jira/browse/SPARK-41906
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh

{code:java}
df = self.spark.createDataFrame(
    [
        (
            [1, 2, 3],
            2,
            2,
        ),
        (
            [4, 5],
            2,
            2,
        ),
    ],
    ["x", "index", "len"],
)

expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
self.assertTrue(
    all(
        [
            df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected,
            df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == expected,
            df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
        ]
    )
)

self.assertEqual(
    df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
    [Row(sliced=[2]), Row(sliced=[4])],
)
self.assertEqual(
    df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
    [Row(sliced=[1, 2]), Row(sliced=[4])],
)
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 596, in test_slice
    df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 332, in wrapped
    return getattr(functions, f.__name__)(*args, **kwargs)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1525, in slice
    raise TypeError(f"start should be a Column or int, but got {type(start).__name__}")
TypeError: start should be a Column or int, but got str
{code}
[jira] [Updated] (SPARK-41905) Function `slice` should handle string in params
[ https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41905:
----------------------------------

Summary: Function `slice` should handle string in params (was: Function `slice` should expect string in params)

> Function `slice` should handle string in params
> -----------------------------------------------
>
>                 Key: SPARK-41905
>                 URL: https://issues.apache.org/jira/browse/SPARK-41905
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] [Updated] (SPARK-41905) Function `slice` should handle string in params
[ https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41905:
----------------------------------

Description (was: the nth_value repro, which now lives under SPARK-41904):

{code:java}
df = self.spark.createDataFrame(
    [
        (
            [1, 2, 3],
            2,
            2,
        ),
        (
            [4, 5],
            2,
            2,
        ),
    ],
    ["x", "index", "len"],
)

expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
self.assertTrue(
    all(
        [
            df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected,
            df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == expected,
            df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
        ]
    )
)

self.assertEqual(
    df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
    [Row(sliced=[2]), Row(sliced=[4])],
)
self.assertEqual(
    df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
    [Row(sliced=[1, 2]), Row(sliced=[4])],
)
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 596, in test_slice
    df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 332, in wrapped
    return getattr(functions, f.__name__)(*args, **kwargs)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1525, in slice
    raise TypeError(f"start should be a Column or int, but got {type(start).__name__}")
TypeError: start should be a Column or int, but got str
{code}

> Function `slice` should handle string in params
> -----------------------------------------------
>
>                 Key: SPARK-41905
>                 URL: https://issues.apache.org/jira/browse/SPARK-41905
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
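For reference, classic PySpark accepts plain column-name strings for all three slice() parameters, which is exactly what the failing assertion exercises. A sketch:

{code:python}
# Classic PySpark: slice() resolves str arguments as column names for x,
# start, and length alike; the Connect client raised TypeError for str.
from pyspark.sql import SparkSession
from pyspark.sql.functions import slice

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 3], 2, 2), ([4, 5], 2, 2)], ["x", "index", "len"])

df.select(slice("x", "index", "len").alias("sliced")).show()
# +------+
# |sliced|
# +------+
# |[2, 3]|
# |   [5]|
# +------+
{code}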
[jira] [Created] (SPARK-41905) Function `slice` should expect string in params
Sandeep Singh created SPARK-41905:
----------------------------------

             Summary: Function `slice` should expect string in params
                 Key: SPARK-41905
                 URL: https://issues.apache.org/jira/browse/SPARK-41905
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh

{code:java}
from pyspark.sql import Window
from pyspark.sql.functions import nth_value

df = self.spark.createDataFrame(
    [
        ("a", 0, None),
        ("a", 1, "x"),
        ("a", 2, "y"),
        ("a", 3, "z"),
        ("a", 4, None),
        ("b", 1, None),
        ("b", 2, None),
    ],
    schema=("key", "order", "value"),
)
w = Window.partitionBy("key").orderBy("order")

rs = df.select(
    df.key,
    df.order,
    nth_value("value", 2).over(w),
    nth_value("value", 2, False).over(w),
    nth_value("value", 2, True).over(w),
).collect()

expected = [
    ("a", 0, None, None, None),
    ("a", 1, "x", "x", None),
    ("a", 2, "x", "x", "y"),
    ("a", 3, "x", "x", "y"),
    ("a", 4, "x", "x", "y"),
    ("b", 1, None, None, None),
    ("b", 2, None, None, None),
]

for r, ex in zip(sorted(rs), sorted(expected)):
    self.assertEqual(tuple(r), ex[: len(r)])
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 755, in test_nth_value
    self.assertEqual(tuple(r), ex[: len(r)])
AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')

First differing element 3:
None
'x'

- ('a', 1, 'x', None)
?               ^^^^
+ ('a', 1, 'x', 'x')
?               ^^^
{code}
[jira] [Updated] (SPARK-41904) Fix Function `nth_value` functions output
[ https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41904:
----------------------------------

Summary: Fix Function `nth_value` functions output (was: Fix `nth_value` functions output)

> Fix Function `nth_value` functions output
> ------------------------------------------
>
>                 Key: SPARK-41904
>                 URL: https://issues.apache.org/jira/browse/SPARK-41904
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] [Updated] (SPARK-41904) Fix `nth_value` functions output
[ https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41904:
----------------------------------

Description (was: the nested higher-order-function repro, which now lives under SPARK-41902):

{code:java}
from pyspark.sql import Window
from pyspark.sql.functions import nth_value

df = self.spark.createDataFrame(
    [
        ("a", 0, None),
        ("a", 1, "x"),
        ("a", 2, "y"),
        ("a", 3, "z"),
        ("a", 4, None),
        ("b", 1, None),
        ("b", 2, None),
    ],
    schema=("key", "order", "value"),
)
w = Window.partitionBy("key").orderBy("order")

rs = df.select(
    df.key,
    df.order,
    nth_value("value", 2).over(w),
    nth_value("value", 2, False).over(w),
    nth_value("value", 2, True).over(w),
).collect()

expected = [
    ("a", 0, None, None, None),
    ("a", 1, "x", "x", None),
    ("a", 2, "x", "x", "y"),
    ("a", 3, "x", "x", "y"),
    ("a", 4, "x", "x", "y"),
    ("b", 1, None, None, None),
    ("b", 2, None, None, None),
]

for r, ex in zip(sorted(rs), sorted(expected)):
    self.assertEqual(tuple(r), ex[: len(r)])
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 755, in test_nth_value
    self.assertEqual(tuple(r), ex[: len(r)])
AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')

First differing element 3:
None
'x'

- ('a', 1, 'x', None)
?               ^^^^
+ ('a', 1, 'x', 'x')
?               ^^^
{code}

> Fix `nth_value` functions output
> --------------------------------
>
>                 Key: SPARK-41904
>                 URL: https://issues.apache.org/jira/browse/SPARK-41904
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] [Created] (SPARK-41904) Fix `nth_value` functions output
Sandeep Singh created SPARK-41904:
----------------------------------

             Summary: Fix `nth_value` functions output
                 Key: SPARK-41904
                 URL: https://issues.apache.org/jira/browse/SPARK-41904
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh

{code:java}
from pyspark.sql.functions import flatten, struct, transform

df = self.spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")

actual = df.select(
    flatten(
        transform(
            "numbers",
            lambda number: transform(
                "letters", lambda letter: struct(number.alias("n"), letter.alias("l"))
            ),
        )
    )
).first()[0]

expected = [
    (1, "a"),
    (1, "b"),
    (1, "c"),
    (2, "a"),
    (2, "b"),
    (2, "c"),
    (3, "a"),
    (3, "b"),
    (3, "c"),
]
self.assertEquals(actual, expected)
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 809, in test_nested_higher_order_function
    self.assertEquals(actual, expected)
AssertionError: Lists differ: [{'n': 'a', 'l': 'a'}, {'n': 'b', 'l': 'b'[151 chars]'c'}] != [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ([43 chars]'c')]

First differing element 0:
{'n': 'a', 'l': 'a'}
(1, 'a')

- [{'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'}]
+ [(1, 'a'),
+  (1, 'b'),
+  (1, 'c'),
+  (2, 'a'),
+  (2, 'b'),
+  (2, 'c'),
+  (3, 'a'),
+  (3, 'b'),
+  (3, 'c')]
{code}
[jira] [Updated] (SPARK-41902) Parity in String representation of higher_order_function's output
[ https://issues.apache.org/jira/browse/SPARK-41902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41902:
----------------------------------

Summary: Parity in String representation of higher_order_function's output (was: Parity in String representation of higher_order_function)

> Parity in String representation of higher_order_function's output
> ------------------------------------------------------------------
>
>                 Key: SPARK-41902
>                 URL: https://issues.apache.org/jira/browse/SPARK-41902
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] [Updated] (SPARK-41902) Parity in String representation of higher_order_function
[ https://issues.apache.org/jira/browse/SPARK-41902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41902:
----------------------------------

Description (was: the map-functions repro from this issue's previous revision, shown in the update notice below):

{code:java}
from pyspark.sql.functions import flatten, struct, transform

df = self.spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")

actual = df.select(
    flatten(
        transform(
            "numbers",
            lambda number: transform(
                "letters", lambda letter: struct(number.alias("n"), letter.alias("l"))
            ),
        )
    )
).first()[0]

expected = [
    (1, "a"),
    (1, "b"),
    (1, "c"),
    (2, "a"),
    (2, "b"),
    (2, "c"),
    (3, "a"),
    (3, "b"),
    (3, "c"),
]
self.assertEquals(actual, expected)
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 809, in test_nested_higher_order_function
    self.assertEquals(actual, expected)
AssertionError: Lists differ: [{'n': 'a', 'l': 'a'}, {'n': 'b', 'l': 'b'[151 chars]'c'}] != [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ([43 chars]'c')]

First differing element 0:
{'n': 'a', 'l': 'a'}
(1, 'a')

- [{'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'}]
+ [(1, 'a'),
+  (1, 'b'),
+  (1, 'c'),
+  (2, 'a'),
+  (2, 'b'),
+  (2, 'c'),
+  (3, 'a'),
+  (3, 'b'),
+  (3, 'c')]
{code}

> Parity in String representation of higher_order_function's output
> ------------------------------------------------------------------
>
>                 Key: SPARK-41902
>                 URL: https://issues.apache.org/jira/browse/SPARK-41902
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
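The root of the mismatch: classic PySpark materializes struct elements as Row objects, and Row is a tuple subclass that compares equal to a plain tuple, while the Connect client returned dicts, which never do. A minimal illustration:

{code:python}
# Row inherits from tuple, so Row(n=1, l='a') == (1, 'a') holds in classic
# PySpark; a dict with the same fields never equals a tuple.
from pyspark.sql import Row

r = Row(n=1, l="a")
assert r == (1, "a")
assert {"n": 1, "l": "a"} != (1, "a")
{code}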
[jira] [Updated] (SPARK-41903) Support data type ndarray
[ https://issues.apache.org/jira/browse/SPARK-41903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41903:
----------------------------------

Description (was: the numpy-scalar lit() repro from the creation notice below):

{code:java}
import numpy as np

arr_dtype_to_spark_dtypes = [
    ("int8", [("b", "array")]),
    ("int16", [("b", "array")]),
    ("int32", [("b", "array")]),
    ("int64", [("b", "array")]),
    ("float32", [("b", "array")]),
    ("float64", [("b", "array")]),
]
for t, expected_spark_dtypes in arr_dtype_to_spark_dtypes:
    arr = np.array([1, 2]).astype(t)
    self.assertEqual(
        expected_spark_dtypes, self.spark.range(1).select(lit(arr).alias("b")).dtypes
    )
arr = np.array([1, 2]).astype(np.uint)
with self.assertRaisesRegex(
    TypeError, "The type of array scalar '%s' is not supported" % arr.dtype
):
    self.spark.range(1).select(lit(arr).alias("b"))
{code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 1100, in test_ndarray_input
    expected_spark_dtypes, self.spark.range(1).select(lit(arr).alias("b")).dtypes
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 332, in wrapped
    return getattr(functions, f.__name__)(*args, **kwargs)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 198, in lit
    return Column(LiteralExpression._from_value(col))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", line 266, in _from_value
    return LiteralExpression(value=value, dataType=LiteralExpression._infer_type(value))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", line 262, in _infer_type
    raise ValueError(f"Unsupported Data Type {type(value).__name__}")
ValueError: Unsupported Data Type ndarray
{code}

> Support data type ndarray
> -------------------------
>
>                 Key: SPARK-41903
>                 URL: https://issues.apache.org/jira/browse/SPARK-41903
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
[jira] [Updated] (SPARK-41902) Fix String representation of maps created by `map_from_arrays`
[ https://issues.apache.org/jira/browse/SPARK-41902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41902: -- Description: {code:java} expected = {"a": 1, "b": 2} expected2 = {"c": 3, "d": 4} df = self.spark.createDataFrame( [(list(expected.keys()), list(expected.values()))], ["k", "v"] ) actual = ( df.select( expr("map('c', 3, 'd', 4) as dict2"), map_from_arrays(df.k, df.v).alias("dict"), "*", ) .select( map_contains_key("dict", "a").alias("one"), map_contains_key("dict", "d").alias("not_exists"), map_keys("dict").alias("keys"), map_values("dict").alias("values"), map_entries("dict").alias("items"), "*", ) .select( map_concat("dict", "dict2").alias("merged"), map_from_entries(arrays_zip("keys", "values")).alias("from_items"), "*", ) .first() ) self.assertEqual(expected, actual["dict"]){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 1142, in test_map_functions self.assertEqual(expected, actual["dict"]) AssertionError: {'a': 1, 'b': 2} != [('a', 1), ('b', 2)]{code} was: {code:java} from pyspark.sql import functions funs = [ (functions.acosh, "ACOSH"), (functions.asinh, "ASINH"), (functions.atanh, "ATANH"), ] cols = ["a", functions.col("a")] for f, alias in funs: for c in cols: self.assertIn(f"{alias}(a)", repr(f(c))){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 271, in test_inverse_trig_functions self.assertIn(f"{alias}(a)", repr(f(c))) AssertionError: 'ACOSH(a)' not found in "Column<'acosh(ColumnReference(a))'>"{code} {code:java} from pyspark.sql.functions import col, lit, overlay from itertools import chain import re actual = list( chain.from_iterable( [ re.findall("(overlay\\(.*\\))", str(x)) for x in [ overlay(col("foo"), col("bar"), 1), overlay("x", "y", 3), overlay(col("x"), col("y"), 1, 3), overlay("x", "y", 2, 5), overlay("x", "y", lit(11)), overlay("x", "y", lit(2), lit(5)), ] ] ) ) expected = [ "overlay(foo, bar, 1, -1)", "overlay(x, y, 3, -1)", "overlay(x, y, 1, 3)", "overlay(x, y, 2, 5)", "overlay(x, y, 11, -1)", "overlay(x, y, 2, 5)", ] self.assertListEqual(actual, expected) df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", "pos", "len")) exp = [Row(ol="SPARK_CORESQL")] self.assertTrue( all( [ df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp, df.select(overlay(df.x, df.y, lit(7), lit(0)).alias("ol")).collect() == exp, df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == exp, ] ) ) {code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 675, in test_overlay self.assertListEqual(actual, expected) AssertionError: Lists differ: ['overlay(ColumnReference(foo), ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', 'overlay(x, y, 3, -1)'[90 chars] 5)'] First differing element 0: 'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))' 'overlay(foo, bar, 1, -1)' - ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(11), Literal(-1))', - 
'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))'] + ['overlay(foo, bar, 1, -1)', + 'overlay(x, y, 3, -1)', + 'overlay(x, y, 1, 3)', + 'overlay(x, y, 2, 5)', + 'overlay(x, y, 11, -1)', + 'overlay(x, y, 2, 5)'] {code} > Fix String representation of maps created by `map_from_arrays` > -- > > Key: SPARK-41902 > URL: https://issues.apache.org/jira/browse/SPARK-41902 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > expected = {"a": 1, "b": 2} > expected2 = {"c": 3, "d": 4} > df = self.spark.createDataFrame( > [(list(expected.keys()), list(expected.values()))], ["k", "v"] > ) > actual = ( > df.select( > expr("map('c', 3, 'd', 4) as dict2"), > map_from_arrays(df.
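For readers skimming the digest, the parity gap tracked above can be reproduced outside the test harness. A minimal sketch, assuming a classic SparkSession bound to `spark` (column names are illustrative):
{code:python}
# Classic PySpark collects MapType values as Python dicts, so this
# assertion holds; the Connect client currently returns a list of
# (key, value) tuples instead, which is the gap this ticket tracks.
from pyspark.sql.functions import map_from_arrays

df = spark.createDataFrame([(["a", "b"], [1, 2])], ["k", "v"])
row = df.select(map_from_arrays("k", "v").alias("m")).first()

assert row["m"] == {"a": 1, "b": 2}  # Connect yields [('a', 1), ('b', 2)]
{code}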
[jira] [Created] (SPARK-41903) Support data type ndarray
Sandeep Singh created SPARK-41903: - Summary: Support data type ndarray Key: SPARK-41903 URL: https://issues.apache.org/jira/browse/SPARK-41903 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} import numpy as np from pyspark.sql.functions import lit dtype_to_spark_dtypes = [ (np.int8, [("CAST(1 AS TINYINT)", "tinyint")]), (np.int16, [("CAST(1 AS SMALLINT)", "smallint")]), (np.int32, [("CAST(1 AS INT)", "int")]), (np.int64, [("CAST(1 AS BIGINT)", "bigint")]), (np.float32, [("CAST(1.0 AS FLOAT)", "float")]), (np.float64, [("CAST(1.0 AS DOUBLE)", "double")]), (np.bool_, [("true", "boolean")]), ] for dtype, spark_dtypes in dtype_to_spark_dtypes: self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, spark_dtypes){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 1064, in test_lit_np_scalar self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, spark_dtypes) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 332, in wrapped return getattr(functions, f.__name__)(*args, **kwargs) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 198, in lit return Column(LiteralExpression._from_value(col)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", line 266, in _from_value return LiteralExpression(value=value, dataType=LiteralExpression._infer_type(value)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", line 262, in _infer_type raise ValueError(f"Unsupported Data Type {type(value).__name__}") ValueError: Unsupported Data Type int8 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
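One plausible shape for the fix, sketched here as an assumption rather than the actual `pyspark.sql.connect.expressions` code: extend literal type inference with a numpy branch. The helper name and the exact mapping below are illustrative only:
{code:python}
# Hypothetical sketch: map numpy scalar dtypes to Spark literal types
# before falling back to the existing "Unsupported Data Type" error.
import numpy as np
from pyspark.sql.types import (
    BooleanType, ByteType, DataType, DoubleType,
    FloatType, IntegerType, LongType, ShortType,
)

_NUMPY_TO_SPARK = {
    np.dtype("int8"): ByteType(),
    np.dtype("int16"): ShortType(),
    np.dtype("int32"): IntegerType(),
    np.dtype("int64"): LongType(),
    np.dtype("float32"): FloatType(),
    np.dtype("float64"): DoubleType(),
    np.dtype("bool"): BooleanType(),
}

def infer_numpy_scalar_type(value) -> DataType:
    """Return the Spark type for a supported numpy scalar."""
    if isinstance(value, np.generic) and value.dtype in _NUMPY_TO_SPARK:
        return _NUMPY_TO_SPARK[value.dtype]
    raise ValueError(f"Unsupported Data Type {type(value).__name__}")
{code}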
[jira] [Created] (SPARK-41902) Fix String representation of maps created by `map_from_arrays`
Sandeep Singh created SPARK-41902: - Summary: Fix String representation of maps created by `map_from_arrays` Key: SPARK-41902 URL: https://issues.apache.org/jira/browse/SPARK-41902 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} from pyspark.sql import functions funs = [ (functions.acosh, "ACOSH"), (functions.asinh, "ASINH"), (functions.atanh, "ATANH"), ] cols = ["a", functions.col("a")] for f, alias in funs: for c in cols: self.assertIn(f"{alias}(a)", repr(f(c))){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 271, in test_inverse_trig_functions self.assertIn(f"{alias}(a)", repr(f(c))) AssertionError: 'ACOSH(a)' not found in "Column<'acosh(ColumnReference(a))'>"{code} {code:java} from pyspark.sql.functions import col, lit, overlay from itertools import chain import re actual = list( chain.from_iterable( [ re.findall("(overlay\\(.*\\))", str(x)) for x in [ overlay(col("foo"), col("bar"), 1), overlay("x", "y", 3), overlay(col("x"), col("y"), 1, 3), overlay("x", "y", 2, 5), overlay("x", "y", lit(11)), overlay("x", "y", lit(2), lit(5)), ] ] ) ) expected = [ "overlay(foo, bar, 1, -1)", "overlay(x, y, 3, -1)", "overlay(x, y, 1, 3)", "overlay(x, y, 2, 5)", "overlay(x, y, 11, -1)", "overlay(x, y, 2, 5)", ] self.assertListEqual(actual, expected) df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", "pos", "len")) exp = [Row(ol="SPARK_CORESQL")] self.assertTrue( all( [ df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp, df.select(overlay(df.x, df.y, lit(7), lit(0)).alias("ol")).collect() == exp, df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == exp, ] ) ) {code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 675, in test_overlay self.assertListEqual(actual, expected) AssertionError: Lists differ: ['overlay(ColumnReference(foo), ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', 'overlay(x, y, 3, -1)'[90 chars] 5)'] First differing element 0: 'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))' 'overlay(foo, bar, 1, -1)' - ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(11), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))'] + ['overlay(foo, bar, 1, -1)', + 'overlay(x, y, 3, -1)', + 'overlay(x, y, 1, 3)', + 'overlay(x, y, 2, 5)', + 'overlay(x, y, 11, -1)', + 'overlay(x, y, 2, 5)'] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41901) Parity in String representation of Column
[ https://issues.apache.org/jira/browse/SPARK-41901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41901: -- Description: {code:java} from pyspark.sql import functions funs = [ (functions.acosh, "ACOSH"), (functions.asinh, "ASINH"), (functions.atanh, "ATANH"), ] cols = ["a", functions.col("a")] for f, alias in funs: for c in cols: self.assertIn(f"{alias}(a)", repr(f(c))){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 271, in test_inverse_trig_functions self.assertIn(f"{alias}(a)", repr(f(c))) AssertionError: 'ACOSH(a)' not found in "Column<'acosh(ColumnReference(a))'>"{code} {code:java} from pyspark.sql.functions import col, lit, overlay from itertools import chain import re actual = list( chain.from_iterable( [ re.findall("(overlay\\(.*\\))", str(x)) for x in [ overlay(col("foo"), col("bar"), 1), overlay("x", "y", 3), overlay(col("x"), col("y"), 1, 3), overlay("x", "y", 2, 5), overlay("x", "y", lit(11)), overlay("x", "y", lit(2), lit(5)), ] ] ) ) expected = [ "overlay(foo, bar, 1, -1)", "overlay(x, y, 3, -1)", "overlay(x, y, 1, 3)", "overlay(x, y, 2, 5)", "overlay(x, y, 11, -1)", "overlay(x, y, 2, 5)", ] self.assertListEqual(actual, expected) df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", "pos", "len")) exp = [Row(ol="SPARK_CORESQL")] self.assertTrue( all( [ df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp, df.select(overlay(df.x, df.y, lit(7), lit(0)).alias("ol")).collect() == exp, df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == exp, ] ) ) {code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 675, in test_overlay self.assertListEqual(actual, expected) AssertionError: Lists differ: ['overlay(ColumnReference(foo), ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', 'overlay(x, y, 3, -1)'[90 chars] 5)'] First differing element 0: 'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))' 'overlay(foo, bar, 1, -1)' - ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(11), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))'] + ['overlay(foo, bar, 1, -1)', + 'overlay(x, y, 3, -1)', + 'overlay(x, y, 1, 3)', + 'overlay(x, y, 2, 5)', + 'overlay(x, y, 11, -1)', + 'overlay(x, y, 2, 5)'] {code} was: {code:java} from pyspark.sql import functions funs = [ (functions.acosh, "ACOSH"), (functions.asinh, "ASINH"), (functions.atanh, "ATANH"), ] cols = ["a", functions.col("a")] for f, alias in funs: for c in cols: self.assertIn(f"{alias}(a)", repr(f(c))){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 271, in test_inverse_trig_functions self.assertIn(f"{alias}(a)", repr(f(c))) AssertionError: 'ACOSH(a)' not found in "Column<'acosh(ColumnReference(a))'>"{code} {code:java} from pyspark.sql.functions import col, lit, overlay from itertools import chain import re actual = list( chain.from_iterable( [ re.findall("(overlay\\(.*\\))", 
str(x)) for x in [ overlay(col("foo"), col("bar"), 1), overlay("x", "y", 3), overlay(col("x"), col("y"), 1, 3), overlay("x", "y", 2, 5), overlay("x", "y", lit(11)), overlay("x", "y", lit(2), lit(5)), ] ] ) ) expected = [ "overlay(foo, bar, 1, -1)", "overlay(x, y, 3, -1)", "overlay(x, y, 1, 3)", "overlay(x, y, 2, 5)", "overlay(x, y, 11, -1)", "overlay(x, y, 2, 5)", ] self.assertListEqual(actual, expected) df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", "pos", "len")) exp = [Row(ol="SPARK_CORESQL")] self.assertTrue( all( [ df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp, df.select(overlay(df.x, df.y, lit(7), lit(0)).alias("ol")).collect() == exp, df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == exp, ] ) ) {code} {code:ja
[jira] [Updated] (SPARK-41901) Parity in String representation of Column
[ https://issues.apache.org/jira/browse/SPARK-41901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41901: -- Description: {code:java} from pyspark.sql import functions funs = [ (functions.acosh, "ACOSH"), (functions.asinh, "ASINH"), (functions.atanh, "ATANH"), ] cols = ["a", functions.col("a")] for f, alias in funs: for c in cols: self.assertIn(f"{alias}(a)", repr(f(c))){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 271, in test_inverse_trig_functions self.assertIn(f"{alias}(a)", repr(f(c))) AssertionError: 'ACOSH(a)' not found in "Column<'acosh(ColumnReference(a))'>"{code} {code:java} from pyspark.sql.functions import col, lit, overlay from itertools import chain import re actual = list( chain.from_iterable( [ re.findall("(overlay\\(.*\\))", str(x)) for x in [ overlay(col("foo"), col("bar"), 1), overlay("x", "y", 3), overlay(col("x"), col("y"), 1, 3), overlay("x", "y", 2, 5), overlay("x", "y", lit(11)), overlay("x", "y", lit(2), lit(5)), ] ] ) ) expected = [ "overlay(foo, bar, 1, -1)", "overlay(x, y, 3, -1)", "overlay(x, y, 1, 3)", "overlay(x, y, 2, 5)", "overlay(x, y, 11, -1)", "overlay(x, y, 2, 5)", ] self.assertListEqual(actual, expected) df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", "pos", "len")) exp = [Row(ol="SPARK_CORESQL")] self.assertTrue( all( [ df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp, df.select(overlay(df.x, df.y, lit(7), lit(0)).alias("ol")).collect() == exp, df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == exp, ] ) ) {code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 675, in test_overlay self.assertListEqual(actual, expected) AssertionError: Lists differ: ['overlay(ColumnReference(foo), ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', 'overlay(x, y, 3, -1)'[90 chars] 5)'] First differing element 0: 'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))' 'overlay(foo, bar, 1, -1)' - ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(11), Literal(-1))', - 'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))'] + ['overlay(foo, bar, 1, -1)', + 'overlay(x, y, 3, -1)', + 'overlay(x, y, 1, 3)', + 'overlay(x, y, 2, 5)', + 'overlay(x, y, 11, -1)', + 'overlay(x, y, 2, 5)'] {code} was: {code:java} dt = datetime.date(2021, 12, 27) # Note; number var in Python gets converted to LongType column; # this is not supported by the function, so cast to Integer explicitly df = self.spark.createDataFrame([Row(date=dt, add=2)], "date date, add integer") self.assertTrue( all( df.select( date_add(df.date, df.add) == datetime.date(2021, 12, 29), date_add(df.date, "add") == datetime.date(2021, 12, 29), date_add(df.date, 3) == datetime.date(2021, 12, 30), ).first() ) ){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 391, in test_date_add_function ).first() File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 246, in first return self.head() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 310, in head rs = self.head(1) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 312, in head return self.take(n) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 317, in take return self.limit(num).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1076, in collect table = self._session.client.to_table(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 414, in to_table table, _ = self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 586, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error
[jira] [Created] (SPARK-41901) Parity in String representation of Column
Sandeep Singh created SPARK-41901: - Summary: Parity in String representation of Column Key: SPARK-41901 URL: https://issues.apache.org/jira/browse/SPARK-41901 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} dt = datetime.date(2021, 12, 27) # Note; number var in Python gets converted to LongType column; # this is not supported by the function, so cast to Integer explicitly df = self.spark.createDataFrame([Row(date=dt, add=2)], "date date, add integer") self.assertTrue( all( df.select( date_add(df.date, df.add) == datetime.date(2021, 12, 29), date_add(df.date, "add") == datetime.date(2021, 12, 29), date_add(df.date, 3) == datetime.date(2021, 12, 30), ).first() ) ){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 391, in test_date_add_function ).first() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 246, in first return self.head() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 310, in head rs = self.head(1) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 312, in head return self.take(n) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 317, in take return self.limit(num).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1076, in collect table = self._session.client.to_table(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 414, in to_table table, _ = self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 586, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "date_add(date, add)" due to data type mismatch: Parameter 2 requires the ("INT" or "SMALLINT" or "TINYINT") type, however "add" has the type "BIGINT". Plan: 'GlobalLimit 1 +- 'LocalLimit 1 +- 'Project [unresolvedalias('`==`(date_add(date#753, add#754L), 2021-12-29), None), unresolvedalias('`==`(date_add(date#753, add#754L), 2021-12-29), None), (date_add(date#753, 3) = 2021-12-30) AS (date_add(date, 3) = DATE '2021-12-30')#759] +- Project [date#753, add#754L] +- Project [date#749 AS date#753, add#750L AS add#754L] +- LocalRelation [date#749, add#750L]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
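Until Connect's createDataFrame stops widening Python ints to BIGINT, a workaround consistent with the traceback above is an explicit downcast before calling date_add (the quoted test already passes a Column as the second argument, which date_add supports). A sketch, assuming a SparkSession `spark`:
{code:python}
# Cast the inferred BIGINT column down to INT so date_add's parameter-2
# type check ("INT" or "SMALLINT" or "TINYINT") is satisfied.
import datetime
from pyspark.sql import Row
from pyspark.sql.functions import col, date_add

df = spark.createDataFrame([Row(date=datetime.date(2021, 12, 27), add=2)])
row = df.select(date_add(df.date, col("add").cast("int")).alias("d")).first()
assert row["d"] == datetime.date(2021, 12, 29)
{code}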
[jira] [Updated] (SPARK-41900) Support data type int8
[ https://issues.apache.org/jira/browse/SPARK-41900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41900: -- Description: {code:java} import numpy as np from pyspark.sql.functions import lit dtype_to_spark_dtypes = [ (np.int8, [("CAST(1 AS TINYINT)", "tinyint")]), (np.int16, [("CAST(1 AS SMALLINT)", "smallint")]), (np.int32, [("CAST(1 AS INT)", "int")]), (np.int64, [("CAST(1 AS BIGINT)", "bigint")]), (np.float32, [("CAST(1.0 AS FLOAT)", "float")]), (np.float64, [("CAST(1.0 AS DOUBLE)", "double")]), (np.bool_, [("true", "boolean")]), ] for dtype, spark_dtypes in dtype_to_spark_dtypes: self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, spark_dtypes){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 1064, in test_lit_np_scalar self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, spark_dtypes) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 332, in wrapped return getattr(functions, f.__name__)(*args, **kwargs) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 198, in lit return Column(LiteralExpression._from_value(col)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", line 266, in _from_value return LiteralExpression(value=value, dataType=LiteralExpression._infer_type(value)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", line 262, in _infer_type raise ValueError(f"Unsupported Data Type {type(value).__name__}") ValueError: Unsupported Data Type int8 {code} was: {code:java} row = self.spark.createDataFrame([("Alice", None, None, None)], schema).fillna(True).first() self.assertEqual(row.age, None){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 231, in test_fillna self.assertEqual(row.age, None) AssertionError: nan != None{code} {code:java} row = ( self.spark.createDataFrame([("Alice", 10, None)], schema) .replace(10, 20, subset=["name", "height"]) .first() ) self.assertEqual(row.name, "Alice") self.assertEqual(row.age, 10) self.assertEqual(row.height, None) {code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace self.assertEqual(row.height, None) AssertionError: nan != None {code} > Support data type int8 > -- > > Key: SPARK-41900 > URL: https://issues.apache.org/jira/browse/SPARK-41900 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > import numpy as np > from pyspark.sql.functions import lit > dtype_to_spark_dtypes = [ > (np.int8, [("CAST(1 AS TINYINT)", "tinyint")]), > (np.int16, [("CAST(1 AS SMALLINT)", "smallint")]), > (np.int32, [("CAST(1 AS INT)", "int")]), > (np.int64, [("CAST(1 AS BIGINT)", "bigint")]), > (np.float32, [("CAST(1.0 AS FLOAT)", "float")]), > (np.float64, [("CAST(1.0 AS DOUBLE)", "double")]), > (np.bool_, [("true", "boolean")]), > ] > for dtype, spark_dtypes in dtype_to_spark_dtypes: > self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, > spark_dtypes){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 1064, in test_lit_np_scalar > 
self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, > spark_dtypes) > File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line > 332, in wrapped > return getattr(functions, f.__name__)(*args, **kwargs) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 198, in lit > return Column(LiteralExpression._from_value(col)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", > line 266, in _from_value > return LiteralExpression(value=value, > dataType=LiteralExpression._infer_type(value)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", > line 262, in _infer_type > raise ValueError(f"Unsupported Data Type {type(value).__name__}") > ValueError: Unsupported Data Type int8 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41900) Support data type int8
Sandeep Singh created SPARK-41900: - Summary: Support data type int8 Key: SPARK-41900 URL: https://issues.apache.org/jira/browse/SPARK-41900 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} row = self.spark.createDataFrame([("Alice", None, None, None)], schema).fillna(True).first() self.assertEqual(row.age, None){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 231, in test_fillna self.assertEqual(row.age, None) AssertionError: nan != None{code} {code:java} row = ( self.spark.createDataFrame([("Alice", 10, None)], schema) .replace(10, 20, subset=["name", "height"]) .first() ) self.assertEqual(row.name, "Alice") self.assertEqual(row.age, 10) self.assertEqual(row.height, None) {code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace self.assertEqual(row.height, None) AssertionError: nan != None {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
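Both failures above come down to NaN versus None, which is easy to misread in the assertion output; a quick, self-contained illustration of the distinction:
{code:python}
# NaN is a float that compares unequal to everything, including itself;
# None is Python's null. fillna/replace under Connect is producing the
# former where classic PySpark produces the latter.
import math

nan = float("nan")
assert nan != None  # noqa: E711 -- hence "nan != None" in the failures
assert nan != nan   # NaN is not even equal to itself
assert math.isnan(nan)
{code}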
[jira] [Updated] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41898: -- Description: {code:java} df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"]) w = Window.partitionBy("value").orderBy("key") from pyspark.sql import functions as F sel = df.select( df.value, df.key, F.max("key").over(w.rowsBetween(0, 1)), F.min("key").over(w.rowsBetween(0, 1)), F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), F.row_number().over(w), F.rank().over(w), F.dense_rank().over(w), F.ntile(2).over(w), ) rs = sorted(sel.collect()){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 821, in test_window_functions F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", line 152, in rowsBetween raise TypeError(f"start must be a int, but got {type(start).__name__}") TypeError: start must be a int, but got float {code} was: {code:java} from pyspark.sql.functions import assert_true df = self.spark.range(3) self.assertEqual( df.select(assert_true(df.id < 3)).toDF("val").collect(), [Row(val=None), Row(val=None), Row(val=None)], ) with self.assertRaises(Py4JJavaError) as cm: df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code} {code:java} df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"]) w = Window.partitionBy("value").orderBy("key") from pyspark.sql import functions as F sel = df.select( df.value, df.key, F.max("key").over(w.rowsBetween(0, 1)), F.min("key").over(w.rowsBetween(0, 1)), F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), F.row_number().over(w), F.rank().over(w), F.dense_rank().over(w), F.ntile(2).over(w), ) rs = sorted(sel.collect()){code} > Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as > argument > > > Key: SPARK-41898 > URL: https://issues.apache.org/jira/browse/SPARK-41898 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], > ["key", "value"]) > w = Window.partitionBy("value").orderBy("key") > from pyspark.sql import functions as F > sel = df.select( > df.value, > df.key, > F.max("key").over(w.rowsBetween(0, 1)), > F.min("key").over(w.rowsBetween(0, 1)), > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), > F.row_number().over(w), > F.rank().over(w), > F.dense_rank().over(w), > F.ntile(2).over(w), > ) > rs = sorted(sel.collect()){code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 821, in test_window_functions > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", > line 152, in rowsBetween raise TypeError(f"start must be a int, but got > {type(start).__name__}") TypeError: start must be a int, but got float {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
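Classic PySpark coerces the float infinities to the unbounded window sentinels before validating the frame; a sketch of the coercion Connect's `rowsBetween` could apply (constants and names here are assumptions, not the actual `connect/window.py` code):
{code:python}
# Hypothetical normalization step: translate float("±inf") to the
# unbounded frame sentinels, then keep the existing int check.
JAVA_MIN_LONG = -(1 << 63)      # Window.unboundedPreceding
JAVA_MAX_LONG = (1 << 63) - 1   # Window.unboundedFollowing

def normalize_frame_boundary(value):
    if isinstance(value, float):
        if value == float("-inf"):
            return JAVA_MIN_LONG
        if value == float("inf"):
            return JAVA_MAX_LONG
    if not isinstance(value, int):
        raise TypeError(f"boundary must be an int, but got {type(value).__name__}")
    return value
{code}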
[jira] [Updated] (SPARK-41899) DataFrame.createDataFrame converting int to bigint
[ https://issues.apache.org/jira/browse/SPARK-41899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41899: -- Description: {code:java} dt = datetime.date(2021, 12, 27) # Note; number var in Python gets converted to LongType column; # this is not supported by the function, so cast to Integer explicitly df = self.spark.createDataFrame([Row(date=dt, add=2)], "date date, add integer") self.assertTrue( all( df.select( date_add(df.date, df.add) == datetime.date(2021, 12, 29), date_add(df.date, "add") == datetime.date(2021, 12, 29), date_add(df.date, 3) == datetime.date(2021, 12, 30), ).first() ) ){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 391, in test_date_add_function ).first() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 246, in first return self.head() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 310, in head rs = self.head(1) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 312, in head return self.take(n) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 317, in take return self.limit(num).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1076, in collect table = self._session.client.to_table(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 414, in to_table table, _ = self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 586, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 625, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "date_add(date, add)" due to data type mismatch: Parameter 2 requires the ("INT" or "SMALLINT" or "TINYINT") type, however "add" has the type "BIGINT". 
Plan: 'GlobalLimit 1 +- 'LocalLimit 1 +- 'Project [unresolvedalias('`==`(date_add(date#753, add#754L), 2021-12-29), None), unresolvedalias('`==`(date_add(date#753, add#754L), 2021-12-29), None), (date_add(date#753, 3) = 2021-12-30) AS (date_add(date, 3) = DATE '2021-12-30')#759] +- Project [date#753, add#754L] +- Project [date#749 AS date#753, add#750L AS add#754L] +- LocalRelation [date#749, add#750L]{code} was: {code:java} from pyspark.sql.functions import assert_true df = self.spark.range(3) self.assertEqual( df.select(assert_true(df.id < 3)).toDF("val").collect(), [Row(val=None), Row(val=None), Row(val=None)], ) with self.assertRaises(Py4JJavaError) as cm: df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code} {code:java} df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"]) w = Window.partitionBy("value").orderBy("key") from pyspark.sql import functions as F sel = df.select( df.value, df.key, F.max("key").over(w.rowsBetween(0, 1)), F.min("key").over(w.rowsBetween(0, 1)), F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), F.row_number().over(w), F.rank().over(w), F.dense_rank().over(w), F.ntile(2).over(w), ) rs = sorted(sel.collect()){code} > DataFrame.createDataFrame converting int to bigint > -- > > Key: SPARK-41899 > URL: https://issues.apache.org/jira/browse/SPARK-41899 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > dt = datetime.date(2021, 12, 27) > # Note; number var in Python gets converted to LongType column; > # this is not supported by the function, so cast to Integer explicitly > df = self.spark.createDataFrame([Row(date=dt, add=2)], "date date, add > integer") > self.assertTrue( > all( > df.select( > date_add(df.date, df.add) == datetime.date(2021, 12, 29), > date_add(df.date, "add") == datetime.date(2021, 12, 29), > date_add(df.date, 3) == datetime.date(2021, 12, 30), > ).first() > ) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 391, in test_date_add_function > ).first() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 246, in first > return self.head() > File > "/Us
[jira] [Created] (SPARK-41899) DataFrame.createDataFrame converting int to bigint
Sandeep Singh created SPARK-41899: - Summary: DataFrame.createDataFrame converting int to bigint Key: SPARK-41899 URL: https://issues.apache.org/jira/browse/SPARK-41899 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} from pyspark.sql.functions import assert_true df = self.spark.range(3) self.assertEqual( df.select(assert_true(df.id < 3)).toDF("val").collect(), [Row(val=None), Row(val=None), Row(val=None)], ) with self.assertRaises(Py4JJavaError) as cm: df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code} {code:java} df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"]) w = Window.partitionBy("value").orderBy("key") from pyspark.sql import functions as F sel = df.select( df.value, df.key, F.max("key").over(w.rowsBetween(0, 1)), F.min("key").over(w.rowsBetween(0, 1)), F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), F.row_number().over(w), F.rank().over(w), F.dense_rank().over(w), F.ntile(2).over(w), ) rs = sorted(sel.collect()){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
[ https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41898: -- Description: {code:java} from pyspark.sql.functions import assert_true df = self.spark.range(3) self.assertEqual( df.select(assert_true(df.id < 3)).toDF("val").collect(), [Row(val=None), Row(val=None), Row(val=None)], ) with self.assertRaises(Py4JJavaError) as cm: df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code} {code:java} df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"]) w = Window.partitionBy("value").orderBy("key") from pyspark.sql import functions as F sel = df.select( df.value, df.key, F.max("key").over(w.rowsBetween(0, 1)), F.min("key").over(w.rowsBetween(0, 1)), F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), F.row_number().over(w), F.rank().over(w), F.dense_rank().over(w), F.ntile(2).over(w), ) rs = sorted(sel.collect()){code} was: PySpark throws Py4JJavaError whereas Connect throws SparkConnectException {code:java} from pyspark.sql.functions import assert_true df = self.spark.range(3) self.assertEqual( df.select(assert_true(df.id < 3)).toDF("val").collect(), [Row(val=None), Row(val=None), Row(val=None)], ) with self.assertRaises(Py4JJavaError) as cm: df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 950, in test_assert_true df.select(assert_true(df.id < 2, "too big")).toDF("val").collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1076, in collect table = self._session.client.to_table(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 414, in to_table table, _ = self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 586, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 629, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (java.lang.RuntimeException) too big {code} > Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as > argument > > > Key: SPARK-41898 > URL: https://issues.apache.org/jira/browse/SPARK-41898 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], > ["key", "value"]) > w = Window.partitionBy("value").orderBy("key") > from pyspark.sql import functions as F > sel = df.select( > df.value, > df.key, > F.max("key").over(w.rowsBetween(0, 1)), > F.min("key").over(w.rowsBetween(0, 1)), > F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))), > F.row_number().over(w), > F.rank().over(w), > F.dense_rank().over(w), > F.ntile(2).over(w), > ) > rs = sorted(sel.collect()){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument
Sandeep Singh created SPARK-41898: - Summary: Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument Key: SPARK-41898 URL: https://issues.apache.org/jira/browse/SPARK-41898 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh PySpark throws Py4JJavaError whereas Connect throws SparkConnectException {code:java} from pyspark.sql.functions import assert_true df = self.spark.range(3) self.assertEqual( df.select(assert_true(df.id < 3)).toDF("val").collect(), [Row(val=None), Row(val=None), Row(val=None)], ) with self.assertRaises(Py4JJavaError) as cm: df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 950, in test_assert_true df.select(assert_true(df.id < 2, "too big")).toDF("val").collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1076, in collect table = self._session.client.to_table(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 414, in to_table table, _ = self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 586, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 629, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (java.lang.RuntimeException) too big {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
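Until the two clients raise the same error types, a parity-tolerant test can accept either exception; a sketch using only the import paths already visible in the traceback above:
{code:python}
# Accept the classic Py4J error or the Connect error until they are
# unified under a common PySpark exception hierarchy.
from py4j.protocol import Py4JJavaError
from pyspark.sql.connect.client import SparkConnectException

EXPECTED_ERRORS = (Py4JJavaError, SparkConnectException)

# Inside a unittest.TestCase, assertRaises accepts the tuple directly:
#     with self.assertRaises(EXPECTED_ERRORS):
#         df.select(assert_true(df.id < 2, "too big")).toDF("val").collect()
{code}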
[jira] [Updated] (SPARK-41897) Parity in Error types between pyspark and connect functions
[ https://issues.apache.org/jira/browse/SPARK-41897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41897: -- Description: PySpark throws Py4JJavaError whereas Connect throws SparkConnectException {code:java} from pyspark.sql.functions import assert_true df = self.spark.range(3) self.assertEqual( df.select(assert_true(df.id < 3)).toDF("val").collect(), [Row(val=None), Row(val=None), Row(val=None)], ) with self.assertRaises(Py4JJavaError) as cm: df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 950, in test_assert_true df.select(assert_true(df.id < 2, "too big")).toDF("val").collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1076, in collect table = self._session.client.to_table(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 414, in to_table table, _ = self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 586, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 629, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (java.lang.RuntimeException) too big {code} was: {code:java} df = self.spark.range(10e10).toDF("id") such_a_nice_list = ["itworks1", "itworks2", "itworks3"] hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} > Parity in Error types between pyspark and connect functions > --- > > Key: SPARK-41897 > URL: https://issues.apache.org/jira/browse/SPARK-41897 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > PySpark throws Py4JJavaError whereas Connect throws SparkConnectException > {code:java} > from pyspark.sql.functions import assert_true > df = self.spark.range(3) > self.assertEqual( > df.select(assert_true(df.id < 3)).toDF("val").collect(), > [Row(val=None), Row(val=None), Row(val=None)], > ) > with self.assertRaises(Py4JJavaError) as cm: > df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 950, in test_assert_true > df.select(assert_true(df.id < 2, "too big")).toDF("val").collect() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1076, in collect > table = self._session.client.to_table(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 414, in to_table > table, _ = self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 586, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 629, in _handle_error > raise SparkConnectException(status.message, info.reason) from None > pyspark.sql.connect.client.SparkConnectException: > (java.lang.RuntimeException) too big {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41897) Parity in Error types between pyspark and connect functions
Sandeep Singh created SPARK-41897: - Summary: Parity in Error types between pyspark and connect functions Key: SPARK-41897 URL: https://issues.apache.org/jira/browse/SPARK-41897 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} df = self.spark.range(10e10).toDF("id") such_a_nice_list = ["itworks1", "itworks2", "itworks3"] hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41891) Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, test_window_time, test_reciprocal_trig_functions
[ https://issues.apache.org/jira/browse/SPARK-41891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41891: -- Summary: Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, test_window_time, test_reciprocal_trig_functions (was: Enable 8 tests) > Enable test_add_months_function, test_array_repeat, test_dayofweek, > test_first_last_ignorenulls, test_function_parity, test_inline, > test_window_time, test_reciprocal_trig_functions > > > Key: SPARK-41891 > URL: https://issues.apache.org/jira/browse/SPARK-41891 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41892) Add JIRAs or messages for skipped messages
Sandeep Singh created SPARK-41892: - Summary: Add JIRAs or messages for skipped messages Key: SPARK-41892 URL: https://issues.apache.org/jira/browse/SPARK-41892 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh Assignee: Sandeep Singh Fix For: 3.4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41878) Add JIRAs or messages for skipped tests
[ https://issues.apache.org/jira/browse/SPARK-41878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41878: -- Summary: Add JIRAs or messages for skipped tests (was: Add JIRAs or messages for skipped messages) > Add JIRAs or messages for skipped tests > --- > > Key: SPARK-41878 > URL: https://issues.apache.org/jira/browse/SPARK-41878 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > Add JIRAs or Messages for all the skipped messages. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41891) Enable 8 tests
Sandeep Singh created SPARK-41891: - Summary: Enable 8 tests Key: SPARK-41891 URL: https://issues.apache.org/jira/browse/SPARK-41891 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh Assignee: Sandeep Singh Fix For: 3.4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41887) Support DataFrame hint parameter to be list
Sandeep Singh created SPARK-41887: - Summary: Support DataFrame hint parameter to be list Key: SPARK-41887 URL: https://issues.apache.org/jira/browse/SPARK-41887 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} df = self.spark.range(10e10).toDF("id") such_a_nice_list = ["itworks1", "itworks2", "itworks3"] hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 556, in test_extended_hint_types hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 482, in hint raise TypeError( TypeError: param should be a int or str, but got float 1.2345{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
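Classic PySpark accepts str, int, float, and lists of these as hint parameters, so the fix amounts to widening the Connect-side check; an illustrative sketch, not the actual `connect/dataframe.py` code:
{code:python}
# Hypothetical widened validation for DataFrame.hint parameters.
def check_hint_parameter(param):
    allowed = (str, int, float)
    if isinstance(param, list) and all(isinstance(p, allowed) for p in param):
        return param
    if isinstance(param, allowed):
        return param
    raise TypeError(
        f"param should be a str, int, float or list, but got {type(param).__name__} {param}"
    )
{code}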
[jira] [Updated] (SPARK-41887) Support DataFrame hint parameter to be list
[ https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41887: -- Description: {code:java} df = self.spark.range(10e10).toDF("id") such_a_nice_list = ["itworks1", "itworks2", "itworks3"] hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} was: {code:java} df = self.spark.range(10e10).toDF("id") such_a_nice_list = ["itworks1", "itworks2", "itworks3"] hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 556, in test_extended_hint_types hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 482, in hint raise TypeError( TypeError: param should be a int or str, but got float 1.2345{code} > Support DataFrame hint parameter to be list > --- > > Key: SPARK-41887 > URL: https://issues.apache.org/jira/browse/SPARK-41887 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.range(10e10).toDF("id") > such_a_nice_list = ["itworks1", "itworks2", "itworks3"] > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41871) DataFrame hint parameter can be str, float or int
[ https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41871: -- Summary: DataFrame hint parameter can be str, float or int (was: DataFrame hint parameter can be str, list, float or int) > DataFrame hint parameter can be str, float or int > - > > Key: SPARK-41871 > URL: https://issues.apache.org/jira/browse/SPARK-41871 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.range(10e10).toDF("id") > such_a_nice_list = ["itworks1", "itworks2", "itworks3"] > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 556, in test_extended_hint_types > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 482, in hint > raise TypeError( > TypeError: param should be a int or str, but got float 1.2345{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41884) DataFrame `toPandas` parity in return types
[ https://issues.apache.org/jira/browse/SPARK-41884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41884: -- Description: {code:java} import numpy as np import pandas as pd df = self.spark.createDataFrame( [[[("a", 2, 3.0), ("a", 2, 3.0)]], [[("b", 5, 6.0), ("b", 5, 6.0)]]], "array_struct_col Array>", ) for is_arrow_enabled in [True, False]: with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": is_arrow_enabled}): pdf = df.toPandas() self.assertEqual(type(pdf), pd.DataFrame) self.assertEqual(type(pdf["array_struct_col"]), pd.Series) if is_arrow_enabled: self.assertEqual(type(pdf["array_struct_col"][0]), np.ndarray) else: self.assertEqual(type(pdf["array_struct_col"][0]), list){code} {code:java} Traceback (most recent call last): 1415 File "/__w/spark/spark/python/pyspark/sql/tests/test_dataframe.py", line 1202, in test_to_pandas_for_array_of_struct 1416df = self.spark.createDataFrame( 1417 File "/__w/spark/spark/python/pyspark/sql/connect/session.py", line 264, in createDataFrame 1418table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in _data]) 1419 File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist 1420 File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist 1421 File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays 1422 File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays 1423 File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays 1424 File "pyarrow/array.pxi", line 320, in pyarrow.lib.array 1425 File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array 1426 File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status 1427 File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status 1428pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object{code} {code:java} import numpy as np pdf = self._to_pandas() types = pdf.dtypes self.assertEqual(types[0], np.int32) self.assertEqual(types[1], np.object) self.assertEqual(types[2], np.bool) self.assertEqual(types[3], np.float32) self.assertEqual(types[4], np.object) # datetime.date self.assertEqual(types[5], "datetime64[ns]") self.assertEqual(types[6], "datetime64[ns]") self.assertEqual(types[7], "timedelta64[ns]") {code} {code:java} Traceback (most recent call last): 1434 File "/__w/spark/spark/python/pyspark/sql/tests/test_dataframe.py", line 1039, in test_to_pandas 1435 self.assertEqual(types[5], "datetime64[ns]") 1436AssertionError: datetime64[ns, Etc/UTC] != 'datetime64[ns]' 1437 {code} was: {code:java} schema = StructType( [StructField("i", StringType(), True), StructField("j", IntegerType(), True)] ) df = self.spark.createDataFrame([("a", 1)], schema) schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())]) df1 = df.to(schema1) self.assertEqual(schema1, df1.schema) self.assertEqual(df.count(), df1.count()) schema2 = StructType([StructField("j", LongType())]) df2 = df.to(schema2) self.assertEqual(schema2, df2.schema) self.assertEqual(df.count(), df2.count()) schema3 = StructType([StructField("struct", schema1, False)]) df3 = df.select(struct("i", "j").alias("struct")).to(schema3) self.assertEqual(schema3, df3.schema) self.assertEqual(df.count(), df3.count()) # incompatible field nullability schema4 = StructType([StructField("j", LongType(), False)]) self.assertRaisesRegex( AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) ){code} {code:java} Traceback (most recent call last): File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to self.assertRaisesRegex( AssertionError: AnalysisException not raised by {code} > DataFrame `toPandas` parity in return types > --- > > Key: SPARK-41884 > URL: https://issues.apache.org/jira/browse/SPARK-41884 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > import numpy as np > import pandas as pd > df = self.spark.createDataFrame( > [[[("a", 2, 3.0), ("a", 2, 3.0)]], [[("b", 5, 6.0), ("b", 5, 6.0)]]], > "array_struct_col Array>", > ) > for is_arrow_enabled in [True, False]: > with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": > is_arrow_enabled}): > pdf = df.toPandas() > self.assertEqual(type(pdf), pd.DataFrame) > self.assertEqual(type(pdf["array_struct_col"]), pd.Series) > if is_arrow_enabled: > self.assertEqual(type(pdf["array_struct_col"][0]), np.ndarray) > else: > self.assertEqual(type(pdf["array_struct_col"][0]), list){code} > {code:java} > Trac
[jira] [Created] (SPARK-41884) DataFrame `toPandas` parity in return types
Sandeep Singh created SPARK-41884: - Summary: DataFrame `toPandas` parity in return types Key: SPARK-41884 URL: https://issues.apache.org/jira/browse/SPARK-41884 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} schema = StructType( [StructField("i", StringType(), True), StructField("j", IntegerType(), True)] ) df = self.spark.createDataFrame([("a", 1)], schema) schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())]) df1 = df.to(schema1) self.assertEqual(schema1, df1.schema) self.assertEqual(df.count(), df1.count()) schema2 = StructType([StructField("j", LongType())]) df2 = df.to(schema2) self.assertEqual(schema2, df2.schema) self.assertEqual(df.count(), df2.count()) schema3 = StructType([StructField("struct", schema1, False)]) df3 = df.select(struct("i", "j").alias("struct")).to(schema3) self.assertEqual(schema3, df3.schema) self.assertEqual(df.count(), df3.count()) # incompatible field nullability schema4 = StructType([StructField("j", LongType(), False)]) self.assertRaisesRegex( AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) ){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to self.assertRaisesRegex( AssertionError: AnalysisException not raised by {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41878) Add JIRAs or messages for skipped messages
[ https://issues.apache.org/jira/browse/SPARK-41878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41878: -- Description: Add JIRAs or messages for all the skipped messages. (was: 5 tests pass now. Should enable them.) > Add JIRAs or messages for skipped messages > -- > > Key: SPARK-41878 > URL: https://issues.apache.org/jira/browse/SPARK-41878 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Add JIRAs or messages for all the skipped messages. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41878) Add JIRAs or messages for skipped messages
Sandeep Singh created SPARK-41878: - Summary: Add JIRAs or messages for skipped messages Key: SPARK-41878 URL: https://issues.apache.org/jira/browse/SPARK-41878 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Sandeep Singh Assignee: Hyukjin Kwon Fix For: 3.4.0 5 tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41877) SparkSession.createDataFrame error parity
[ https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41877: -- Description: {code:java} df = self.spark.createDataFrame( [ (1, 10, 1.0, "one"), (2, 20, 2.0, "two"), (3, 30, 3.0, "three"), ], ["id", "int", "double", "str"], ) with self.subTest(desc="with none identifier"): with self.assertRaisesRegex(AssertionError, "ids must not be None"): df.unpivot(None, ["int", "double"], "var", "val"){code} Error: {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 575, in test_unpivot with self.assertRaisesRegex(AssertionError, "ids must not be None"): AssertionError: AssertionError not raised{code} was: {code:java} df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code} Error: {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 65, in test_duplicated_column_names df = self.spark.createDataFrame([(1, 2)], ["c", "c"]) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 277, in createDataFrame raise ValueError( ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements{code} > SparkSession.createDataFrame error parity > - > > Key: SPARK-41877 > URL: https://issues.apache.org/jira/browse/SPARK-41877 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame( > [ > (1, 10, 1.0, "one"), > (2, 20, 2.0, "two"), > (3, 30, 3.0, "three"), > ], > ["id", "int", "double", "str"], > ) > with self.subTest(desc="with none identifier"): > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > df.unpivot(None, ["int", "double"], "var", "val"){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 575, in test_unpivot > with self.assertRaisesRegex(AssertionError, "ids must not be None"): > AssertionError: AssertionError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
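For reference, the parity gap here is the missing client-side validation: classic PySpark asserts on a None `ids` argument before building the plan. A standalone sketch of that check, not the actual connect implementation:
{code:python}
from typing import Any, List, Optional, Union

def validate_unpivot_ids(ids: Optional[Union[str, List[Any]]]) -> List[Any]:
    # Classic PySpark raises this assertion before constructing the
    # Unpivot plan; the connect client currently skips it.
    assert ids is not None, "ids must not be None"
    return [ids] if isinstance(ids, str) else list(ids)

validate_unpivot_ids(["id"])  # ok
validate_unpivot_ids(None)    # AssertionError: ids must not be None
{code}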
[jira] [Created] (SPARK-41877) SparkSession.createDataFrame error parity
Sandeep Singh created SPARK-41877: - Summary: SparkSession.createDataFrame error parity Key: SPARK-41877 URL: https://issues.apache.org/jira/browse/SPARK-41877 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code} Error: {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 65, in test_duplicated_column_names df = self.spark.createDataFrame([(1, 2)], ["c", "c"]) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 277, in createDataFrame raise ValueError( ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41876) Implement DataFrame `toLocalIterator`
Sandeep Singh created SPARK-41876: - Summary: Implement DataFrame `toLocalIterator` Key: SPARK-41876 URL: https://issues.apache.org/jira/browse/SPARK-41876 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} schema = StructType( [StructField("i", StringType(), True), StructField("j", IntegerType(), True)] ) df = self.spark.createDataFrame([("a", 1)], schema) schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())]) df1 = df.to(schema1) self.assertEqual(schema1, df1.schema) self.assertEqual(df.count(), df1.count()) schema2 = StructType([StructField("j", LongType())]) df2 = df.to(schema2) self.assertEqual(schema2, df2.schema) self.assertEqual(df.count(), df2.count()) schema3 = StructType([StructField("struct", schema1, False)]) df3 = df.select(struct("i", "j").alias("struct")).to(schema3) self.assertEqual(schema3, df3.schema) self.assertEqual(df.count(), df3.count()) # incompatible field nullability schema4 = StructType([StructField("j", LongType(), False)]) self.assertRaisesRegex( AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) ){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to self.assertRaisesRegex( AssertionError: AnalysisException not raised by {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
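For context, a short usage sketch of the classic API named in the summary: `toLocalIterator()` streams results to the driver one partition at a time instead of collecting everything at once.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
# Rows are fetched lazily, partition by partition.
for row in df.toLocalIterator():
    print(row.id)
{code}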
[jira] [Updated] (SPARK-41875) Throw proper errors in Dataset.to()
[ https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41875: -- Description: {code:java} schema = StructType( [StructField("i", StringType(), True), StructField("j", IntegerType(), True)] ) df = self.spark.createDataFrame([("a", 1)], schema) schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())]) df1 = df.to(schema1) self.assertEqual(schema1, df1.schema) self.assertEqual(df.count(), df1.count()) schema2 = StructType([StructField("j", LongType())]) df2 = df.to(schema2) self.assertEqual(schema2, df2.schema) self.assertEqual(df.count(), df2.count()) schema3 = StructType([StructField("struct", schema1, False)]) df3 = df.select(struct("i", "j").alias("struct")).to(schema3) self.assertEqual(schema3, df3.schema) self.assertEqual(df.count(), df3.count()) # incompatible field nullability schema4 = StructType([StructField("j", LongType(), False)]) self.assertRaisesRegex( AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) ){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to self.assertRaisesRegex( AssertionError: AnalysisException not raised by {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 401, in pyspark.sql.connect.dataframe.DataFrame.sample Failed example: df.sample(0.5, 3).count() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.sample(0.5, 3).count() TypeError: DataFrame.sample() takes 2 positional arguments but 3 were given ** File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 411, in pyspark.sql.connect.dataframe.DataFrame.sample Failed example: df.sample(False, fraction=1.0).count() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.sample(False, fraction=1.0).count() TypeError: DataFrame.sample() got multiple values for argument 'fraction'{code} > Throw proper errors in Dataset.to() > --- > > Key: SPARK-41875 > URL: https://issues.apache.org/jira/browse/SPARK-41875 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > schema = StructType( > [StructField("i", StringType(), True), StructField("j", IntegerType(), > True)] > ) > df = self.spark.createDataFrame([("a", 1)], schema) > schema1 = StructType([StructField("j", StringType()), StructField("i", > StringType())]) > df1 = df.to(schema1) > self.assertEqual(schema1, df1.schema) > self.assertEqual(df.count(), df1.count()) > schema2 = StructType([StructField("j", LongType())]) > df2 = df.to(schema2) > self.assertEqual(schema2, df2.schema) > self.assertEqual(df.count(), df2.count()) > schema3 = StructType([StructField("struct", schema1, False)]) > df3 = df.select(struct("i", "j").alias("struct")).to(schema3) > self.assertEqual(schema3, df3.schema) > self.assertEqual(df.count(), df3.count()) > # incompatible field nullability > schema4 = StructType([StructField("j", LongType(), False)]) > 
self.assertRaisesRegex( > AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4) > ){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1486, in test_to > self.assertRaisesRegex( > AssertionError: AnalysisException not raised by {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
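The expectation in the test is that the server's analysis failure surfaces client-side as an AnalysisException carrying the NULLABLE_COLUMN_OR_FIELD error class. A sketch of that contract against classic PySpark, where it already holds (assuming, as the test does, that the incompatible-nullability error is raised eagerly at the `to()` call):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1)], StructType(
    [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
))
schema4 = StructType([StructField("j", LongType(), False)])  # non-nullable: incompatible
try:
    df.to(schema4)
    raise AssertionError("expected AnalysisException")
except AnalysisException as e:
    # The error class the parity test matches on:
    assert "NULLABLE_COLUMN_OR_FIELD" in str(e)
{code}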
[jira] [Created] (SPARK-41875) Throw proper errors in Dataset.to()
Sandeep Singh created SPARK-41875: - Summary: Throw proper errors in Dataset.to() Key: SPARK-41875 URL: https://issues.apache.org/jira/browse/SPARK-41875 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 401, in pyspark.sql.connect.dataframe.DataFrame.sample Failed example: df.sample(0.5, 3).count() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.sample(0.5, 3).count() TypeError: DataFrame.sample() takes 2 positional arguments but 3 were given ** File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 411, in pyspark.sql.connect.dataframe.DataFrame.sample Failed example: df.sample(False, fraction=1.0).count() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.sample(False, fraction=1.0).count() TypeError: DataFrame.sample() got multiple values for argument 'fraction'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
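The doctest failures come from a signature mismatch: classic `sample()` lets the first positional argument be either `withReplacement` (bool) or `fraction` (float) and shifts the remaining arguments accordingly. A standalone sketch of that argument normalization, not the connect code itself:
{code:python}
from typing import Optional, Tuple, Union

def normalize_sample_args(
    withReplacement: Optional[Union[bool, float]] = None,
    fraction: Optional[Union[int, float]] = None,
    seed: Optional[int] = None,
) -> Tuple[bool, float, Optional[int]]:
    # df.sample(0.5, 3): the caller really passed (fraction, seed),
    # so shift the arguments one slot to the right.
    if isinstance(withReplacement, float):
        withReplacement, fraction, seed = False, withReplacement, fraction
    return bool(withReplacement), float(fraction), None if seed is None else int(seed)

print(normalize_sample_args(0.5, 3))               # (False, 0.5, 3)
print(normalize_sample_args(False, fraction=1.0))  # (False, 1.0, None)
{code}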
[jira] [Created] (SPARK-41874) Implement DataFrame `sameSemantics`
Sandeep Singh created SPARK-41874: - Summary: Implement DataFrame `sameSemantics` Key: SPARK-41874 URL: https://issues.apache.org/jira/browse/SPARK-41874 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
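A short usage sketch of the classic API being requested: `sameSemantics` compares analyzed plans, not data, so two independently built but identical queries report True.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.range(10)
df2 = spark.range(10)
print(df1.sameSemantics(df2))                   # True: identical plans
print(df1.sameSemantics(df1.filter("id > 1")))  # False: plans differ
{code}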
[jira] [Updated] (SPARK-41872) Fix DataFrame createDataframe handling of None
[ https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41872: -- Summary: Fix DataFrame createDataframe handling of None (was: Fix DataFrame fillna with bool) > Fix DataFrame createDataframe handling of None > -- > > Key: SPARK-41872 > URL: https://issues.apache.org/jira/browse/SPARK-41872 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > row = self.spark.createDataFrame([("Alice", None, None, None)], > schema).fillna(True).first() > self.assertEqual(row.age, None){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 231, in test_fillna > self.assertEqual(row.age, None) > AssertionError: nan != None{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41872) Fix DataFrame createDataframe handling of None
[ https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41872: -- Description: {code:java} row = self.spark.createDataFrame([("Alice", None, None, None)], schema).fillna(True).first() self.assertEqual(row.age, None){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 231, in test_fillna self.assertEqual(row.age, None) AssertionError: nan != None{code} {code:java} row = ( self.spark.createDataFrame([("Alice", 10, None)], schema) .replace(10, 20, subset=["name", "height"]) .first() ) self.assertEqual(row.name, "Alice") self.assertEqual(row.age, 10) self.assertEqual(row.height, None) {code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace self.assertEqual(row.height, None) AssertionError: nan != None {code} was: {code:java} row = self.spark.createDataFrame([("Alice", None, None, None)], schema).fillna(True).first() self.assertEqual(row.age, None){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 231, in test_fillna self.assertEqual(row.age, None) AssertionError: nan != None{code} > Fix DataFrame createDataframe handling of None > -- > > Key: SPARK-41872 > URL: https://issues.apache.org/jira/browse/SPARK-41872 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > row = self.spark.createDataFrame([("Alice", None, None, None)], > schema).fillna(True).first() > self.assertEqual(row.age, None){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 231, in test_fillna > self.assertEqual(row.age, None) > AssertionError: nan != None{code} > > {code:java} > row = ( > self.spark.createDataFrame([("Alice", 10, None)], schema) > .replace(10, 20, subset=["name", "height"]) > .first() > ) > self.assertEqual(row.name, "Alice") > self.assertEqual(row.age, 10) > self.assertEqual(row.height, None) {code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 372, in test_replace self.assertEqual(row.height, None) > AssertionError: nan != None > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
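Arrow itself distinguishes null from NaN, so the information loss happens on the way out of Arrow. A small demonstration of why materializing rows via `Table.to_pylist()` keeps None intact while the pandas detour degrades it; illustration only, not the fix itself:
{code:python}
import pyarrow as pa

tbl = pa.table({"name": ["Alice"], "age": pa.array([None], type=pa.int64())})
print(tbl.to_pylist())         # [{'name': 'Alice', 'age': None}] -- null survives
print(tbl.to_pandas()["age"])  # NaN -- pandas promotes nullable int64 to float64
{code}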
[jira] [Updated] (SPARK-41873) Implement DataFrame `pandas_api`
[ https://issues.apache.org/jira/browse/SPARK-41873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41873: -- Summary: Implement DataFrame `pandas_api` (was: Implement DataFrameReader `pandas_api`) > Implement DataFrame `pandas_api` > > > Key: SPARK-41873 > URL: https://issues.apache.org/jira/browse/SPARK-41873 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41873) Implement DataFrameReader `pandas_api`
Sandeep Singh created SPARK-41873: - Summary: Implement DataFrameReader `pandas_api` Key: SPARK-41873 URL: https://issues.apache.org/jira/browse/SPARK-41873 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
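In classic PySpark, `DataFrameReader.text` is thin sugar over the generic source API, so the failing doctest call is equivalent to the sketch below; the helper is illustrative and assumes the reader's generic `format(...).load(...)` path is available:
{code:python}
def read_text(spark, path):
    # Equivalent to spark.read.text(path) in classic PySpark: one string
    # column named 'value' per input line.
    return spark.read.format("text").load(path)

# df = read_text(spark, path)
{code}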
[jira] [Updated] (SPARK-41873) Implement DataFrame `pandas_api`
[ https://issues.apache.org/jira/browse/SPARK-41873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41873: -- Description: (was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code}) > Implement DataFrame `pandas_api` > > > Key: SPARK-41873 > URL: https://issues.apache.org/jira/browse/SPARK-41873 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41872) Fix DataFrame fillna with bool
[ https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41872: -- Description: {code:java} row = self.spark.createDataFrame([("Alice", None, None, None)], schema).fillna(True).first() self.assertEqual(row.age, None){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 231, in test_fillna self.assertEqual(row.age, None) AssertionError: nan != None{code} was: {code:java} df = self.spark.range(10e10).toDF("id") such_a_nice_list = ["itworks1", "itworks2", "itworks3"] hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 556, in test_extended_hint_types hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 482, in hint raise TypeError( TypeError: param should be a int or str, but got float 1.2345{code} > Fix DataFrame fillna with bool > -- > > Key: SPARK-41872 > URL: https://issues.apache.org/jira/browse/SPARK-41872 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > row = self.spark.createDataFrame([("Alice", None, None, None)], > schema).fillna(True).first() > self.assertEqual(row.age, None){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 231, in test_fillna > self.assertEqual(row.age, None) > AssertionError: nan != None{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41872) Fix DataFrame fillna with bool
Sandeep Singh created SPARK-41872: - Summary: Fix DataFrame fillna with bool Key: SPARK-41872 URL: https://issues.apache.org/jira/browse/SPARK-41872 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} df = self.spark.range(10e10).toDF("id") such_a_nice_list = ["itworks1", "itworks2", "itworks3"] hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 556, in test_extended_hint_types hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 482, in hint raise TypeError( TypeError: param should be a int or str, but got float 1.2345{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41871) DataFrame hint parameter can be str, list, float or int
Sandeep Singh created SPARK-41871: - Summary: DataFrame hint parameter can be str, list, float or int Key: SPARK-41871 URL: https://issues.apache.org/jira/browse/SPARK-41871 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", "age"]) # shouldn't drop a non-null row self.assertEqual(df.dropDuplicates().count(), 2) self.assertEqual(df.dropDuplicates(["name"]).count(), 1) self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2) type_error_msg = "Parameter 'subset' must be a list of columns" with self.assertRaisesRegex(TypeError, type_error_msg): df.dropDuplicates("name"){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 128, in test_drop_duplicates with self.assertRaisesRegex(TypeError, type_error_msg): AssertionError: TypeError not raised{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41871) DataFrame hint parameter can be str, list, float or int
[ https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41871: -- Description: {code:java} df = self.spark.range(10e10).toDF("id") such_a_nice_list = ["itworks1", "itworks2", "itworks3"] hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 556, in test_extended_hint_types hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 482, in hint raise TypeError( TypeError: param should be a int or str, but got float 1.2345{code} was: {code:java} df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", "age"]) # shouldn't drop a non-null row self.assertEqual(df.dropDuplicates().count(), 2) self.assertEqual(df.dropDuplicates(["name"]).count(), 1) self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2) type_error_msg = "Parameter 'subset' must be a list of columns" with self.assertRaisesRegex(TypeError, type_error_msg): df.dropDuplicates("name"){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 128, in test_drop_duplicates with self.assertRaisesRegex(TypeError, type_error_msg): AssertionError: TypeError not raised{code} > DataFrame hint parameter can be str, list, float or int > --- > > Key: SPARK-41871 > URL: https://issues.apache.org/jira/browse/SPARK-41871 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.range(10e10).toDF("id") > such_a_nice_list = ["itworks1", "itworks2", "itworks3"] > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 556, in test_extended_hint_types > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 482, in hint > raise TypeError( > TypeError: param should be a int or str, but got float 1.2345{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
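Classic `DataFrame.hint` accepts str, list, float and int parameters, which is exactly what the failing test exercises while the connect client above still rejects floats. A standalone sketch of that whitelist check:
{code:python}
ALLOWED_HINT_PARAM_TYPES = (str, list, float, int)

def check_hint_parameters(*parameters):
    # Mirrors the validation classic PySpark applies before sending the hint.
    for p in parameters:
        if not isinstance(p, ALLOWED_HINT_PARAM_TYPES):
            raise TypeError(
                "param should be a str, list, float or int, but got "
                f"{type(p).__name__} {p}"
            )

check_hint_parameters(1.2345, "what", ["itworks1", "itworks2", "itworks3"])  # passes
{code}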
[jira] [Created] (SPARK-41870) Handle duplicate columns in `createDataFrame`
Sandeep Singh created SPARK-41870: - Summary: Handle duplicate columns in `createDataFrame` Key: SPARK-41870 URL: https://issues.apache.org/jira/browse/SPARK-41870 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} import array data = [Row(longarray=array.array("l", [-9223372036854775808, 0, 9223372036854775807]))] df = self.spark.createDataFrame(data) {code} Error: {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1220, in test_create_dataframe_from_array_of_long df = self.spark.createDataFrame(data) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 260, in createDataFrame table = pa.Table.from_pylist([row.asDict(recursive=True) for row in _data]) File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays File "pyarrow/array.pxi", line 320, in pyarrow.lib.array File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Could not convert array('l', [-9223372036854775808, 0, 9223372036854775807]) with type array.array: did not recognize Python value type when inferring an Arrow data type{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41870) Handle duplicate columns in `createDataFrame`
[ https://issues.apache.org/jira/browse/SPARK-41870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41870: -- Description: {code:java} df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code} Error: {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 65, in test_duplicated_column_names df = self.spark.createDataFrame([(1, 2)], ["c", "c"]) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 277, in createDataFrame raise ValueError( ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements{code} was: {code:java} import array data = [Row(longarray=array.array("l", [-9223372036854775808, 0, 9223372036854775807]))] df = self.spark.createDataFrame(data) {code} Error: {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1220, in test_create_dataframe_from_array_of_long df = self.spark.createDataFrame(data) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 260, in createDataFrame table = pa.Table.from_pylist([row.asDict(recursive=True) for row in _data]) File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays File "pyarrow/array.pxi", line 320, in pyarrow.lib.array File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Could not convert array('l', [-9223372036854775808, 0, 9223372036854775807]) with type array.array: did not recognize Python value type when inferring an Arrow data type{code} > Handle duplicate columns in `createDataFrame` > - > > Key: SPARK-41870 > URL: https://issues.apache.org/jira/browse/SPARK-41870 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 65, in test_duplicated_column_names > df = self.spark.createDataFrame([(1, 2)], ["c", "c"]) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", > line 277, in createDataFrame > raise ValueError( > ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 > elements{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
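The length-mismatch error comes from the pandas detour: pandas refuses to reassign duplicate axis labels, while Arrow schemas allow duplicate field names. A small demonstration of the distinction, as a hint at why building the table directly avoids the failure:
{code:python}
import pyarrow as pa

# Arrow is happy to carry two fields named 'c'.
tbl = pa.table([pa.array([1]), pa.array([2])], names=["c", "c"])
print(tbl.schema)  # c: int64 / c: int64 -- duplicates are fine in Arrow

# The traceback above fails earlier, while column names are assigned
# through pandas on the way to this table.
{code}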
[jira] [Updated] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument
[ https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41869: -- Description: {code:java} df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", "age"]) # shouldn't drop a non-null row self.assertEqual(df.dropDuplicates().count(), 2) self.assertEqual(df.dropDuplicates(["name"]).count(), 1) self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2) type_error_msg = "Parameter 'subset' must be a list of columns" with self.assertRaisesRegex(TypeError, type_error_msg): df.dropDuplicates("name"){code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 128, in test_drop_duplicates with self.assertRaisesRegex(TypeError, type_error_msg): AssertionError: TypeError not raised{code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1411, in pyspark.sql.connect.functions.map_filter Failed example: df.select(map_filter( "data", lambda _, v: v > 30.0).alias("data_filtered") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(map_filter( File "/Users/s.singh/personal/spark-
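The parity test relies on the argument validation classic PySpark performs in `dropDuplicates`. A standalone sketch of that check, not the connect implementation:
{code:python}
from typing import List, Optional, Sequence

def validate_subset(subset: Optional[Sequence[str]]) -> Optional[List[str]]:
    # A bare string like "name" must be rejected, as in classic PySpark.
    if subset is not None and not isinstance(subset, (list, tuple)):
        raise TypeError("Parameter 'subset' must be a list of columns")
    return None if subset is None else list(subset)

validate_subset(["name"])  # ok
validate_subset("name")    # TypeError, as test_drop_duplicates expects
{code}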
[jira] [Created] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument
Sandeep Singh created SPARK-41869: - Summary: DataFrame dropDuplicates should throw error on non list argument Key: SPARK-41869 URL: https://issues.apache.org/jira/browse/SPARK-41869 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1411, in pyspark.sql.connect.functions.map_filter Failed example: df.select(map_filter( "data", lambda _, v: v > 30.0).alias("data_filtered") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(map_filter( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pand
[jira] [Commented] (SPARK-41855) `createDataFrame` doesn't handle None/NaN properly
[ https://issues.apache.org/jira/browse/SPARK-41855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654255#comment-17654255 ] Sandeep Singh commented on SPARK-41855: --- [~podongfeng] there is another failure that might be similar: {code:java} self.assertEqual( self.spark.createDataFrame(data=[Decimal("NaN")], schema="decimal").collect(), [Row(value=None)], ) {code} cc: [~gurwls223] > `createDataFrame` doesn't handle None/NaN properly > -- > > Key: SPARK-41855 > URL: https://issues.apache.org/jira/browse/SPARK-41855 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code:python} > data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), > Row(id=3, value=None)] > # +---+-+ > # | id|value| > # +---+-+ > # | 1| NaN| > # | 2| 42.0| > # | 3| null| > # +---+-+ > cdf = self.connect.createDataFrame(data) > sdf = self.spark.createDataFrame(data) > print() > print() > print(cdf._show_string(100, 100, False)) > print() > print(cdf.schema) > print() > print(sdf._jdf.showString(100, 100, False)) > print() > print(sdf.schema) > self.compare_by_show(cdf, sdf) > {code} > {code:java} > +---+-+ > | id|value| > +---+-+ > | 1| null| > | 2| 42.0| > | 3| null| > +---+-+ > StructType([StructField('id', LongType(), True), StructField('value', > DoubleType(), True)]) > +---+-+ > | id|value| > +---+-+ > | 1| NaN| > | 2| 42.0| > | 3| null| > +---+-+ > StructType([StructField('id', LongType(), True), StructField('value', > DoubleType(), True)]) > {code} > this issue is due to `createDataFrame` not handling None/NaN properly: > 1) in the conversion from local data to pd.DataFrame, it automatically > converts both None and NaN to NaN; > 2) then in the conversion from pd.DataFrame to pa.Table, it always converts > NaN to null -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
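The two lossy steps called out in the description can be reproduced in isolation; both are visible with plain pandas and pyarrow:
{code:python}
import pandas as pd
import pyarrow as pa

# Step 1: in a float column pandas coerces None to NaN, so the distinction
# is already gone before Arrow sees the data.
pdf = pd.DataFrame({"v": [float("NaN"), 42.0, None]})
print(pdf["v"].tolist())  # [nan, 42.0, nan]

# Step 2: the pandas-to-Arrow conversion then treats every NaN as null.
print(pa.Table.from_pandas(pdf).column("v"))  # [null, 42, null]
{code}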
[jira] [Updated] (SPARK-41856) Enable test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
[ https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41856: -- Summary: Enable test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found (was: Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found) > Enable test_freqItems, test_input_files, test_toDF_with_schema_string, > test_to_pandas_required_pandas_not_found > --- > > Key: SPARK-41856 > URL: https://issues.apache.org/jira/browse/SPARK-41856 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > 5 tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41868) Support data type Duration(NANOSECOND)
[ https://issues.apache.org/jira/browse/SPARK-41868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41868: -- Description: {code:java} import pandas as pd from datetime import timedelta df = self.spark.createDataFrame(pd.DataFrame({"a": [timedelta(microseconds=123)]})) {code} {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1291, in test_create_dataframe_from_pandas_with_day_time_interval self.assertEqual(df.toPandas().a.iloc[0], timedelta(microseconds=123)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: Duration(NANOSECOND){code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1966, in pyspark.sql.connect.functions.hour Failed example: df.select(hour('ts').alias('hour')).collect() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(hour('ts').alias('hour')).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1017, in collect pdf = self.toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: Timestamp(NANOSECOND, null){code} > Support data type Duration(NANOSECOND) > -- > > Key: SPARK-41868 > URL: https://issues.apache.org/jira/browse/SPARK-41868 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > import pandas as pd > from datetime import timedelta > df = self.spark.createDataFrame(pd.DataFrame({"a": > [timedelta(microseconds=123)]})) {code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1291, in test_create_dataframe_from_pandas_with_day_time_interval > self.assertEqual(df.toPandas().a.iloc[0], timedelta(microseconds=123)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 
1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 623, in _handle_error > raise SparkConnectException(status.message, info.reason) from None > pyspark.sql.connect.client.SparkConnectException: > (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: > Duration(NANOSECOND){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
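pandas timedeltas arrive as Arrow duration[ns], which is the type the server rejects above. Casting the column to microsecond precision before shipping is one plausible workaround; whether the client should do this automatically is what this ticket covers. A sketch, assuming pyarrow supports the ns-to-us duration cast (lossy below microsecond precision):
{code:python}
from datetime import timedelta

import pandas as pd
import pyarrow as pa

tbl = pa.Table.from_pandas(pd.DataFrame({"a": [timedelta(microseconds=123)]}))
print(tbl.schema)  # a: duration[ns] -- the unsupported type
tbl_us = tbl.cast(pa.schema([pa.field("a", pa.duration("us"))]))
print(tbl_us.schema)  # a: duration[us]
{code}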
[jira] [Created] (SPARK-41868) Support data type Duration(NANOSECOND)
Sandeep Singh created SPARK-41868: - Summary: Support data type Duration(NANOSECOND) Key: SPARK-41868 URL: https://issues.apache.org/jira/browse/SPARK-41868 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1966, in pyspark.sql.connect.functions.hour Failed example: df.select(hour('ts').alias('hour')).collect() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(hour('ts').alias('hour')).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1017, in collect pdf = self.toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: Timestamp(NANOSECOND, null){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41866) Make `createDataFrame` support array
[ https://issues.apache.org/jira/browse/SPARK-41866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41866: -- Description: {code:java} import array data = [Row(longarray=array.array("l", [-9223372036854775808, 0, 9223372036854775807]))] df = self.spark.createDataFrame(data) {code} Error: {code:java} Traceback (most recent call last): File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1220, in test_create_dataframe_from_array_of_long df = self.spark.createDataFrame(data) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 260, in createDataFrame table = pa.Table.from_pylist([row.asDict(recursive=True) for row in _data]) File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays File "pyarrow/array.pxi", line 320, in pyarrow.lib.array File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Could not convert array('l', [-9223372036854775808, 0, 9223372036854775807]) with type array.array: did not recognize Python value type when inferring an Arrow data type{code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2331, in pyspark.sql.connect.functions.call_udf Failed example: _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) AttributeError: 'SparkSession' object has no attribute 'udf'{code} > Make `createDataFrame` support array > > > Key: SPARK-41866 > URL: https://issues.apache.org/jira/browse/SPARK-41866 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > import array > data = [Row(longarray=array.array("l", [-9223372036854775808, 0, > 9223372036854775807]))] > df = self.spark.createDataFrame(data) {code} > Error: > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 1220, in test_create_dataframe_from_array_of_long > df = self.spark.createDataFrame(data) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", > line 260, in createDataFrame > table = pa.Table.from_pylist([row.asDict(recursive=True) for row in > _data]) > File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist > File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist > File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays > File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays > File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays > File "pyarrow/array.pxi", line 320, in pyarrow.lib.array > File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array > File 
"pyarrow/error.pxi", line 144, in > pyarrow.lib.pyarrow_internal_check_status > File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Could not convert array('l', [-9223372036854775808, > 0, 9223372036854775807]) with type array.array: did not recognize Python > value type when inferring an Arrow data type{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41866) Make `createDataFrame` support array
Sandeep Singh created SPARK-41866: - Summary: Make `createDataFrame` support array Key: SPARK-41866 URL: https://issues.apache.org/jira/browse/SPARK-41866 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2331, in pyspark.sql.connect.functions.call_udf Failed example: _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) AttributeError: 'SparkSession' object has no attribute 'udf'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
[ https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654239#comment-17654239 ] Sandeep Singh commented on SPARK-41856: --- [~gurwls223] for some reason it's still assigned to you > Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, > test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found > -- > > Key: SPARK-41856 > URL: https://issues.apache.org/jira/browse/SPARK-41856 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > 5 tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_app
[ https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41857: -- Summary: Enable test_between_function, test_datetime_functions, test_expr, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile (was: Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile) > Enable test_between_function, test_datetime_functions, test_expr, > test_math_functions, test_window_functions_cumulative_sum, test_corr, > test_cov, test_crosstab, test_approxQuantile > > > Key: SPARK-41857 > URL: https://issues.apache.org/jira/browse/SPARK-41857 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, t
[ https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41857: -- Summary: Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile (was: Enable 10 tests that pass) > Enable test_between_function, test_datetime_functions, test_expr, > test_function_parity, test_math_functions, > test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, > test_approxQuantile > -- > > Key: SPARK-41857 > URL: https://issues.apache.org/jira/browse/SPARK-41857 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41857) Enable 10 tests that pass
Sandeep Singh created SPARK-41857: - Summary: Enable 10 tests that pass Key: SPARK-41857 URL: https://issues.apache.org/jira/browse/SPARK-41857 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh Assignee: Hyukjin Kwon Fix For: 3.4.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
[ https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41856: -- Description: 5 tests pass now. Should enable them. (was: These tests pass now. Should enable them.) > Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, > test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found > -- > > Key: SPARK-41856 > URL: https://issues.apache.org/jira/browse/SPARK-41856 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > 5 tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
Sandeep Singh created SPARK-41856: - Summary: Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found Key: SPARK-41856 URL: https://issues.apache.org/jira/browse/SPARK-41856 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Sandeep Singh Assignee: Hyukjin Kwon Fix For: 3.4.0 These tests pass now. Should enable them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41852) Fix `pmod` function
[ https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653750#comment-17653750 ] Sandeep Singh commented on SPARK-41852: --- [~podongfeng] these are from the doctests {code:java} >>> from pyspark.sql.functions import pmod >>> df = spark.createDataFrame([ ... (1.0, float('nan')), (float('nan'), 2.0), (10.0, 3.0), ... (float('nan'), float('nan')), (-3.0, 4.0), (-10.0, 3.0), ... (-5.0, -6.0), (7.0, -8.0), (1.0, 2.0)], ... ("a", "b")) >>> df.select(pmod("a", "b")).show() {code} > Fix `pmod` function > --- > > Key: SPARK-41852 > URL: https://issues.apache.org/jira/browse/SPARK-41852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 622, in pyspark.sql.connect.functions.pmod > Failed example: > df.select(pmod("a", "b")).show() > Expected: > +--+ > |pmod(a, b)| > +--+ > | NaN| > | NaN| > | 1.0| > | NaN| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > Got: > +--+ > |pmod(a, b)| > +--+ > | null| > | null| > | 1.0| > | null| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
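For reference, a minimal pure-Python sketch of the NaN-propagating semantics the expected output implies (the helper name is ours; this mirrors the doctest, it is not Spark's implementation):

{code:java}
import math

def pmod_ref(a: float, b: float) -> float:
    # NaN in either operand propagates, matching the expected "NaN" rows
    # rather than the "null" rows the Connect client currently returns.
    if math.isnan(a) or math.isnan(b):
        return float("nan")
    r = math.fmod(a, b)  # remainder carrying the sign of `a`
    return math.fmod(r + b, b) if r < 0 else r

assert pmod_ref(10.0, 3.0) == 1.0
assert pmod_ref(-3.0, 4.0) == 1.0
assert pmod_ref(-10.0, 3.0) == 2.0
assert pmod_ref(-5.0, -6.0) == -5.0
assert pmod_ref(7.0, -8.0) == 7.0
assert math.isnan(pmod_ref(1.0, float("nan")))
{code}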
[jira] [Commented] (SPARK-41851) Fix `nanvl` function
[ https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653751#comment-17653751 ] Sandeep Singh commented on SPARK-41851: --- [~podongfeng] {code:java} >>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b")) >>> df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect() {code} > Fix `nanvl` function > > > Key: SPARK-41851 > URL: https://issues.apache.org/jira/browse/SPARK-41851 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 313, in pyspark.sql.connect.functions.nanvl > Failed example: > df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, > df.b).alias("r2")).collect() > Expected: > [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] > Got: > [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
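The scalar rule behind `nanvl` is small enough to state as a pure-Python sketch (helper name ours, not Spark code): keep the first value unless it is NaN, otherwise fall back to the second. The Connect result above suggests the NaN literals are being lost before the function ever runs:

{code:java}
import math

def nanvl_ref(a: float, b: float) -> float:
    # nanvl(a, b): a if a is not NaN, else b
    return a if not math.isnan(a) else b

assert nanvl_ref(1.0, float("nan")) == 1.0  # r1 for the first row
assert nanvl_ref(float("nan"), 2.0) == 2.0  # r1 for the second row
{code}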
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1411, in pyspark.sql.connect.functions.map_filter Failed example: df.select(map_filter( "data", lambda _, v: v > 30.0).alias("data_filtered") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(map_filter( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error)
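The explode failure above comes from schema inference: the Connect client turns the Python dict into a STRUCT column instead of a MAP. A sketch of building the doctest's frame with an explicit MapType schema, which sidesteps dict inference entirely (assumes an existing `spark` session; illustrative workaround, not the fix):

{code:java}
from pyspark.sql.functions import explode
from pyspark.sql.types import (
    ArrayType, IntegerType, MapType, StringType, StructField, StructType,
)

# Spell the schema out so `mapfield` is a real MAP<STRING, STRING>.
schema = StructType([
    StructField("intlist", ArrayType(IntegerType())),
    StructField("mapfield", MapType(StringType(), StringType())),
])
eDF = spark.createDataFrame([([1, 2, 3], {"a": "b"})], schema)
eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
{code}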
[jira] [Created] (SPARK-41852) Fix `pmod` function
Sandeep Singh created SPARK-41852: - Summary: Fix `pmod` function Key: SPARK-41852 URL: https://issues.apache.org/jira/browse/SPARK-41852 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Sandeep Singh Fix For: 3.4.0 {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl Failed example: df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect() Expected: [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] Got: [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41852) Fix `pmod` function
[ https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41852: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 622, in pyspark.sql.connect.functions.pmod Failed example: df.select(pmod("a", "b")).show() Expected: +--+ |pmod(a, b)| +--+ | NaN| | NaN| | 1.0| | NaN| | 1.0| | 2.0| | -5.0| | 7.0| | 1.0| +--+ Got: +--+ |pmod(a, b)| +--+ | null| | null| | 1.0| | null| | 1.0| | 2.0| | -5.0| | 7.0| | 1.0| +--+ {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl Failed example: df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect() Expected: [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] Got: [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} > Fix `pmod` function > --- > > Key: SPARK-41852 > URL: https://issues.apache.org/jira/browse/SPARK-41852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 622, in pyspark.sql.connect.functions.pmod > Failed example: > df.select(pmod("a", "b")).show() > Expected: > +--+ > |pmod(a, b)| > +--+ > | NaN| > | NaN| > | 1.0| > | NaN| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > Got: > +--+ > |pmod(a, b)| > +--+ > | null| > | null| > | 1.0| > | null| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41851) Fix `nanvl` function
Sandeep Singh created SPARK-41851: - Summary: Fix `nanvl` function Key: SPARK-41851 URL: https://issues.apache.org/jira/browse/SPARK-41851 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Sandeep Singh Fix For: 3.4.0 {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count Failed example: df.select(count(expr("*")), count(df.alphabets)).show() Expected: +++ |count(1)|count(alphabets)| +++ | 4| 3| +++ Got: +++ |count(alphabets)|count(alphabets)| +++ | 3| 3| +++ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41851) Fix `nanvl` function
[ https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41851: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl Failed example: df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect() Expected: [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] Got: [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count Failed example: df.select(count(expr("*")), count(df.alphabets)).show() Expected: +++ |count(1)|count(alphabets)| +++ | 4| 3| +++ Got: +++ |count(alphabets)|count(alphabets)| +++ | 3| 3| +++ {code} > Fix `nanvl` function > > > Key: SPARK-41851 > URL: https://issues.apache.org/jira/browse/SPARK-41851 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 313, in pyspark.sql.connect.functions.nanvl > Failed example: > df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, > df.b).alias("r2")).collect() > Expected: > [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] > Got: > [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1411, in pyspark.sql.connect.functions.map_filter Failed example: df.select(map_filter( "data", lambda _, v: v > 30.0).alias("data_filtered") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(map_filter( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error)
[jira] [Commented] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653738#comment-17653738 ] Sandeep Singh commented on SPARK-41850: --- This should be moved under SPARK-41283 > Fix `isnan` function > > > Key: SPARK-41850 > URL: https://issues.apache.org/jira/browse/SPARK-41850 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 288, in pyspark.sql.connect.functions.isnan > Failed example: > df.select("a", "b", isnan("a").alias("r1"), > isnan(df.b).alias("r2")).show() > Expected: > +---+---+-+-+ > | a| b| r1| r2| > +---+---+-+-+ > |1.0|NaN|false| true| > |NaN|2.0| true|false| > +---+---+-+-+ > Got: > +++-+-+ > | a| b| r1| r2| > +++-+-+ > | 1.0|null|false|false| > |null| 2.0|false|false| > +++-+-+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41850: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 288, in pyspark.sql.connect.functions.isnan Failed example: df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show() Expected: +---+---+-+-+ | a| b| r1| r2| +---+---+-+-+ |1.0|NaN|false| true| |NaN|2.0| true|false| +---+---+-+-+ Got: +++-+-+ | a| b| r1| r2| +++-+-+ | 1.0|null|false|false| |null| 2.0|false|false| +++-+-+ {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code} > Fix `isnan` function > > > Key: SPARK-41850 > URL: https://issues.apache.org/jira/browse/SPARK-41850 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 288, in pyspark.sql.connect.functions.isnan > Failed example: > df.select("a", "b", isnan("a").alias("r1"), > isnan(df.b).alias("r2")).show() > Expected: > +---+---+-+-+ > | a| b| r1| r2| > +---+---+-+-+ > |1.0|NaN|false| true| > |NaN|2.0| true|false| > +---+---+-+-+ > Got: > +++-+-+ > | a| b| r1| r2| > +++-+-+ > | 1.0|null|false|false| > |null| 2.0|false|false| > +++-+-+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
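For reference, the predicate the doctest expects, sketched over plain Python values (helper name ours): `isnan` is true exactly for floating-point NaN, and null is not NaN. The "Got" output, with nulls where NaNs were sent, again points at NaN literals being dropped on the way to the server:

{code:java}
import math

def isnan_ref(x) -> bool:
    # True only for a float NaN; None (SQL null) is not NaN.
    return isinstance(x, float) and math.isnan(x)

assert isnan_ref(float("nan")) is True
assert isnan_ref(1.0) is False
assert isnan_ref(None) is False  # why the "Got" table shows false for null rows
{code}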
[jira] [Updated] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41850: -- Summary: Fix `isnan` function (was: Fix DataFrameReader.isnan) > Fix `isnan` function > > > Key: SPARK-41850 > URL: https://issues.apache.org/jira/browse/SPARK-41850 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41850) Fix DataFrameReader.isnan
Sandeep Singh created SPARK-41850: - Summary: Fix DataFrameReader.isnan Key: SPARK-41850 URL: https://issues.apache.org/jira/browse/SPARK-41850 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41849) Implement DataFrameReader.text
Sandeep Singh created SPARK-41849: - Summary: Implement DataFrameReader.text Key: SPARK-41849 URL: https://issues.apache.org/jira/browse/SPARK-41849 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41849: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". Plan: {code} > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > li
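For context, the classic PySpark usage the `input_file_name` doctest relies on, which is what `DataFrameReader.text` needs to provide on Connect (a sketch assuming a non-Connect local `spark` session; the file name is ours):

{code:java}
import os
import tempfile

from pyspark.sql.functions import input_file_name

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.txt")
    with open(path, "w") as f:
        f.write("hello\nworld\n")

    df = spark.read.text(path)  # AttributeError on Connect today
    df.select(input_file_name()).show(truncate=False)
{code}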
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error)
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Summary: DataFrame mapfield,structlist invalid type (was: DataFrame mapfield invalid type) > DataFrame mapfield,structlist invalid type > -- > > Key: SPARK-41847 > URL: https://issues.apache.org/jira/browse/SPARK-41847 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1270, in pyspark.sql.connect.functions.explode > Failed example: > eDF.select(explode(eDF.mapfield).alias("key", "value")).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > eDF.select(explode(eDF.mapfield).alias("key", "value")).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type > "STRUCT" while it's required to be "MAP". > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41847) DataFrame mapfield invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". Plan: {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist Failed example: df.withColumn("cd", cume_dist().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("cd", cume_dist().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py",
[jira] [Created] (SPARK-41847) DataFrame mapfield invalid type
Sandeep Singh created SPARK-41847: - Summary: DataFrame mapfield invalid type Key: SPARK-41847 URL: https://issues.apache.org/jira/browse/SPARK-41847 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? [`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist Failed example: df.withColumn("cd", cume_dist().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("cd", cume_dist().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. 
Did you mean one of the following? [`_1`] Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS cd#2205] +- Project [0#2200L AS _1#2202L] +- LocalRelation [0#2200L] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
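Both window failures above show the same root cause: the Connect client names the inferred column `_1` where classic PySpark names it `value`, so `Window.orderBy("value")` cannot resolve. A client-side sketch of a possible workaround (assuming an existing `spark` session): rename the column explicitly before building the window.

{code:java}
from pyspark.sql import Window
from pyspark.sql.functions import rank
from pyspark.sql.types import IntegerType

# toDF("value") pins the column name regardless of what inference produced.
df = spark.createDataFrame([1, 1, 2, 3, 3, 4], IntegerType()).toDF("value")
w = Window.orderBy("value")
df.withColumn("drank", rank().over(w)).show()
{code}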
[jira] [Updated] (SPARK-41846) DataFrame windowspec functions : unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41846: -- Summary: DataFrame windowspec functions : unresolved columns (was: DataFrame aggregation functions : unresolved columns) > DataFrame windowspec functions : unresolved columns > --- > > Key: SPARK-41846 > URL: https://issues.apache.org/jira/browse/SPARK-41846 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1098, in pyspark.sql.connect.functions.rank > Failed example: > df.withColumn("drank", rank().over(w)).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.withColumn("drank", rank().over(w)).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name > `value` cannot be resolved. Did you mean one of the following? 
[`_1`] > Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS > FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS drank#4003] > +- Project [0#3998L AS _1#4000L] > +- LocalRelation [0#3998L] {code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1032, in pyspark.sql.connect.functions.cume_dist > Failed example: > df.withColumn("cd", cume_dist().over(w)).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.withColumn("cd", cume_dist().over(w)).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name > `value` cannot be resolved. Did you mean one of the following? [`_1`] > Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC > NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), > currentrow$())) AS cd#2205] > +- Project [0#2200L AS _1#2202L] > +- LocalRelation [0#2200L] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...
[jira] [Updated] (SPARK-41846) DataFrame aggregation functions : unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41846: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? [`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist Failed example: df.withColumn("cd", cume_dist().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("cd", cume_dist().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS cd#2205] +- Project [0#2200L AS _1#2202L] +- LocalRelation [0#2200L] {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.sin