[jira] [Created] (SPARK-41839) Implement SparkSession.sparkContext
Sandeep Singh created SPARK-41839:
-------------------------------------

             Summary: Implement SparkSession.sparkContext
                 Key: SPARK-41839
                 URL: https://issues.apache.org/jira/browse/SPARK-41839
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh

{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2119, in pyspark.sql.connect.functions.unix_timestamp
Failed example:
    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    AttributeError: 'SparkSession' object has no attribute 'conf'
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
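The missing attribute above is an API-surface gap in the Connect client. As a rough illustration only (stand-in classes, not the real `pyspark.sql.connect` implementation), the failing doctest expects a `conf` property that returns a runtime-config object with `set`/`get`:

```python
# Hypothetical sketch of the API surface the doctest expects.
# RuntimeConf and SparkSessionStub are invented stand-ins; the real
# Connect client would forward set/get as config RPCs to the server
# instead of storing values in a local dict.

class RuntimeConf:
    def __init__(self):
        self._entries = {}

    def set(self, key: str, value: str) -> None:
        # store the config entry locally (the real client would RPC)
        self._entries[key] = value

    def get(self, key: str, default=None):
        return self._entries.get(key, default)


class SparkSessionStub:
    """Stand-in for pyspark.sql.connect.session.SparkSession."""

    @property
    def conf(self) -> RuntimeConf:
        # lazily create the conf facade on first access
        if not hasattr(self, "_conf"):
            self._conf = RuntimeConf()
        return self._conf


spark = SparkSessionStub()
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
print(spark.conf.get("spark.sql.session.timeZone"))  # America/Los_Angeles
```

With such a property in place, the `spark.conf.set(...)` line from the doctest no longer raises `AttributeError`.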
[jira] [Created] (SPARK-41840) DataFrame.show(): 'Column' object is not callable
Sandeep Singh created SPARK-41840:
-------------------------------------

             Summary: DataFrame.show(): 'Column' object is not callable
                 Key: SPARK-41840
                 URL: https://issues.apache.org/jira/browse/SPARK-41840
             Project: Spark
          Issue Type: Sub-task
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh

{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1472, in pyspark.sql.connect.functions.posexplode_outer
Failed example:
    df.select("id", "a_map", posexplode_outer("an_array")).show()
Expected:
    +---+----------+----+----+
    | id|     a_map| pos| col|
    +---+----------+----+----+
    |  1|{x -> 1.0}|   0| foo|
    |  1|{x -> 1.0}|   1| bar|
    |  2|        {}|null|null|
    |  3|      null|null|null|
    +---+----------+----+----+
Got:
    +---+------+----+----+
    | id| a_map| pos| col|
    +---+------+----+----+
    |  1| {1.0}|   0| foo|
    |  1| {1.0}|   1| bar|
    |  2|{null}|null|null|
    |  3|  null|null|null|
    +---+------+----+----+
{code}
[jira] [Updated] (SPARK-41840) DataFrame.show(): 'Column' object is not callable
[ https://issues.apache.org/jira/browse/SPARK-41840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41840:
----------------------------------
    Description: 
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 855, in pyspark.sql.connect.functions.first
Failed example:
    df.groupby("name").agg(first("age", ignorenulls=True)).orderBy("name").show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df.groupby("name").agg(first("age", ignorenulls=True)).orderBy("name").show()
    TypeError: 'Column' object is not callable
{code}

  was:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1472, in pyspark.sql.connect.functions.posexplode_outer
Failed example:
    df.select("id", "a_map", posexplode_outer("an_array")).show()
Expected:
    +---+----------+----+----+
    | id|     a_map| pos| col|
    +---+----------+----+----+
    |  1|{x -> 1.0}|   0| foo|
    |  1|{x -> 1.0}|   1| bar|
    |  2|        {}|null|null|
    |  3|      null|null|null|
    +---+----------+----+----+
Got:
    +---+------+----+----+
    | id| a_map| pos| col|
    +---+------+----+----+
    |  1| {1.0}|   0| foo|
    |  1| {1.0}|   1| bar|
    |  2|{null}|null|null|
    |  3|  null|null|null|
    +---+------+----+----+
{code}


> DataFrame.show(): 'Column' object is not callable
> -------------------------------------------------
>
>                 Key: SPARK-41840
>                 URL: https://issues.apache.org/jira/browse/SPARK-41840
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
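The `TypeError` above is the generic failure you get when a name that should be the `first()` *function* is in fact bound to a `Column` instance. A minimal pure-Python sketch of that failure mode (the `Column` class here is a hypothetical stand-in, not Spark code):

```python
# Stand-in for pyspark.sql.connect.column.Column: an object with no
# __call__, so invoking it raises the same TypeError as the doctest.

class Column:
    def __init__(self, name: str):
        self.name = name


# `first` is shadowed: it is now a Column object rather than a function.
first = Column("age")

try:
    first("age", ignorenulls=True)
except TypeError as e:
    # Python reports: 'Column' object is not callable
    print(e)
```

The fix in the Connect client is to make sure `first` resolves to the aggregate function, not to a `Column` value.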
[jira] [Commented] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs
[ https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653709#comment-17653709 ]

Apache Spark commented on SPARK-41804:
--------------------------------------

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/39349

> InterpretedUnsafeProjection doesn't properly handle an array of UDTs
> --------------------------------------------------------------------
>
>                 Key: SPARK-41804
>                 URL: https://issues.apache.org/jira/browse/SPARK-41804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> Reproduction steps:
> {noformat}
> // create a file of vector data
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> case class TestRow(varr: Array[Vector])
> val values = Array(0.1d, 0.2d, 0.3d)
> val dv = new DenseVector(values).asInstanceOf[Vector]
> val ds = Seq(TestRow(Array(dv, dv))).toDS
> ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data")
>
> // this works
> spark.read.format("parquet").load("vector_data").collect
>
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
>
> // this will get an error
> spark.read.format("parquet").load("vector_data").collect
> {noformat}
> The error varies each time you run it, e.g.:
> {noformat}
> Sparse vectors require that the dimension of the indices match the dimension of the values.
> You provided 2 indices and 6619240 values.
> {noformat}
> or
> {noformat}
> org.apache.spark.SparkRuntimeException: Error while decoding:
> java.lang.NegativeArraySizeException
> {noformat}
> or
> {noformat}
> java.lang.OutOfMemoryError: Java heap space
>   at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.toDoubleArray(UnsafeArrayData.java:414)
> {noformat}
> or
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGBUS (0xa) at pc=0x0001120c9d30, pid=64213, tid=0x1003
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_311-b11) (build 1.8.0_311-b11)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.311-b11 mixed mode bsd-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.dylib+0xc9d30]  acl_CopyRight+0x29
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # //hs_err_pid64213.log
> Compiled method (nm)  582142 11318     n 0       sun.misc.Unsafe::copyMemory (native)
>  total in heap  [0x00011efa8890,0x00011efa8be8] = 856
>  relocation     [0x00011efa89b8,0x00011efa89f8] = 64
>  main code      [0x00011efa8a00,0x00011efa8be8] = 488
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {noformat}
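The wildly varying symptoms above (a bogus value count, negative array sizes, OOM, SIGBUS) are the classic signature of reading fixed-layout binary data at the wrong offset or width. A loose pure-Python analogy using `struct` (this is an illustration of the failure class, not Spark's `UnsafeArrayData` internals):

```python
import struct

# Pack three doubles contiguously, the way a columnar buffer would.
values = [0.1, 0.2, 0.3]
buf = struct.pack("<3d", *values)

# Correct read: element 0 at offset 0 round-trips exactly.
ok = struct.unpack_from("<d", buf, 0)[0]

# Misaligned read: starting 4 bytes in splices together halves of two
# neighbouring doubles, yielding an arbitrary-looking garbage value --
# analogous to a projection that assumes the wrong element layout.
bad = struct.unpack_from("<d", buf, 4)[0]

print(ok)           # 0.1
print(bad != 0.1)   # True: the misread value is unrelated to the input
```

The same mechanism explains why the reported error differs on each run: whatever bytes happen to sit at the misread location are interpreted as sizes or pointers.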
[jira] [Assigned] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs
[ https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41804:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs
[ https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41804:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Reopened] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-41818:
----------------------------------

> Support DataFrameWriter.saveAsTable
> -----------------------------------
>
>                 Key: SPARK-41818
>                 URL: https://issues.apache.org/jira/browse/SPARK-41818
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 369, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
>     df.write.saveAsTable("tblA")
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[2]>", line 1, in <module>
>         df.write.saveAsTable("tblA")
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 350, in saveAsTable
>         self._spark.client.execute_command(self._write.command(self._spark.client))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command
>         self._execute(req)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute
>         self._handle_error(rpc_error)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: (java.lang.ClassNotFoundException) .DefaultSource{code}
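The `(java.lang.ClassNotFoundException) .DefaultSource` message suggests the data-source (format) name reaches the server as an empty string, so the usual `<provider>.DefaultSource` fallback class lookup degenerates to `.DefaultSource`. A hypothetical sketch of that string construction (the helper name is invented for illustration; it is not a real Spark function):

```python
# Illustrative only: mimics the "<provider>.DefaultSource" fallback
# pattern DataSource resolution uses when a format name is not a
# fully-resolved class. An empty provider yields ".DefaultSource",
# matching the ClassNotFoundException in the traceback above.

def fallback_class_name(provider: str) -> str:
    return f"{provider}.DefaultSource"


print(fallback_class_name("org.example.customsource"))  # org.example.customsource.DefaultSource
print(fallback_class_name(""))                          # .DefaultSource
```

So the likely client-side fix is to ensure `saveAsTable` sends either no source or a valid default, rather than an empty string.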
[jira] [Resolved] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41818.
----------------------------------
    Resolution: Fixed
[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-41818:
---------------------------------
        Parent:     (was: SPARK-41281)
    Issue Type: Bug  (was: Sub-task)
[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-41818:
---------------------------------
    Epic Link: SPARK-39375
[jira] [Updated] (SPARK-41817) SparkSession.read support reading with schema
[ https://issues.apache.org/jira/browse/SPARK-41817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-41817:
---------------------------------
        Parent:     (was: SPARK-41281)
    Issue Type: Bug  (was: Sub-task)

> SparkSession.read support reading with schema
> ---------------------------------------------
>
>                 Key: SPARK-41817
>                 URL: https://issues.apache.org/jira/browse/SPARK-41817
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 122, in pyspark.sql.connect.readwriter.DataFrameReader.load
> Failed example:
>     with tempfile.TemporaryDirectory() as d:
>         # Write a DataFrame into a CSV file with a header
>         df = spark.createDataFrame([{"age": 100, "name": "Hyukjin Kwon"}])
>         df.write.option("header", True).mode("overwrite").format("csv").save(d)
>
>         # Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon',
>         # and 'header' option set to `True`.
>         df = spark.read.load(
>             d, schema=df.schema, format="csv", nullValue="Hyukjin Kwon", header=True)
>         df.printSchema()
>         df.show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.readwriter.DataFrameReader.load[1]>", line 10, in <module>
>         df.printSchema()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1039, in printSchema
>         print(self._tree_string())
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1035, in _tree_string
>         query = self._plan.to_proto(self._session.client)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 92, in to_proto
>         plan.root.CopyFrom(self.plan(session))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 245, in plan
>         plan.read.data_source.schema = self.schema
>     TypeError: bad argument type for built-in operation {code}
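`TypeError: bad argument type for built-in operation` is the error protobuf raises when a non-string value is assigned to a string-typed field, and here a StructType object is assigned to `data_source.schema`. A sketch with a stand-in message class showing the failure and one plausible fix, serializing the schema to a string first (the class, its field enforcement, and the dict-based schema are all illustrative, not the actual generated proto or StructType):

```python
import json


class DataSourceProto:
    """Stand-in for the Connect Read.DataSource protobuf message:
    its schema field only accepts str, like a proto string field."""

    @property
    def schema(self) -> str:
        return self._schema

    @schema.setter
    def schema(self, value):
        if not isinstance(value, str):
            # mirrors protobuf's rejection of non-string assignments
            raise TypeError("bad argument type for built-in operation")
        self._schema = value


# StructType-like schema object (plain dict used for illustration)
struct_type = {"type": "struct",
               "fields": [{"name": "age", "type": "long"}]}

ds = DataSourceProto()
try:
    ds.schema = struct_type          # reproduces the TypeError
except TypeError as e:
    print(e)

ds.schema = json.dumps(struct_type)  # serialize first, then assign
print(ds.schema)
```

The corresponding client fix would be to convert the user-supplied schema to its string form (e.g. JSON or DDL) before setting the proto field.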
[jira] [Updated] (SPARK-41817) SparkSession.read support reading with schema
[ https://issues.apache.org/jira/browse/SPARK-41817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-41817:
---------------------------------
    Epic Link: SPARK-39375
[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-41818:
---------------------------------
    Epic Link:     (was: SPARK-39375)
[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-41818:
---------------------------------
        Parent: SPARK-41284
    Issue Type: Sub-task  (was: Bug)
[jira] [Updated] (SPARK-41817) SparkSession.read support reading with schema
[ https://issues.apache.org/jira/browse/SPARK-41817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41817: - Epic Link: (was: SPARK-39375) > SparkSession.read support reading with schema > - > > Key: SPARK-41817 > URL: https://issues.apache.org/jira/browse/SPARK-41817 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", > line 122, in pyspark.sql.connect.readwriter.DataFrameReader.load > Failed example: > with tempfile.TemporaryDirectory() as d: > # Write a DataFrame into a CSV file with a header > df = spark.createDataFrame([{"age": 100, "name": "Hyukjin Kwon"}]) > df.write.option("header", > True).mode("overwrite").format("csv").save(d) > # Read the CSV file as a DataFrame with 'nullValue' option set to > 'Hyukjin Kwon', > # and 'header' option set to `True`. > df = spark.read.load( > d, schema=df.schema, format="csv", nullValue="Hyukjin Kwon", > header=True) > df.printSchema() > df.show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "<doctest pyspark.sql.connect.readwriter.DataFrameReader.load[1]>", line 10, in <module> > df.printSchema() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1039, in printSchema > print(self._tree_string()) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1035, in _tree_string > query = self._plan.to_proto(self._session.client) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line > 92, in to_proto > plan.root.CopyFrom(self.plan(session)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line > 245, in plan > plan.read.data_source.schema = self.schema > TypeError: bad argument type for built-in operation {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
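The `TypeError: bad argument type for built-in operation` above is the shape of error protobuf raises when a non-string is assigned to a string-typed message field: `plan.read.data_source.schema` expects a DDL string, but `self.schema` is a structured schema object. A hedged plain-Python sketch of that failure mode and the string-conversion fix direction (no Spark required; `DataSourcePlan` and `schema_to_ddl` are illustrative stand-ins, not the actual pyspark.sql.connect code):

```python
# Stand-in for the Connect proto message: its `schema` field must be a str.
class DataSourcePlan:
    def __init__(self):
        self._schema = ""

    @property
    def schema(self):
        return self._schema

    @schema.setter
    def schema(self, value):
        if not isinstance(value, str):
            # protobuf string fields reject non-string assignments with TypeError
            raise TypeError("bad argument type for built-in operation")
        self._schema = value


def schema_to_ddl(fields):
    """Render [(name, type), ...] as a DDL string, e.g. 'age BIGINT, name STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)


plan = DataSourcePlan()
struct_schema = [("age", "BIGINT"), ("name", "STRING")]  # plays the StructType role
try:
    plan.schema = struct_schema            # reproduces the reported failure shape
except TypeError as exc:
    print("reproduced:", exc)

plan.schema = schema_to_ddl(struct_schema)  # the DDL string form is accepted
print(plan.schema)                          # age BIGINT, name STRING
```

The sketch only illustrates why the assignment fails; the real fix lives in `plan.py`, where the schema must be serialized before being placed into the proto.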
[jira] [Updated] (SPARK-41817) SparkSession.read support reading with schema
[ https://issues.apache.org/jira/browse/SPARK-41817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41817: - Parent: SPARK-41284 Issue Type: Sub-task (was: Bug) > SparkSession.read support reading with schema > - > > Key: SPARK-41817 > URL: https://issues.apache.org/jira/browse/SPARK-41817 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major
[jira] [Resolved] (SPARK-41659) Enable doctests in pyspark.sql.connect.readwriter
[ https://issues.apache.org/jira/browse/SPARK-41659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41659. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39331 [https://github.com/apache/spark/pull/39331] > Enable doctests in pyspark.sql.connect.readwriter > - > > Key: SPARK-41659 > URL: https://issues.apache.org/jira/browse/SPARK-41659 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41819) Implement Dataframe.rdd getNumPartitions
[ https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41819: - Epic Link: SPARK-39375 > Implement Dataframe.rdd getNumPartitions > > > Key: SPARK-41819 > URL: https://issues.apache.org/jira/browse/SPARK-41819 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 243, in pyspark.sql.connect.dataframe.DataFrame.coalesce > Failed example: > df.coalesce(1).rdd.getNumPartitions() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", > line 1, in > df.coalesce(1).rdd.getNumPartitions() > AttributeError: 'function' object has no attribute > 'getNumPartitions'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
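The `AttributeError: 'function' object has no attribute 'getNumPartitions'` above suggests that in the Connect DataFrame, `rdd` was exposed as a plain callable rather than a property, so `df.coalesce(1).rdd` evaluates to the callable itself. A plain-Python illustration of that failure shape (no Spark required; the class and method names are illustrative stand-ins, not the real pyspark.sql.connect API):

```python
# Minimal stand-in for an RDD: just enough surface for the example.
class FakeRDD:
    def getNumPartitions(self):
        return 1


class BrokenDataFrame:
    def rdd(self):            # accessed as `df.rdd` -> a callable, not an RDD
        return FakeRDD()


class FixedDataFrame:
    @property
    def rdd(self):            # accessed as `df.rdd` -> the RDD itself
        return FakeRDD()


try:
    BrokenDataFrame().rdd.getNumPartitions()
except AttributeError as exc:
    # the attribute is a callable object, which has no getNumPartitions
    print("reproduced:", exc)

print(FixedDataFrame().rdd.getNumPartitions())  # 1
```

Classic PySpark exposes `DataFrame.rdd` as a property, which is why the doctest expects `df.coalesce(1).rdd.getNumPartitions()` to chain directly.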
[jira] [Assigned] (SPARK-41659) Enable doctests in pyspark.sql.connect.readwriter
[ https://issues.apache.org/jira/browse/SPARK-41659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41659: Assignee: Hyukjin Kwon > Enable doctests in pyspark.sql.connect.readwriter > - > > Key: SPARK-41659 > URL: https://issues.apache.org/jira/browse/SPARK-41659 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41819) Implement Dataframe.rdd getNumPartitions
[ https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41819: - Parent: (was: SPARK-41281) Issue Type: Bug (was: Sub-task) > Implement Dataframe.rdd getNumPartitions > > > Key: SPARK-41819 > URL: https://issues.apache.org/jira/browse/SPARK-41819 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major
[jira] [Updated] (SPARK-41819) Implement Dataframe.rdd getNumPartitions
[ https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41819: - Parent: SPARK-41279 Issue Type: Sub-task (was: Bug) > Implement Dataframe.rdd getNumPartitions > > > Key: SPARK-41819 > URL: https://issues.apache.org/jira/browse/SPARK-41819 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major
[jira] [Updated] (SPARK-41819) Implement Dataframe.rdd getNumPartitions
[ https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41819: - Epic Link: (was: SPARK-39375) > Implement Dataframe.rdd getNumPartitions > > > Key: SPARK-41819 > URL: https://issues.apache.org/jira/browse/SPARK-41819 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major
[jira] [Updated] (SPARK-41828) Implement creating empty Dataframe
[ https://issues.apache.org/jira/browse/SPARK-41828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41828: - Epic Link: SPARK-39375 > Implement creating empty Dataframe > -- > > Key: SPARK-41828 > URL: https://issues.apache.org/jira/browse/SPARK-41828 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 99, in pyspark.sql.connect.dataframe.DataFrame.isEmpty > Failed example: > df_empty = spark.createDataFrame([], 'a STRING') > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", > line 1, in > df_empty = spark.createDataFrame([], 'a STRING') > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", > line 186, in createDataFrame > raise ValueError("Input data cannot be empty") > ValueError: Input data cannot be empty{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
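Classic PySpark accepts `spark.createDataFrame([], 'a STRING')` because an explicit schema removes the need for inference; the `ValueError` above shows the Connect session rejecting all empty input. A hedged plain-Python sketch of the intended behavior (no Spark required; `create_dataframe` is an illustrative stand-in, not the actual session API):

```python
def create_dataframe(data, schema=None):
    """Return (rows, schema); mimics only the empty-input handling."""
    if not data and schema is None:
        # only truly ambiguous input is rejected: no rows AND no schema
        raise ValueError("can not infer schema from empty dataset")
    return list(data), schema


rows, schema = create_dataframe([], "a STRING")
print(rows, schema)  # [] a STRING
```

The point is that emptiness alone is not an error; it is only an error when there is nothing to infer a schema from.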
[jira] [Updated] (SPARK-41828) Implement creating empty Dataframe
[ https://issues.apache.org/jira/browse/SPARK-41828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41828: - Parent: (was: SPARK-41279) Issue Type: Bug (was: Sub-task) > Implement creating empty Dataframe > -- > > Key: SPARK-41828 > URL: https://issues.apache.org/jira/browse/SPARK-41828 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major
[jira] [Updated] (SPARK-41828) Implement creating empty Dataframe
[ https://issues.apache.org/jira/browse/SPARK-41828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41828: - Parent: SPARK-41281 Issue Type: Sub-task (was: Bug) > Implement creating empty Dataframe > -- > > Key: SPARK-41828 > URL: https://issues.apache.org/jira/browse/SPARK-41828 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major
[jira] [Updated] (SPARK-41828) Implement creating empty Dataframe
[ https://issues.apache.org/jira/browse/SPARK-41828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41828: - Epic Link: (was: SPARK-39375) > Implement creating empty Dataframe > -- > > Key: SPARK-41828 > URL: https://issues.apache.org/jira/browse/SPARK-41828 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major
[jira] [Assigned] (SPARK-41835) Implement `transform_keys` function
[ https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41835: Assignee: (was: Ruifeng Zheng) > Implement `transform_keys` function > --- > > Key: SPARK-41835 > URL: https://issues.apache.org/jira/browse/SPARK-41835 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41835) Implement `transform_keys` function
[ https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653713#comment-17653713 ] Hyukjin Kwon commented on SPARK-41835: -- test output? > Implement `transform_keys` function > --- > > Key: SPARK-41835 > URL: https://issues.apache.org/jira/browse/SPARK-41835 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41839) Implement SparkSession.sparkContext
[ https://issues.apache.org/jira/browse/SPARK-41839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653715#comment-17653715 ] Hyukjin Kwon commented on SPARK-41839: -- test output? > Implement SparkSession.sparkContext > --- > > Key: SPARK-41839 > URL: https://issues.apache.org/jira/browse/SPARK-41839 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41836) Implement `transform_values` function
[ https://issues.apache.org/jira/browse/SPARK-41836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653714#comment-17653714 ] Hyukjin Kwon commented on SPARK-41836: -- test output? > Implement `transform_values` function > - > > Key: SPARK-41836 > URL: https://issues.apache.org/jira/browse/SPARK-41836 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41803) log() function variations are missing
[ https://issues.apache.org/jira/browse/SPARK-41803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41803: Assignee: Martin Grund > log() function variations are missing > - > > Key: SPARK-41803 > URL: https://issues.apache.org/jira/browse/SPARK-41803 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41803) log() function variations are missing
[ https://issues.apache.org/jira/browse/SPARK-41803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41803: Assignee: Ruifeng Zheng (was: Martin Grund) > log() function variations are missing > - > > Key: SPARK-41803 > URL: https://issues.apache.org/jira/browse/SPARK-41803 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41803) log() function variations are missing
[ https://issues.apache.org/jira/browse/SPARK-41803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41803. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39339 [https://github.com/apache/spark/pull/39339] > log() function variations are missing > - > > Key: SPARK-41803 > URL: https://issues.apache.org/jira/browse/SPARK-41803 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41659) Enable doctests in pyspark.sql.connect.readwriter
[ https://issues.apache.org/jira/browse/SPARK-41659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41659: Assignee: Sandeep Singh (was: Hyukjin Kwon) > Enable doctests in pyspark.sql.connect.readwriter > - > > Key: SPARK-41659 > URL: https://issues.apache.org/jira/browse/SPARK-41659 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41655) Enable doctests in pyspark.sql.connect.column
[ https://issues.apache.org/jira/browse/SPARK-41655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41655: Assignee: Sandeep Singh (was: Hyukjin Kwon) > Enable doctests in pyspark.sql.connect.column > - > > Key: SPARK-41655 > URL: https://issues.apache.org/jira/browse/SPARK-41655 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41654) Enable doctests in pyspark.sql.connect.window
[ https://issues.apache.org/jira/browse/SPARK-41654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41654: Assignee: Sandeep Singh (was: Hyukjin Kwon) > Enable doctests in pyspark.sql.connect.window > - > > Key: SPARK-41654 > URL: https://issues.apache.org/jira/browse/SPARK-41654 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41653) Test parity: enable doctests in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-41653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41653: Assignee: Sandeep Singh (was: Hyukjin Kwon) > Test parity: enable doctests in Spark Connect > - > > Key: SPARK-41653 > URL: https://issues.apache.org/jira/browse/SPARK-41653 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > > We should actually run the doctests of Spark Connect. > We should add something like > https://github.com/apache/spark/blob/master/python/pyspark/sql/column.py#L1227-L1247 > to Spark Connect modules, and add the module into > https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L507 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
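The umbrella issue above describes adding a doctest entry point to each Spark Connect module (following the pattern in python/pyspark/sql/column.py) and registering the module in dev/sparktestsupport/modules.py. A hedged sketch of what such an entry point looks like; `column_alias` is a toy stand-in with a doctest, not real Connect code:

```python
import doctest
import sys


def column_alias(name: str, alias: str) -> str:
    """Illustrative helper carrying a doctest.

    >>> column_alias("age", "years")
    'age AS years'
    """
    return f"{name} AS {alias}"


def _test() -> None:
    # doctest.testmod runs every docstring example in this module and
    # returns TestResults(failed, attempted).
    results = doctest.testmod(sys.modules[__name__])
    if results.failed:
        sys.exit(-1)


if __name__ == "__main__":
    _test()
```

Running the module directly then executes its own doctests, which is what lets the dev test harness pick the module up once it is listed in modules.py.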
[jira] [Assigned] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs
[ https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41804: Assignee: Bruce Robbins > InterpretedUnsafeProjection doesn't properly handle an array of UDTs > > > Key: SPARK-41804 > URL: https://issues.apache.org/jira/browse/SPARK-41804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > > Reproduction steps: > {noformat} > // create a file of vector data > import org.apache.spark.ml.linalg.{DenseVector, Vector} > case class TestRow(varr: Array[Vector]) > val values = Array(0.1d, 0.2d, 0.3d) > val dv = new DenseVector(values).asInstanceOf[Vector] > val ds = Seq(TestRow(Array(dv, dv))).toDS > ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data") > // this works > spark.read.format("parquet").load("vector_data").collect > sql("set spark.sql.codegen.wholeStage=false") > sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") > // this will get an error > spark.read.format("parquet").load("vector_data").collect > {noformat} > The error varies each time you run it, e.g.: > {noformat} > Sparse vectors require that the dimension of the indices match the dimension > of the values. > You provided 2 indices and 6619240 values. 
> {noformat} > or > {noformat} > org.apache.spark.SparkRuntimeException: Error while decoding: > java.lang.NegativeArraySizeException > {noformat} > or > {noformat} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.toDoubleArray(UnsafeArrayData.java:414) > {noformat} > or > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGBUS (0xa) at pc=0x0001120c9d30, pid=64213, tid=0x1003 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_311-b11) (build > 1.8.0_311-b11) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.311-b11 mixed mode bsd-amd64 > compressed oops) > # Problematic frame: > # V [libjvm.dylib+0xc9d30] acl_CopyRight+0x29 > # > # Failed to write core dump. Core dumps have been disabled. To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # An error report file with more information is saved as: > # //hs_err_pid64213.log > Compiled method (nm) 582142 11318 n 0 sun.misc.Unsafe::copyMemory > (native) > total in heap [0x00011efa8890,0x00011efa8be8] = 856 > relocation [0x00011efa89b8,0x00011efa89f8] = 64 > main code [0x00011efa8a00,0x00011efa8be8] = 488 > Compiled method (nm) 582142 11318 n 0 sun.misc.Unsafe::copyMemory > (native) > total in heap [0x00011efa8890,0x00011efa8be8] = 856 > relocation [0x00011efa89b8,0x00011efa89f8] = 64 > main code [0x00011efa8a00,0x00011efa8be8] = 488 > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > # > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs
[ https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41804. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39349 [https://github.com/apache/spark/pull/39349] > InterpretedUnsafeProjection doesn't properly handle an array of UDTs > > > Key: SPARK-41804 > URL: https://issues.apache.org/jira/browse/SPARK-41804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Fix For: 3.4.0
[jira] [Created] (SPARK-41841) Support PyPI packaging without JVM
Hyukjin Kwon created SPARK-41841: Summary: Support PyPI packaging without JVM Key: SPARK-41841 URL: https://issues.apache.org/jira/browse/SPARK-41841 Project: Spark Issue Type: Sub-task Components: Build, Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon We should support pip install pyspark without the JVM so Spark Connect can be a truly lightweight library.
[jira] [Commented] (SPARK-41835) Implement `transform_keys` function
[ https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653722#comment-17653722 ] Ruifeng Zheng commented on SPARK-41835: --- this function was added > Implement `transform_keys` function > --- > > Key: SPARK-41835 > URL: https://issues.apache.org/jira/browse/SPARK-41835 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41656) Enable doctests in pyspark.sql.connect.dataframe
[ https://issues.apache.org/jira/browse/SPARK-41656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41656: Assignee: Sandeep Singh > Enable doctests in pyspark.sql.connect.dataframe > > > Key: SPARK-41656 > URL: https://issues.apache.org/jira/browse/SPARK-41656 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41656) Enable doctests in pyspark.sql.connect.dataframe
[ https://issues.apache.org/jira/browse/SPARK-41656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41656. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39346 [https://github.com/apache/spark/pull/39346] > Enable doctests in pyspark.sql.connect.dataframe > > > Key: SPARK-41656 > URL: https://issues.apache.org/jira/browse/SPARK-41656 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)
Sandeep Singh created SPARK-41842: - Summary: Support data type Timestamp(NANOSECOND, null) Key: SPARK-41842 URL: https://issues.apache.org/jira/browse/SPARK-41842 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 99, in pyspark.sql.connect.dataframe.DataFrame.isEmpty Failed example: df_empty = spark.createDataFrame([], 'a STRING') Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df_empty = spark.createDataFrame([], 'a STRING') File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 186, in createDataFrame raise ValueError("Input data cannot be empty") ValueError: Input data cannot be empty{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)
[ https://issues.apache.org/jira/browse/SPARK-41842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41842: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1966, in pyspark.sql.connect.functions.hour Failed example: df.select(hour('ts').alias('hour')).collect() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(hour('ts').alias('hour')).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1017, in collect pdf = self.toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: Timestamp(NANOSECOND, null){code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 99, in pyspark.sql.connect.dataframe.DataFrame.isEmpty Failed example: df_empty = spark.createDataFrame([], 'a STRING') Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run 
exec(compile(example.source, filename, "single", File "", line 1, in df_empty = spark.createDataFrame([], 'a STRING') File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 186, in createDataFrame raise ValueError("Input data cannot be empty") ValueError: Input data cannot be empty{code} > Support data type Timestamp(NANOSECOND, null) > - > > Key: SPARK-41842 > URL: https://issues.apache.org/jira/browse/SPARK-41842 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1966, in pyspark.sql.connect.functions.hour > Failed example: > df.select(hour('ts').alias('hour')).collect() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.select(hour('ts').alias('hour')).collect() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1017, in collect > pdf = self.toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 623, in _handle_error > raise SparkConnectException(status.message, info.reason) from None > pyspark.sql.connect.client.SparkConnectException: > (org.apache.spark.SparkUnsupportedOperationException) 
Unsupported data type: > Timestamp(NANOSECOND, null){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)
[ https://issues.apache.org/jira/browse/SPARK-41842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653728#comment-17653728 ] Sandeep Singh commented on SPARK-41842: --- Not sure about the EPIC for this one. > Support data type Timestamp(NANOSECOND, null) > - > > Key: SPARK-41842 > URL: https://issues.apache.org/jira/browse/SPARK-41842 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1966, in pyspark.sql.connect.functions.hour > Failed example: > df.select(hour('ts').alias('hour')).collect() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.select(hour('ts').alias('hour')).collect() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1017, in collect > pdf = self.toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 623, in _handle_error > raise SparkConnectException(status.message, info.reason) from None > pyspark.sql.connect.client.SparkConnectException: > (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: > Timestamp(NANOSECOND, null){code} -- This message 
was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41843) Implement SparkSession.udf
Sandeep Singh created SPARK-41843: - Summary: Implement SparkSession.udf Key: SPARK-41843 URL: https://issues.apache.org/jira/browse/SPARK-41843 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1966, in pyspark.sql.connect.functions.hour Failed example: df.select(hour('ts').alias('hour')).collect() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(hour('ts').alias('hour')).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1017, in collect pdf = self.toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: Timestamp(NANOSECOND, null){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41843) Implement SparkSession.udf
[ https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41843: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2331, in pyspark.sql.connect.functions.call_udf Failed example: _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) AttributeError: 'SparkSession' object has no attribute 'udf'{code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1966, in pyspark.sql.connect.functions.hour Failed example: df.select(hour('ts').alias('hour')).collect() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(hour('ts').alias('hour')).collect() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1017, in collect pdf = self.toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error raise SparkConnectException(status.message, 
info.reason) from None pyspark.sql.connect.client.SparkConnectException: (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: Timestamp(NANOSECOND, null){code} > Implement SparkSession.udf > -- > > Key: SPARK-41843 > URL: https://issues.apache.org/jira/browse/SPARK-41843 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 2331, in pyspark.sql.connect.functions.call_udf > Failed example: > _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType()) > AttributeError: 'SparkSession' object has no attribute 'udf'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
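For reference, the API surface SPARK-41843 asks Spark Connect to provide is the classic PySpark `spark.udf` registry. A plain-Python sketch of its register/call shape, assuming the classic semantics; the class below is illustrative only and is not Spark code:

```python
class UDFRegistration:
    """Toy model of the registration surface behind `spark.udf`."""

    def __init__(self):
        self._registry = {}

    def register(self, name, f, returnType=None):
        # Classic PySpark returns the registered function so it can also be
        # used directly as a column expression.
        self._registry[name] = (f, returnType)
        return f

    def call_udf(self, name, *args):
        # Rough analogue of pyspark.sql.functions.call_udf("intX2", "id").
        f, _ = self._registry[name]
        return f(*args)

udf = UDFRegistration()
udf.register("intX2", lambda i: i * 2, "int")
udf.call_udf("intX2", 21)  # 42
```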
[jira] [Updated] (SPARK-41835) Implement `transform_keys` function
[ https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41835: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1611, in pyspark.sql.connect.functions.transform_keys Failed example: df.select(transform_keys( "data", lambda k, _: upper(k)).alias("data_upper") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(transform_keys( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 1 requires the "MAP" type, however "data" has the type "STRUCT". 
Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496] +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493] +- LocalRelation [0#4488L, 1#4489] {code} > Implement `transform_keys` function > --- > > Key: SPARK-41835 > URL: https://issues.apache.org/jira/browse/SPARK-41835 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1611, in pyspark.sql.connect.functions.transform_keys > Failed example: > df.select(transform_keys( > "data", lambda k, _: upper(k)).alias("data_upper") > ).show(truncate=False) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.select(transform_keys( > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > 
pyspark.sql.connect.client.SparkConnectAnalysisException: > [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve > "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data > type mismatch: Parameter 1 requires the "MAP" type, however "data" has the > type "STRUCT". > Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda > 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496] > +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493] > +- LocalRelation [0#4488L, 1#4489] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) ---
[jira] [Commented] (SPARK-41835) Implement `transform_keys` function
[ https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653731#comment-17653731 ] Sandeep Singh commented on SPARK-41835: --- My bad, error is about expected input types. > Implement `transform_keys` function > --- > > Key: SPARK-41835 > URL: https://issues.apache.org/jira/browse/SPARK-41835 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1611, in pyspark.sql.connect.functions.transform_keys > Failed example: > df.select(transform_keys( > "data", lambda k, _: upper(k)).alias("data_upper") > ).show(truncate=False) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.select(transform_keys( > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise 
SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve > "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data > type mismatch: Parameter 1 requires the "MAP" type, however "data" has the > type "STRUCT". > Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda > 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496] > +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493] > +- LocalRelation [0#4488L, 1#4489] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
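As the comment above notes, the DATATYPE_MISMATCH here is about input types: `transform_keys` requires a MAP column, but `data` was inferred as a STRUCT. When the input really is a map, the function applies the lambda to each key. A pure-dict sketch of that semantics (not Spark code, just the contract):

```python
def transform_keys(m: dict, f):
    # Models SQL transform_keys(map, (k, v) -> new_key): f maps each
    # (key, value) pair to a new key; values pass through unchanged.
    return {f(k, v): v for k, v in m.items()}

transform_keys({"foo": -2.0, "bar": 2.0}, lambda k, _: k.upper())
# {'FOO': -2.0, 'BAR': 2.0}
```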
[jira] [Updated] (SPARK-41844) Implement `intX2` function
[ https://issues.apache.org/jira/browse/SPARK-41844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41844: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2332, in pyspark.sql.connect.functions.call_udf Failed example: df.select(call_udf("intX2", "id")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(call_udf("intX2", "id")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`]. 
Plan: {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1611, in pyspark.sql.connect.functions.transform_keys Failed example: df.select(transform_keys( "data", lambda k, _: upper(k)).alias("data_upper") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(transform_keys( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 1 requires the "MAP" type, however "data" has the type "STRUCT". 
Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496] +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493] +- LocalRelation [0#4488L, 1#4489] {code} > Implement `intX2` function > -- > > Key: SPARK-41844 > URL: https://issues.apache.org/jira/browse/SPARK-41844 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 2332, in pyspark.sql.connect.functions.call_udf > Failed example: > df.select(call_udf("intX2", "id")).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.select(call_udf("intX2", "id")).show() > File > "/Users/s
[jira] [Created] (SPARK-41844) Implement `intX2` function
Sandeep Singh created SPARK-41844: - Summary: Implement `intX2` function Key: SPARK-41844 URL: https://issues.apache.org/jira/browse/SPARK-41844 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Sandeep Singh Fix For: 3.4.0 {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1611, in pyspark.sql.connect.functions.transform_keys Failed example: df.select(transform_keys( "data", lambda k, _: upper(k)).alias("data_upper") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(transform_keys( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 1 requires the "MAP" type, however "data" has the type "STRUCT". 
Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496] +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493] +- LocalRelation [0#4488L, 1#4489] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41844) Implement `intX2` function
[ https://issues.apache.org/jira/browse/SPARK-41844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh resolved SPARK-41844. --- Resolution: Invalid > Implement `intX2` function > -- > > Key: SPARK-41844 > URL: https://issues.apache.org/jira/browse/SPARK-41844 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 2332, in pyspark.sql.connect.functions.call_udf > Failed example: > df.select(call_udf("intX2", "id")).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.select(call_udf("intX2", "id")).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search 
path > [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41845) Fix `count(expr("*"))` function
Sandeep Singh created SPARK-41845: - Summary: Fix `count(expr("*"))` function Key: SPARK-41845 URL: https://issues.apache.org/jira/browse/SPARK-41845 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Sandeep Singh Fix For: 3.4.0 {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2332, in pyspark.sql.connect.functions.call_udf Failed example: df.select(call_udf("intX2", "id")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(call_udf("intX2", "id")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`]. 
Plan: {code}
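The `UNRESOLVED_ROUTINE` error above means the name `intX2` was never registered in the session that the Connect client's plan is analyzed against. The lookup itself is just name-based resolution in a session registry; a minimal pure-Python sketch of that mechanism (the registry and function names here are illustrative, not Spark internals):

```python
# Sketch: name-based UDF resolution, analogous to how
# call_udf("intX2", ...) looks up a function registered via spark.udf.register.
_udf_registry = {}

def register(name, fn):
    """Record fn under name in the session-level registry."""
    _udf_registry[name] = fn

def call_udf(name, *args):
    """Resolve name in the registry and apply it; unknown names fail."""
    if name not in _udf_registry:
        raise ValueError(f"[UNRESOLVED_ROUTINE] Cannot resolve function `{name}`")
    return _udf_registry[name](*args)

register("intX2", lambda x: x * 2)
print(call_udf("intX2", 21))  # 42
```

The doctest registers `intX2` on the classic `SparkSession`, so the failure suggests the registration never reached the server-side session used by Connect.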
[jira] [Updated] (SPARK-41845) Fix `count(expr("*"))` function
[ https://issues.apache.org/jira/browse/SPARK-41845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41845: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count Failed example: df.select(count(expr("*")), count(df.alphabets)).show() Expected: +++ |count(1)|count(alphabets)| +++ | 4| 3| +++ Got: +++ |count(alphabets)|count(alphabets)| +++ | 3| 3| +++ {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2332, in pyspark.sql.connect.functions.call_udf Failed example: df.select(call_udf("intX2", "id")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(call_udf("intX2", "id")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path 
[`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`]. Plan: {code} > Fix `count(expr("*"))` function > --- > > Key: SPARK-41845 > URL: https://issues.apache.org/jira/browse/SPARK-41845 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 801, in pyspark.sql.connect.functions.count > Failed example: > df.select(count(expr("*")), count(df.alphabets)).show() > Expected: > +++ > |count(1)|count(alphabets)| > +++ > | 4| 3| > +++ > Got: > +++ > |count(alphabets)|count(alphabets)| > +++ > | 3| 3| > +++ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
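The Expected/Got mismatch above comes down to null handling: `count(expr("*"))` should behave like `count(1)` and count every row, but the Connect client rewrote it into a column count, which skips nulls. A plain-Python sketch of the two semantics (data values are illustrative):

```python
# count(*) vs count(col): the former counts rows, the latter non-null values.
alphabets = ["a", "b", None, "c"]  # 4 rows, one of them null

count_star = len(alphabets)                        # count(1) semantics: 4
count_col = sum(v is not None for v in alphabets)  # count(alphabets): 3

print(count_star, count_col)  # 4 3
```

This matches the doctest output: the correct plan yields `count(1) = 4`, while the buggy rewrite yields `count(alphabets) = 3` twice.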
[jira] [Resolved] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh resolved SPARK-41823. --- Resolution: Duplicate > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] 
Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code}
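The ambiguity arises because an equality-condition join keeps both inputs' `name` columns, so a later `drop('name')` has two candidates. A pure-Python sketch of the difference between joining on a condition (both key columns survive) and joining on a column name (key deduplicated); rows are modeled as `(column, value)` pair lists precisely because dicts cannot hold duplicate column names, and all names here are illustrative:

```python
def join_rows(left, right, key, dedup_key=False):
    """Join rows (lists of (column, value) pairs) on equal key values.

    dedup_key=False mimics df1.join(df2, df1.name == df2.name): both key
    columns survive, so the result has two columns called `key`.
    dedup_key=True mimics df1.join(df2, on="name"): the key appears once.
    """
    out = []
    for l in left:
        for r in right:
            if dict(l)[key] == dict(r)[key]:
                rcols = [(c, v) for c, v in r if not (dedup_key and c == key)]
                out.append(l + rcols)
    return out

df1 = [[("name", "Alice"), ("age", 2)]]
df2 = [[("name", "Alice"), ("height", 80)]]

ambiguous = join_rows(df1, df2, "name")             # columns: name, age, name, height
safe = join_rows(df1, df2, "name", dedup_key=True)  # columns: name, age, height
```

The usual PySpark workaround is the second form, `df.join(df2, on='name')`, which the doctest avoids and classic PySpark handles by resolving `drop('name')` against the surviving column.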
[jira] [Resolved] (SPARK-37521) insert overwrite table but the partition information stored in Metastore was not changed
[ https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jingxiong zhong resolved SPARK-37521. - Resolution: Won't Fix > insert overwrite table but the partition information stored in Metastore was > not changed > > > Key: SPARK-37521 > URL: https://issues.apache.org/jira/browse/SPARK-37521 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: spark3.2.0 > hive2.3.9 > metastore2.3.9 >Reporter: jingxiong zhong >Priority: Major > > I create a partitioned table in SparkSQL, insert a data entry, add a regular > field, and finally insert a new data entry into the partition. The query is > normal in SparkSQL, but the return value of the newly added field is NULL > in Hive 2.3.9. > For example: > create table updata_col_test1(a int) partitioned by (dt string); > insert overwrite table updata_col_test1 partition(dt='20200101') values(1); > insert overwrite table updata_col_test1 partition(dt='20200102') values(1); > insert overwrite table updata_col_test1 partition(dt='20200103') values(1); > alter table updata_col_test1 add columns (b int); > insert overwrite table updata_col_test1 partition(dt) values(1, 2, > '20200101'); -- fails > insert overwrite table updata_col_test1 partition(dt='20200101') values(1, > 2); -- fails > insert overwrite table updata_col_test1 partition(dt='20200104') values(1, > 2); -- succeeds
[jira] [Commented] (SPARK-37677) spark on k8s, when the user wants to push python3.6.6.zip to the pod, but has no permission to execute
[ https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653733#comment-17653733 ] jingxiong zhong commented on SPARK-37677: - I have fixed this in Hadoop (targeted for the 3.3.5 release), but that version has not shipped yet; Spark will need to upgrade its Hadoop dependency to pick up the fix. [~valux] > spark on k8s, when the user wants to push python3.6.6.zip to the pod, but has no > permission to execute > -- > > Key: SPARK-37677 > URL: https://issues.apache.org/jira/browse/SPARK-37677 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: jingxiong zhong >Priority: Major > > In cluster mode, I have another question: when I unzip python3.6.6.zip in the > pod, it has no permission to execute. My submit command is as follows: > {code:sh} > spark-submit \ > --archives ./python3.6.6.zip#python3.6.6 \ > --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \ > --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > ./examples/src/main/python/pi.py 100 > {code}
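The lost execute bit is a zip-format handling issue: Python's `zipfile.ZipFile.extract` does not restore POSIX mode bits, so an interpreter unpacked from an archive comes out non-executable. A self-contained sketch of recording the mode in `ZipInfo.external_attr` at pack time and restoring it after extraction (the helper names are mine, not Spark's or Hadoop's):

```python
import io
import os
import zipfile

def zip_with_mode(name: str, payload: bytes, mode: int) -> bytes:
    """Write a single zip entry, recording its POSIX mode in external_attr."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo(name)
        info.external_attr = (mode & 0xFFFF) << 16  # upper 16 bits hold the mode
        zf.writestr(info, payload)
    return buf.getvalue()

def extract_restoring_mode(data: bytes, dest: str) -> str:
    """Extract the first entry, then chmod it back to the recorded mode,
    since ZipFile.extract alone ignores external_attr."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        info = zf.infolist()[0]
        path = zf.extract(info, dest)
        mode = (info.external_attr >> 16) & 0xFFFF
        if mode:
            os.chmod(path, mode)
    return path
```

Without the `chmod` step, `bin/python3` inside the extracted archive has no execute permission, matching the failure described above; a `chmod +x` after unpacking is the usual workaround until the unarchiving code restores modes itself.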
[jira] [Created] (SPARK-41846) DataFrame aggregation functions: unresolved columns
Sandeep Singh created SPARK-41846: - Summary: DataFrame aggregation functions : unresolved columns Key: SPARK-41846 URL: https://issues.apache.org/jira/browse/SPARK-41846 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code} File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in pyspark.sql.connect.column.Column.eqNullSafe Failed example: df1.join(df2, df1["value"] == df2["value"]).count() Exception raised: Traceback (most recent call last): File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 1336, in __run exec(compile(example.source, filename, "single", File "", line 1, in df1.join(df2, df1["value"] == df2["value"]).count() File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in count pdd = self.agg(_invoke_function("count", lit(1))).toPandas() File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, `value`]. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled
[ https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-39853: - Fix Version/s: 3.4.0 > Support stage level schedule for standalone cluster when dynamic allocation > is disabled > --- > > Key: SPARK-39853 > URL: https://issues.apache.org/jira/browse/SPARK-39853 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: huangtengfei >Assignee: huangtengfei >Priority: Major > Fix For: 3.4.0 > > > [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level > scheduling support for standalone clusters when dynamic allocation is > enabled: Spark requests executors for each distinct resource profile. > When dynamic allocation is disabled, we can still leverage stage-level > scheduling to place tasks, based on their resource profiles (task resource > requests), onto executors with the default resource profile.
[jira] [Updated] (SPARK-41846) DataFrame aggregation functions: unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41846: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} was: {code} File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in pyspark.sql.connect.column.Column.eqNullSafe Failed example: df1.join(df2, df1["value"] == df2["value"]).count() Exception raised: Traceback (most recent call last): File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 1336, in __run exec(compile(example.source, filename, "single", File "", line 1, in df1.join(df2, df1["value"] == df2["value"]).count() File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in count pdd = self.agg(_invoke_function("count", lit(1))).toPandas() File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, `value`]. 
{code} > DataFrame aggregation functions : unresolved columns > > > Key: SPARK-41846 > URL: https://issues.apache.org/jira/browse/SPARK-41846 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1098, in pyspark.sql.connect.functions.rank > Failed example: > df.withColumn("drank", rank().over(w)).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.withColumn("drank", rank().over(w)).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File
[jira] [Updated] (SPARK-41846) DataFrame aggregation functions: unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41846: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist Failed example: df.withColumn("cd", cume_dist().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("cd", cume_dist().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS cd#2205] +- Project [0#2200L AS _1#2202L] +- LocalRelation [0#2200L] {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.sin
[jira] [Updated] (SPARK-41846) DataFrame windowspec functions: unresolved columns
[ https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41846: -- Summary: DataFrame windowspec functions : unresolved columns (was: DataFrame aggregation functions : unresolved columns) > DataFrame windowspec functions : unresolved columns > --- > > Key: SPARK-41846 > URL: https://issues.apache.org/jira/browse/SPARK-41846 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1098, in pyspark.sql.connect.functions.rank > Failed example: > df.withColumn("drank", rank().over(w)).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.withColumn("drank", rank().over(w)).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > 
pyspark.sql.connect.client.SparkConnectAnalysisException: > [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name > `value` cannot be resolved. Did you mean one of the following? [`_1`] > Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS > FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS drank#4003] > +- Project [0#3998L AS _1#4000L] > +- LocalRelation [0#3998L] {code} > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1032, in pyspark.sql.connect.functions.cume_dist > Failed example: > df.withColumn("cd", cume_dist().over(w)).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.withColumn("cd", cume_dist().over(w)).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or 
function parameter with name > `value` cannot be resolved. Did you mean one of the following? [`_1`] > Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC > NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), > currentrow$())) AS cd#2205] > +- Project [0#2200L AS _1#2202L] > +- LocalRelation [0#2200L] {code}
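The doctest failures above are resolution bugs (the window is ordered by a column named `value` that the Connect plan only exposes as `_1`), not bugs in the window functions themselves. For reference, the semantics of the two functions involved can be sketched over a single ordered partition in plain Python (the function names mirror the SQL semantics, not Spark internals):

```python
import bisect

def rank(values):
    """SQL RANK(): tied values share a rank, and ranks skip after ties."""
    ordered = sorted(values)
    first_pos = {}
    for i, v in enumerate(ordered):
        first_pos.setdefault(v, i + 1)  # 1-based position of first occurrence
    return [first_pos[v] for v in values]

def cume_dist(values):
    """SQL CUME_DIST(): fraction of rows with value <= the current row's."""
    n = len(values)
    ordered = sorted(values)
    return [bisect.bisect_right(ordered, v) / n for v in values]

print(rank([1, 1, 2, 3]))       # [1, 1, 3, 4]
print(cume_dist([1, 2, 2, 3]))  # [0.25, 0.75, 0.75, 1.0]
```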
[jira] [Created] (SPARK-41847) DataFrame mapfield invalid type
Sandeep Singh created SPARK-41847: - Summary: DataFrame mapfield invalid type Key: SPARK-41847 URL: https://issues.apache.org/jira/browse/SPARK-41847 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist Failed example: df.withColumn("cd", cume_dist().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("cd", cume_dist().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS cd#2205] +- Project [0#2200L AS _1#2202L] +- LocalRelation [0#2200L] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41847) DataFrame mapfield invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1098, in pyspark.sql.connect.functions.rank Failed example: df.withColumn("drank", rank().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("drank", rank().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `value` cannot be resolved. Did you mean one of the following? 
[`_1`] Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS drank#4003] +- Project [0#3998L AS _1#4000L] +- LocalRelation [0#3998L] {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1032, in pyspark.sql.connect.functions.cume_dist Failed example: df.withColumn("cd", cume_dist().over(w)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.withColumn("cd", cume_dist().over(w)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py",
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Summary: DataFrame mapfield,structlist invalid type (was: DataFrame mapfield invalid type) > DataFrame mapfield,structlist invalid type > -- > > Key: SPARK-41847 > URL: https://issues.apache.org/jira/browse/SPARK-41847 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1270, in pyspark.sql.connect.functions.explode > Failed example: > eDF.select(explode(eDF.mapfield).alias("key", "value")).show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > eDF.select(explode(eDF.mapfield).alias("key", "value")).show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > 
pyspark.sql.connect.client.SparkConnectAnalysisException: > [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type > "STRUCT" while it's required to be "MAP". > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error)
[jira] [Created] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile
wuyi created SPARK-41848: Summary: Tasks are over-scheduled with TaskResourceProfile Key: SPARK-41848 URL: https://issues.apache.org/jira/browse/SPARK-41848 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: wuyi {code:java} test("SPARK-XXX") { val conf = new SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]") sc = new SparkContext(conf) val req = new TaskResourceRequests().cpus(3) val rp = new ResourceProfileBuilder().require(req).build() val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x => Thread.sleep(5000) x * 2 }.collect() assert(res === Array(0, 2)) } {code} In this test, tasks are supposed to be scheduled in order since each task requires 3 cores but the executor only has 4 cores. However, we noticed 2 tasks are launched concurrently from the logs. It turns out that we used the TaskResourceProfile (taskCpus=3) of the taskset for task scheduling: {code:java} val rpId = taskSet.taskSet.resourceProfileId val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId) val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, conf) {code} but the ResourceProfile (taskCpus=1) of the executor for updating the free cores in ExecutorData: {code:java} val rpId = executorData.resourceProfileId val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId) val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf) executorData.freeCores -= taskCpus {code} which results in the inconsistency of the available cores. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
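The core-accounting mismatch described above can be sketched in a few lines of plain Python (hypothetical names, not Spark's actual scheduler code): the scheduler *checks* launch feasibility against the taskset profile's cpus-per-task (3), but *charges* the executor profile's cpus-per-task (1) against `freeCores`.

```python
TASKSET_TASK_CPUS = 3   # from the taskset's TaskResourceProfile
EXECUTOR_TASK_CPUS = 1  # from the executor's ResourceProfile (default)

def schedule(executor_cores):
    """Simulate the buggy bookkeeping: check with one profile, charge with another."""
    free_cores = executor_cores
    launched = 0
    while free_cores >= TASKSET_TASK_CPUS:   # feasibility uses taskCpus=3
        launched += 1
        free_cores -= EXECUTOR_TASK_CPUS     # bug: only 1 core is deducted
    return launched

# On a 4-core executor, two 3-core tasks are launched concurrently,
# even though only one should fit.
print(schedule(4))  # 2
```

Charging `TASKSET_TASK_CPUS` in the deduction instead would yield 1 launched task, which is the expected in-order behavior.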
[jira] [Updated] (SPARK-41849) Implement DataFrameReader.text
[ https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41849: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code} was: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} > Implement DataFrameReader.text > -- > > Key: SPARK-41849 > URL: https://issues.apache.org/jira/browse/SPARK-41849 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > li
[jira] [Created] (SPARK-41849) Implement DataFrameReader.text
Sandeep Singh created SPARK-41849: - Summary: Implement DataFrameReader.text Key: SPARK-41849 URL: https://issues.apache.org/jira/browse/SPARK-41849 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41814) Column.eqNullSafe fails on NaN comparison
[ https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653736#comment-17653736 ] Ruifeng Zheng commented on SPARK-41814: --- This issue is due to: 1) the conversion from rows to pd.DataFrame, which automatically converts null to NaN; 2) the conversion from pd.DataFrame to pa.Table, which converts NaN back to null.
> Column.eqNullSafe fails on NaN comparison
> -
>
> Key: SPARK-41814
> URL: https://issues.apache.org/jira/browse/SPARK-41814
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in pyspark.sql.connect.column.Column.eqNullSafe
> Failed example:
>     df2.select(
>         df2['value'].eqNullSafe(None),
>         df2['value'].eqNullSafe(float('NaN')),
>         df2['value'].eqNullSafe(42.0)
>     ).show()
> Expected:
>     +----------------+---------------+----------------+
>     |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
>     +----------------+---------------+----------------+
>     |           false|           true|           false|
>     |           false|          false|            true|
>     |            true|          false|           false|
>     +----------------+---------------+----------------+
> Got:
>     +----------------+---------------+----------------+
>     |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
>     +----------------+---------------+----------------+
>     |            true|          false|           false|
>     |           false|          false|            true|
>     |            true|          false|           false|
>     +----------------+---------------+----------------+
> {code}
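The expected table above follows SQL's null-safe equality (`<=>`): NULL matches only NULL, NaN matches NaN, and NULL never matches NaN or a number. A pure-Python sketch of that semantics (not Spark's implementation), applied to the doctest's implied column values:

```python
import math

def eq_null_safe(a, b):
    """Null-safe equality: NULL <=> NULL is true, NaN <=> NaN is true."""
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, float) and isinstance(b, float) \
            and math.isnan(a) and math.isnan(b):
        return True
    return a == b

# The expected output corresponds to column values [NaN, 42.0, None]:
for value in [float("nan"), 42.0, None]:
    print(eq_null_safe(value, None),
          eq_null_safe(value, float("nan")),
          eq_null_safe(value, 42.0))
```

This reproduces the "Expected" rows (false/true/false, false/false/true, true/false/false); the "Got" output differs in the first row because the NaN was silently turned into a null during the pandas round-trip.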
[jira] [Created] (SPARK-41850) Fix DataFrameReader.isnan
Sandeep Singh created SPARK-41850: - Summary: Fix DataFrameReader.isnan Key: SPARK-41850 URL: https://issues.apache.org/jira/browse/SPARK-41850 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Sandeep Singh {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name Failed example: df = spark.read.text(path) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df = spark.read.text(path) AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41850: -- Summary: Fix `isnan` function (was: Fix DataFrameReader.isnan) > Fix `isnan` function > > > Key: SPARK-41850 > URL: https://issues.apache.org/jira/browse/SPARK-41850 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 276, in pyspark.sql.connect.functions.input_file_name > Failed example: > df = spark.read.text(path) > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df = spark.read.text(path) > AttributeError: 'DataFrameReader' object has no attribute 'text'{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41850: -- Description:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 288, in pyspark.sql.connect.functions.isnan
Failed example:
    df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
Expected:
    +---+---+-----+-----+
    |  a|  b|   r1|   r2|
    +---+---+-----+-----+
    |1.0|NaN|false| true|
    |NaN|2.0| true|false|
    +---+---+-----+-----+
Got:
    +----+----+-----+-----+
    |   a|   b|   r1|   r2|
    +----+----+-----+-----+
    | 1.0|null|false|false|
    |null| 2.0|false|false|
    +----+----+-----+-----+
{code}
was:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}
> Fix `isnan` function
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 288, in pyspark.sql.connect.functions.isnan
> Failed example:
>     df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
> Expected:
>     +---+---+-----+-----+
>     |  a|  b|   r1|   r2|
>     +---+---+-----+-----+
>     |1.0|NaN|false| true|
>     |NaN|2.0| true|false|
>     +---+---+-----+-----+
> Got:
>     +----+----+-----+-----+
>     |   a|   b|   r1|   r2|
>     +----+----+-----+-----+
>     | 1.0|null|false|false|
>     |null| 2.0|false|false|
>     +----+----+-----+-----+
> {code}
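The isnan failure above hinges on the NULL/NaN distinction. A short sketch of the expected semantics (not Spark code): `isnan` is true only for a genuine floating-point NaN, and false for NULL and for ordinary numbers.

```python
import math

def is_nan(x):
    """True only for a real float NaN; False for None (NULL) and numbers."""
    return isinstance(x, float) and math.isnan(x)

# If the pandas/Arrow round-trip replaces NaN with null (as in the "Got"
# output above), the NaN row can no longer be detected:
print(is_nan(float("nan")), is_nan(None), is_nan(1.0))  # True False False
```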
[jira] [Commented] (SPARK-41850) Fix `isnan` function
[ https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653738#comment-17653738 ] Sandeep Singh commented on SPARK-41850: --- This should be moved under SPARK-41283
> Fix `isnan` function
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 288, in pyspark.sql.connect.functions.isnan
> Failed example:
>     df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
> Expected:
>     +---+---+-----+-----+
>     |  a|  b|   r1|   r2|
>     +---+---+-----+-----+
>     |1.0|NaN|false| true|
>     |NaN|2.0| true|false|
>     +---+---+-----+-----+
> Got:
>     +----+----+-----+-----+
>     |   a|   b|   r1|   r2|
>     +----+----+-----+-----+
>     | 1.0|null|false|false|
>     |null| 2.0|false|false|
>     +----+----+-----+-----+
> {code}
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1411, in pyspark.sql.connect.functions.map_filter Failed example: df.select(map_filter( "data", lambda _, v: v > 30.0).alias("data_filtered") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(map_filter( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error)
[jira] [Updated] (SPARK-41851) Fix `nanvl` function
[ https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41851: -- Description:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}
was:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count
Failed example:
    df.select(count(expr("*")), count(df.alphabets)).show()
Expected:
    +--------+----------------+
    |count(1)|count(alphabets)|
    +--------+----------------+
    |       4|               3|
    +--------+----------------+
Got:
    +----------------+----------------+
    |count(alphabets)|count(alphabets)|
    +----------------+----------------+
    |               3|               3|
    +----------------+----------------+
{code}
> Fix `nanvl` function
>
> Key: SPARK-41851
> URL: https://issues.apache.org/jira/browse/SPARK-41851
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl
> Failed example:
>     df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect()
> Expected:
>     [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
> Got:
>     [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}
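The expected output for SPARK-41851 implies `nanvl` returns its first argument unless that argument is NaN, in which case it returns the second. A minimal pure-Python sketch of that semantics (the row values below are assumed from the expected output, not taken from the doctest's setup):

```python
import math

def nanvl(a, b):
    """Return a unless it is NaN, otherwise b (sketch of nanvl semantics)."""
    return b if (isinstance(a, float) and math.isnan(a)) else a

# Assuming rows (a=1.0, b=NaN) and (a=NaN, b=2.0), the expected results are:
print(nanvl(1.0, float("nan")), nanvl(float("nan"), 2.0))  # 1.0 2.0
```

The "Got" output (`r2=nan`) suggests the NaN was lost or mishandled on the Connect path, so the fallback to the second argument never fires correctly.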
[jira] [Created] (SPARK-41851) Fix `nanvl` function
Sandeep Singh created SPARK-41851: - Summary: Fix `nanvl` function Key: SPARK-41851 URL: https://issues.apache.org/jira/browse/SPARK-41851 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Sandeep Singh Fix For: 3.4.0
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 801, in pyspark.sql.connect.functions.count
Failed example:
    df.select(count(expr("*")), count(df.alphabets)).show()
Expected:
    +--------+----------------+
    |count(1)|count(alphabets)|
    +--------+----------------+
    |       4|               3|
    +--------+----------------+
Got:
    +----------------+----------------+
    |count(alphabets)|count(alphabets)|
    +----------------+----------------+
    |               3|               3|
    +----------------+----------------+
{code}
[jira] [Commented] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile
[ https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653739#comment-17653739 ] wuyi commented on SPARK-41848: -- cc [~ivoson] > Tasks are over-scheduled with TaskResourceProfile > - > > Key: SPARK-41848 > URL: https://issues.apache.org/jira/browse/SPARK-41848 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: wuyi >Priority: Major > > {code:java} > test("SPARK-XXX") { > val conf = new > SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]") > sc = new SparkContext(conf) > val req = new TaskResourceRequests().cpus(3) > val rp = new ResourceProfileBuilder().require(req).build() > val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x => > Thread.sleep(5000) > x * 2 > }.collect() > assert(res === Array(0, 2)) > } {code} > In this test, tasks are supposed to be scheduled in order since each task > requires 3 cores but the executor only has 4 cores. However, we noticed 2 > tasks are launched concurrently from the logs. > It turns out that we used the TaskResourceProfile (taskCpus=3) of the taskset > for task scheduling: > {code:java} > val rpId = taskSet.taskSet.resourceProfileId > val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, > conf) {code} > but the ResourceProfile (taskCpus=1) of the executor for updating the free > cores in ExecutorData: > {code:java} > val rpId = executorData.resourceProfileId > val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf) > executorData.freeCores -= taskCpus {code} > which results in the inconsistency of the available cores. 
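Editor's note: the accounting inconsistency described in SPARK-41848 can be sketched in a few lines of plain Python (a hypothetical simulation for illustration, not Spark code): the scheduler checks availability against the task set's TaskResourceProfile (taskCpus=3) but debits the executor's free cores using the executor's default profile (taskCpus=1), so a 4-core executor launches both tasks.

```python
def launch_tasks(free_cores: int, check_cpus: int, debit_cpus: int, num_tasks: int) -> int:
    """Simulate the scheduling loop: check availability with one CPU figure,
    but debit the executor's free cores with another."""
    launched = 0
    for _ in range(num_tasks):
        if free_cores >= check_cpus:
            free_cores -= debit_cpus
            launched += 1
    return launched

# Buggy accounting: check with the task set's profile (3 cpus),
# debit with the executor's default profile (1 cpu): both tasks launch.
print(launch_tasks(free_cores=4, check_cpus=3, debit_cpus=1, num_tasks=2))  # 2

# Consistent accounting: only one 3-cpu task fits on a 4-core executor.
print(launch_tasks(free_cores=4, check_cpus=3, debit_cpus=3, num_tasks=2))  # 1
```

With consistent profiles the second task waits until the first finishes, which is the ordering the test above expects.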
[jira] [Updated] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile
[ https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-41848: - Priority: Blocker (was: Major) > Tasks are over-scheduled with TaskResourceProfile > - > > Key: SPARK-41848 > URL: https://issues.apache.org/jira/browse/SPARK-41848 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: wuyi >Priority: Blocker > > {code:java} > test("SPARK-XXX") { > val conf = new > SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]") > sc = new SparkContext(conf) > val req = new TaskResourceRequests().cpus(3) > val rp = new ResourceProfileBuilder().require(req).build() > val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x => > Thread.sleep(5000) > x * 2 > }.collect() > assert(res === Array(0, 2)) > } {code} > In this test, tasks are supposed to be scheduled in order since each task > requires 3 cores but the executor only has 4 cores. However, we noticed 2 > tasks are launched concurrently from the logs. > It turns out that we used the TaskResourceProfile (taskCpus=3) of the taskset > for task scheduling: > {code:java} > val rpId = taskSet.taskSet.resourceProfileId > val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, > conf) {code} > but the ResourceProfile (taskCpus=1) of the executor for updating the free > cores in ExecutorData: > {code:java} > val rpId = executorData.resourceProfileId > val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf) > executorData.freeCores -= taskCpus {code} > which results in the inconsistency of the available cores. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41852) Fix `pmod` function
Sandeep Singh created SPARK-41852:
-------------------------------------

             Summary: Fix `pmod` function
                 Key: SPARK-41852
                 URL: https://issues.apache.org/jira/browse/SPARK-41852
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, PySpark
    Affects Versions: 3.4.0
            Reporter: Sandeep Singh
             Fix For: 3.4.0


{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}
[jira] [Updated] (SPARK-41852) Fix `pmod` function
[ https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-41852:
----------------------------------
    Description: 
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 622, in pyspark.sql.connect.functions.pmod
Failed example:
    df.select(pmod("a", "b")).show()
Expected:
    +----------+
    |pmod(a, b)|
    +----------+
    |       NaN|
    |       NaN|
    |       1.0|
    |       NaN|
    |       1.0|
    |       2.0|
    |      -5.0|
    |       7.0|
    |       1.0|
    +----------+
Got:
    +----------+
    |pmod(a, b)|
    +----------+
    |      null|
    |      null|
    |       1.0|
    |      null|
    |       1.0|
    |       2.0|
    |      -5.0|
    |       7.0|
    |       1.0|
    +----------+
{code}

was:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}

> Fix `pmod` function
> -------------------
>
>                 Key: SPARK-41852
>                 URL: https://issues.apache.org/jira/browse/SPARK-41852
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Sandeep Singh
>            Priority: Major
>             Fix For: 3.4.0
>
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 622, in pyspark.sql.connect.functions.pmod
> Failed example:
>     df.select(pmod("a", "b")).show()
> Expected:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |       NaN|
>     |       NaN|
>     |       1.0|
>     |       NaN|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
> Got:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |      null|
>     |      null|
>     |       1.0|
>     |      null|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
> {code}
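Editor's note: for reference, the NaN behaviour the pmod doctest expects can be reproduced in plain Python. This is a sketch of the SQL semantics visible in the expected output above (NaN operands propagate as NaN, never null), not Spark's actual implementation.

```python
import math

def pmod(a: float, b: float) -> float:
    """Positive modulus matching the expected doctest output:
    any NaN operand yields NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    r = math.fmod(a, b)          # truncated remainder, like Java's %
    if r < 0:
        r = math.fmod(r + b, b)  # shift the result into the divisor's range
    return r

print(pmod(10.0, 3.0))                  # 1.0
print(pmod(-3.0, 4.0))                  # 1.0
print(pmod(-5.0, -6.0))                 # -5.0
print(math.isnan(pmod(1.0, math.nan)))  # True
```

The reported "null" rows are exactly the positions where an input is NaN, which points at the NaN-to-null loss in `createDataFrame` rather than at pmod itself.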
[jira] [Created] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
Ted Yu created SPARK-41853:
------------------------------

             Summary: Use Map in place of SortedMap for ErrorClassesJsonReader
                 Key: SPARK-41853
                 URL: https://issues.apache.org/jira/browse/SPARK-41853
             Project: Spark
          Issue Type: Task
          Components: Spark Core
    Affects Versions: 3.2.3
            Reporter: Ted Yu


The use of SortedMap in ErrorClassesJsonReader was mostly for making tests easier to write.

This PR replaces SortedMap with Map since SortedMap is slower compared to Map.
[jira] [Assigned] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
[ https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41853: Assignee: (was: Apache Spark) > Use Map in place of SortedMap for ErrorClassesJsonReader > > > Key: SPARK-41853 > URL: https://issues.apache.org/jira/browse/SPARK-41853 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: Ted Yu >Priority: Minor > > The use of SortedMap in ErrorClassesJsonReader was mostly for making tests > easier to write. > This PR replaces SortedMap with Map since SortedMap is slower compared to Map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
[ https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653740#comment-17653740 ] Apache Spark commented on SPARK-41853: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/39351 > Use Map in place of SortedMap for ErrorClassesJsonReader > > > Key: SPARK-41853 > URL: https://issues.apache.org/jira/browse/SPARK-41853 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: Ted Yu >Priority: Minor > > The use of SortedMap in ErrorClassesJsonReader was mostly for making tests > easier to write. > This PR replaces SortedMap with Map since SortedMap is slower compared to Map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
[ https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41853: Assignee: Apache Spark > Use Map in place of SortedMap for ErrorClassesJsonReader > > > Key: SPARK-41853 > URL: https://issues.apache.org/jira/browse/SPARK-41853 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Minor > > The use of SortedMap in ErrorClassesJsonReader was mostly for making tests > easier to write. > This PR replaces SortedMap with Map since SortedMap is slower compared to Map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type
[ https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-41847: -- Description: {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1270, in pyspark.sql.connect.functions.explode Failed example: eDF.select(explode(eDF.mapfield).alias("key", "value")).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in eDF.select(explode(eDF.mapfield).alias("key", "value")).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type "STRUCT" while it's required to be "MAP". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1364, in pyspark.sql.connect.functions.inline Failed example: df.select(inline(df.structlist)).show() Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(inline(df.structlist)).show() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error raise SparkConnectAnalysisException( pyspark.sql.connect.client.SparkConnectAnalysisException: [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is of type "ARRAY" while it's required to be "STRUCT". 
Plan: {code} {code:java} File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1411, in pyspark.sql.connect.functions.map_filter Failed example: df.select(map_filter( "data", lambda _, v: v > 30.0).alias("data_filtered") ).show(truncate=False) Exception raised: Traceback (most recent call last): File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run exec(compile(example.source, filename, "single", File "", line 1, in df.select(map_filter( File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show print(self._show_string(n, truncate, vertical)) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string ).toPandas() File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas return self._session.client.to_pandas(query) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error)
[jira] [Commented] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader
[ https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653741#comment-17653741 ] Apache Spark commented on SPARK-41853: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/39351 > Use Map in place of SortedMap for ErrorClassesJsonReader > > > Key: SPARK-41853 > URL: https://issues.apache.org/jira/browse/SPARK-41853 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: Ted Yu >Priority: Minor > > The use of SortedMap in ErrorClassesJsonReader was mostly for making tests > easier to write. > This PR replaces SortedMap with Map since SortedMap is slower compared to Map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41852) Fix `pmod` function
[ https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653743#comment-17653743 ] Ruifeng Zheng commented on SPARK-41852: --- could you please also provide the code to create the dataframe? a known issue is that `session.createDataFrame` doesn't handle NaN/None correctly. > Fix `pmod` function > --- > > Key: SPARK-41852 > URL: https://issues.apache.org/jira/browse/SPARK-41852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 622, in pyspark.sql.connect.functions.pmod > Failed example: > df.select(pmod("a", "b")).show() > Expected: > +--+ > |pmod(a, b)| > +--+ > | NaN| > | NaN| > | 1.0| > | NaN| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > Got: > +--+ > |pmod(a, b)| > +--+ > | null| > | null| > | 1.0| > | null| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41814) Column.eqNullSafe fails on NaN comparison
[ https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653736#comment-17653736 ] Ruifeng Zheng edited comment on SPARK-41814 at 1/3/23 3:06 AM: --- this issue is due to that `createDataFrame` can't handle NaN/None properly: 1, the conversion from rows to pd.DataFrame, which automatically convert null to NaN 2, then the conversion from pd.DataFrame to pa.Table, which convert NaN to null was (Author: podongfeng): this issue is due to: 1, the conversion from rows to pd.DataFrame, which automatically convert null to NaN 2, then the conversion from pd.DataFrame to pa.Table, which convert NaN to null > Column.eqNullSafe fails on NaN comparison > - > > Key: SPARK-41814 > URL: https://issues.apache.org/jira/browse/SPARK-41814 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df2.select( > df2['value'].eqNullSafe(None), > df2['value'].eqNullSafe(float('NaN')), > df2['value'].eqNullSafe(42.0) > ).show() > Expected: > ++---++ > |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| > ++---++ > | false| true| false| > | false| false|true| > |true| false| false| > ++---++ > Got: > ++---++ > |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| > ++---++ > |true| false| false| > | false| false|true| > |true| false| false| > ++---++ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
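Editor's note: the expected `<=>` semantics quoted in SPARK-41814 (NULL matches only NULL, and Spark treats NaN as equal to itself) can be sketched in plain Python for reference. This illustrates the SQL semantics of null-safe equality, not the Connect implementation.

```python
import math

def eq_null_safe(a, b) -> bool:
    """Null-safe equality like Spark SQL's <=> operator."""
    if a is None or b is None:
        return a is None and b is None      # NULL <=> NULL is true
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) or math.isnan(b):
            return math.isnan(a) and math.isnan(b)  # NaN <=> NaN is true
    return a == b

# The three rows of the doctest: value in (NaN, 42.0, NULL)
for value in (float("nan"), 42.0, None):
    print(eq_null_safe(value, None),
          eq_null_safe(value, float("nan")),
          eq_null_safe(value, 42.0))
```

Because `createDataFrame` silently turns the NaN row into a NULL row, the first and third rows of the "Got" table above collapse into the same result.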
[jira] [Commented] (SPARK-41851) Fix `nanvl` function
[ https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653746#comment-17653746 ] Ruifeng Zheng commented on SPARK-41851: --- could you please also provide the code to create the dataframe? a known issue is that `session.createDataFrame` doesn't handle NaN/None correctly. https://issues.apache.org/jira/browse/SPARK-41814 > Fix `nanvl` function > > > Key: SPARK-41851 > URL: https://issues.apache.org/jira/browse/SPARK-41851 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 313, in pyspark.sql.connect.functions.nanvl > Failed example: > df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, > df.b).alias("r2")).collect() > Expected: > [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] > Got: > [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41814) Column.eqNullSafe fails on NaN comparison
[ https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653736#comment-17653736 ] Ruifeng Zheng edited comment on SPARK-41814 at 1/3/23 3:12 AM: --- this issue is due to that `createDataFrame` can't handle NaN/None properly: 1, the conversion from rows to pd.DataFrame, which automatically convert None to NaN 2, then the conversion from pd.DataFrame to pa.Table, which convert NaN to null was (Author: podongfeng): this issue is due to that `createDataFrame` can't handle NaN/None properly: 1, the conversion from rows to pd.DataFrame, which automatically convert null to NaN 2, then the conversion from pd.DataFrame to pa.Table, which convert NaN to null > Column.eqNullSafe fails on NaN comparison > - > > Key: SPARK-41814 > URL: https://issues.apache.org/jira/browse/SPARK-41814 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df2.select( > df2['value'].eqNullSafe(None), > df2['value'].eqNullSafe(float('NaN')), > df2['value'].eqNullSafe(42.0) > ).show() > Expected: > ++---++ > |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| > ++---++ > | false| true| false| > | false| false|true| > |true| false| false| > ++---++ > Got: > ++---++ > |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| > ++---++ > |true| false| false| > | false| false|true| > |true| false| false| > ++---++ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
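Editor's note: the two lossy conversions described in the comment above can be mimicked in plain Python. This is a hypothetical model of the rows -> pd.DataFrame -> pa.Table path, not the actual pandas/Arrow code; the function names are illustrative only.

```python
import math

def rows_to_pandas_like(values):
    # Step 1: building a float64 column coerces Python None to NaN,
    # erasing the null/NaN distinction.
    return [math.nan if v is None else float(v) for v in values]

def pandas_like_to_arrow(column):
    # Step 2: converting the column to Arrow maps NaN back to null.
    return [None if math.isnan(v) else v for v in column]

rows = [1.0, None, float("nan")]
round_tripped = pandas_like_to_arrow(rows_to_pandas_like(rows))
print(round_tripped)  # [1.0, None, None] -- the original NaN is now null
```

The round trip is not the identity: a genuine NaN input comes back as None, and a None input can never come back as NaN, which matches the failures seen in the nanvl, pmod, and eqNullSafe doctests.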
[jira] [Commented] (SPARK-41815) Column.isNull returns nan instead of None
[ https://issues.apache.org/jira/browse/SPARK-41815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653748#comment-17653748 ] Ruifeng Zheng commented on SPARK-41815: --- similar to the issue in `createDataFrame` https://issues.apache.org/jira/browse/SPARK-41814 > Column.isNull returns nan instead of None > - > > Key: SPARK-41815 > URL: https://issues.apache.org/jira/browse/SPARK-41815 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 99, in > pyspark.sql.connect.column.Column.isNull > Failed example: > df.filter(df.height.isNull()).collect() > Expected: > [Row(name='Alice', height=None)] > Got: > [Row(name='Alice', height=nan)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41852) Fix `pmod` function
[ https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653750#comment-17653750 ] Sandeep Singh commented on SPARK-41852: --- [~podongfeng] these are from the doctests {code:java} >>> from pyspark.sql.functions import pmod >>> df = spark.createDataFrame([ ... (1.0, float('nan')), (float('nan'), 2.0), (10.0, 3.0), ... (float('nan'), float('nan')), (-3.0, 4.0), (-10.0, 3.0), ... (-5.0, -6.0), (7.0, -8.0), (1.0, 2.0)], ... ("a", "b")) >>> df.select(pmod("a", "b")).show() {code} > Fix `pmod` function > --- > > Key: SPARK-41852 > URL: https://issues.apache.org/jira/browse/SPARK-41852 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 622, in pyspark.sql.connect.functions.pmod > Failed example: > df.select(pmod("a", "b")).show() > Expected: > +--+ > |pmod(a, b)| > +--+ > | NaN| > | NaN| > | 1.0| > | NaN| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > Got: > +--+ > |pmod(a, b)| > +--+ > | null| > | null| > | 1.0| > | null| > | 1.0| > | 2.0| > | -5.0| > | 7.0| > | 1.0| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41851) Fix `nanvl` function
[ https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653751#comment-17653751 ] Sandeep Singh commented on SPARK-41851: --- [~podongfeng] {code:java} >>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], >>> ("a", "b")) >>> df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, >>> df.b).alias("r2")).collect() {code} > Fix `nanvl` function > > > Key: SPARK-41851 > URL: https://issues.apache.org/jira/browse/SPARK-41851 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 313, in pyspark.sql.connect.functions.nanvl > Failed example: > df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, > df.b).alias("r2")).collect() > Expected: > [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)] > Got: > [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
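Editor's note: the behaviour the nanvl doctest expects can be stated in plain Python (a sketch of the semantics only, not the Connect implementation): return the first operand unless it is NaN, otherwise the second.

```python
import math

def nanvl(a: float, b: float) -> float:
    """Return a if it is not NaN, otherwise b."""
    return a if not math.isnan(a) else b

# The repro rows from the comment above: (1.0, NaN) and (NaN, 2.0)
rows = [(1.0, float("nan")), (float("nan"), 2.0)]
print([nanvl(a, b) for a, b in rows])  # [1.0, 2.0]
```

The "Got" result of `[Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]` is consistent with the second row's NaN having been converted to null on the way in, rather than with a bug in nanvl itself.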
[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653752#comment-17653752 ] Hyukjin Kwon commented on SPARK-39995: -- For: {quote} Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI but not possible with package managers like Poetry. {quote} We can't do this because of the issue in pip itself, see SPARK-32837 > PySpark installation doesn't support Scala 2.13 binaries > > > Key: SPARK-39995 > URL: https://issues.apache.org/jira/browse/SPARK-39995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Oleksandr Shevchenko >Priority: Major > > [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary > [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] > for Scala 2.13. > Currently, the setup > [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] > allows to set versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror > (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always > Scala 2.12 compatible binaries. There isn't any parameter to download > "spark-3.3.0-bin-hadoop3-scala2.13.tgz". > It's possible to download Spark manually and set the needed SPARK_HOME, but > it's hard to use with pip or Poetry. > Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI > but not possible with package managers like Poetry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653753#comment-17653753 ] Hyukjin Kwon commented on SPARK-39995: -- I think i will be able to pick this up before Spark 3.4. > PySpark installation doesn't support Scala 2.13 binaries > > > Key: SPARK-39995 > URL: https://issues.apache.org/jira/browse/SPARK-39995 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Oleksandr Shevchenko >Priority: Major > > [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary > [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] > for Scala 2.13. > Currently, the setup > [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] > allows to set versions of Spark, Hadoop (PYSPARK_HADOOP_VERSION), and mirror > (PYSPARK_RELEASE_MIRROR) to download needed Spark binaries, but it's always > Scala 2.12 compatible binaries. There isn't any parameter to download > "spark-3.3.0-bin-hadoop3-scala2.13.tgz". > It's possible to download Spark manually and set the needed SPARK_HOME, but > it's hard to use with pip or Poetry. > Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI > but not possible with package managers like Poetry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41854) Automatic reformat/check python/setup.py
Hyukjin Kwon created SPARK-41854:
------------------------------------

             Summary: Automatic reformat/check python/setup.py
                 Key: SPARK-41854
                 URL: https://issues.apache.org/jira/browse/SPARK-41854
             Project: Spark
          Issue Type: Test
          Components: Build, PySpark
    Affects Versions: 3.4.0
            Reporter: Hyukjin Kwon


python/setup.py should also be reformatted via ./dev/reformat-python
[jira] [Assigned] (SPARK-41854) Automatic reformat/check python/setup.py
[ https://issues.apache.org/jira/browse/SPARK-41854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41854: Assignee: (was: Apache Spark) > Automatic reformat/check python/setup.py > - > > Key: SPARK-41854 > URL: https://issues.apache.org/jira/browse/SPARK-41854 > Project: Spark > Issue Type: Test > Components: Build, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > python/setup.py should be also reformatted via ./dev/reformat-python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41854) Automatic reformat/check python/setup.py
[ https://issues.apache.org/jira/browse/SPARK-41854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653756#comment-17653756 ] Apache Spark commented on SPARK-41854: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39352 > Automatic reformat/check python/setup.py > - > > Key: SPARK-41854 > URL: https://issues.apache.org/jira/browse/SPARK-41854 > Project: Spark > Issue Type: Test > Components: Build, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > python/setup.py should be also reformatted via ./dev/reformat-python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org