[jira] [Created] (SPARK-41839) Implement SparkSession.sparkContext

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41839:
----------------------------------

 Summary: Implement SparkSession.sparkContext
 Key: SPARK-41839
 URL: https://issues.apache.org/jira/browse/SPARK-41839
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 2119, in pyspark.sql.connect.functions.unix_timestamp
Failed example:
    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.connect.functions.unix_timestamp[...]>", line 1, in <module>
        spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    AttributeError: 'SparkSession' object has no attribute 'conf'{code}
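
For reference, the same call succeeds on the classic (non-Connect) PySpark API, where SparkSession exposes a conf property backed by RuntimeConfig. A minimal sketch of the behavior the doctest expects, assuming a local pyspark installation rather than a Spark Connect session:

{code:python}
# Minimal sketch (classic PySpark API, local session assumed): SparkSession.conf
# is a RuntimeConfig here, which is what the failing doctest relies on.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
print(spark.conf.get("spark.sql.session.timeZone"))  # America/Los_Angeles
spark.stop()
{code}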






[jira] [Created] (SPARK-41840) DataFrame.show(): 'Column' object is not callable

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41840:
----------------------------------

 Summary: DataFrame.show(): 'Column' object is not callable
 Key: SPARK-41840
 URL: https://issues.apache.org/jira/browse/SPARK-41840
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1472, in pyspark.sql.connect.functions.posexplode_outer
Failed example:
    df.select("id", "a_map", posexplode_outer("an_array")).show()
Expected:
    +---+----------+----+----+
    | id|     a_map| pos| col|
    +---+----------+----+----+
    |  1|{x -> 1.0}|   0| foo|
    |  1|{x -> 1.0}|   1| bar|
    |  2|        {}|null|null|
    |  3|      null|null|null|
    +---+----------+----+----+
Got:
    +---+------+----+----+
    | id| a_map| pos| col|
    +---+------+----+----+
    |  1| {1.0}|   0| foo|
    |  1| {1.0}|   1| bar|
    |  2|{null}|null|null|
    |  3|  null|null|null|
    +---+------+----+----+
    {code}
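
The "Expected" table comes from classic PySpark, which renders map values with their keys ({x -> 1.0}); the Connect client drops them. A sketch of the doctest setup, assuming the same input rows as the stock pyspark.sql.functions.posexplode_outer example:

{code:python}
# Sketch of the failing doctest's setup (input rows assumed to match the
# stock example); on classic PySpark, show() prints the a_map column as
# {x -> 1.0}, {} and null, i.e. the "Expected" table above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import posexplode_outer

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [(1, ["foo", "bar"], {"x": 1.0}), (2, [], {}), (3, None, None)],
    ("id", "an_array", "a_map"),
)
df.select("id", "a_map", posexplode_outer("an_array")).show()
spark.stop()
{code}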






[jira] [Updated] (SPARK-41840) DataFrame.show(): 'Column' object is not callable

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41840:
----------------------------------
Description: 
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 855, in pyspark.sql.connect.functions.first
Failed example:
    df.groupby("name").agg(first("age", ignorenulls=True)).orderBy("name").show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.connect.functions.first[...]>", line 1, in <module>
        df.groupby("name").agg(first("age", ignorenulls=True)).orderBy("name").show()
    TypeError: 'Column' object is not callable{code}

  was:
{code:java}
File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 1472, in pyspark.sql.connect.functions.posexplode_outer
Failed example:
    df.select("id", "a_map", posexplode_outer("an_array")).show()
Expected:
    +---+----------+----+----+
    | id|     a_map| pos| col|
    +---+----------+----+----+
    |  1|{x -> 1.0}|   0| foo|
    |  1|{x -> 1.0}|   1| bar|
    |  2|        {}|null|null|
    |  3|      null|null|null|
    +---+----------+----+----+
Got:
    +---+------+----+----+
    | id| a_map| pos| col|
    +---+------+----+----+
    |  1| {1.0}|   0| foo|
    |  1| {1.0}|   1| bar|
    |  2|{null}|null|null|
    |  3|  null|null|null|
    +---+------+----+----+
    {code}


> DataFrame.show(): 'Column' object is not callable
> -------------------------------------------------
>
> Key: SPARK-41840
> URL: https://issues.apache.org/jira/browse/SPARK-41840
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", line 855, in pyspark.sql.connect.functions.first
> Failed example:
>     df.groupby("name").agg(first("age", ignorenulls=True)).orderBy("name").show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.functions.first[...]>", line 1, in <module>
>         df.groupby("name").agg(first("age", ignorenulls=True)).orderBy("name").show()
>     TypeError: 'Column' object is not callable{code}






[jira] [Commented] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs

2023-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653709#comment-17653709
 ] 

Apache Spark commented on SPARK-41804:
--------------------------------------

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/39349

> InterpretedUnsafeProjection doesn't properly handle an array of UDTs
> 
>
> Key: SPARK-41804
> URL: https://issues.apache.org/jira/browse/SPARK-41804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Reproduction steps:
> {noformat}
> // create a file of vector data
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> case class TestRow(varr: Array[Vector])
> val values = Array(0.1d, 0.2d, 0.3d)
> val dv = new DenseVector(values).asInstanceOf[Vector]
> val ds = Seq(TestRow(Array(dv, dv))).toDS
> ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data")
> // this works
> spark.read.format("parquet").load("vector_data").collect
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> // this will get an error
> spark.read.format("parquet").load("vector_data").collect
> {noformat}
> The error varies each time you run it, e.g.:
> {noformat}
> Sparse vectors require that the dimension of the indices match the dimension of the values.
> You provided 2 indices and 6619240 values.
> {noformat}
> or
> {noformat}
> org.apache.spark.SparkRuntimeException: Error while decoding: java.lang.NegativeArraySizeException
> {noformat}
> or
> {noformat}
> java.lang.OutOfMemoryError: Java heap space
>   at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.toDoubleArray(UnsafeArrayData.java:414)
> {noformat}
> or
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0xa) at pc=0x0001120c9d30, pid=64213, tid=0x1003
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_311-b11) (build 1.8.0_311-b11)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.311-b11 mixed mode bsd-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.dylib+0xc9d30]  acl_CopyRight+0x29
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # //hs_err_pid64213.log
> Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory (native)
>  total in heap  [0x00011efa8890,0x00011efa8be8] = 856
>  relocation [0x00011efa89b8,0x00011efa89f8] = 64
>  main code  [0x00011efa8a00,0x00011efa8be8] = 488
> Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory (native)
>  total in heap  [0x00011efa8890,0x00011efa8be8] = 856
>  relocation [0x00011efa89b8,0x00011efa89f8] = 64
>  main code  [0x00011efa8a00,0x00011efa8be8] = 488
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {noformat}
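
A rough PySpark analogue of the Scala reproduction above, offered as a sketch only (the reporter's steps are the Scala ones; this assumes vector UDT inference behaves the same from Python):

{code:python}
# Rough, untested analogue of the Scala repro: write an array-of-vectors
# column to parquet, then read it back with codegen disabled so the
# interpreted projection path has to handle the UDT array.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[1]").getOrCreate()
dv = Vectors.dense([0.1, 0.2, 0.3])
df = spark.createDataFrame([([dv, dv],)], ["varr"])
df.coalesce(1).write.mode("overwrite").parquet("vector_data")

spark.read.parquet("vector_data").collect()  # works: codegen path

spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
spark.read.parquet("vector_data").collect()  # expected to hit the bug
{code}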






[jira] [Assigned] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs

2023-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41804:


Assignee: Apache Spark

> InterpretedUnsafeProjection doesn't properly handle an array of UDTs
> 
>
> Key: SPARK-41804
> URL: https://issues.apache.org/jira/browse/SPARK-41804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>






[jira] [Assigned] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs

2023-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41804:


Assignee: (was: Apache Spark)

> InterpretedUnsafeProjection doesn't properly handle an array of UDTs
> 
>
> Key: SPARK-41804
> URL: https://issues.apache.org/jira/browse/SPARK-41804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>






[jira] [Reopened] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-41818:
----------------------------------

> Support DataFrameWriter.saveAsTable
> -----------------------------------
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 369, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
>     df.write.saveAsTable("tblA")
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[2]>", line 1, in <module>
>         df.write.saveAsTable("tblA")
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 350, in saveAsTable
>         self._spark.client.execute_command(self._write.command(self._spark.client))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 459, in execute_command
>         self._execute(req)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 547, in _execute
>         self._handle_error(rpc_error)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: (java.lang.ClassNotFoundException) .DefaultSource{code}
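
The trailing ".DefaultSource" in the ClassNotFoundException suggests the write command reached the server with an empty source name. A hedged workaround sketch (not a confirmed fix) that pins the format explicitly; the sc:// address is an assumption:

{code:python}
# Workaround sketch (unverified): specify the data source explicitly so the
# server does not try to resolve an empty format name as ".DefaultSource".
# Assumes a Spark Connect server listening at sc://localhost:15002.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.createDataFrame([(100, "Hyukjin Kwon")], ["age", "name"])
df.write.format("parquet").mode("overwrite").saveAsTable("tblA")
{code}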






[jira] [Resolved] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41818.
----------------------------------
Resolution: Fixed

> Support DataFrameWriter.saveAsTable
> -----------------------------------
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41818:
---------------------------------
Parent: (was: SPARK-41281)
Issue Type: Bug  (was: Sub-task)

> Support DataFrameWriter.saveAsTable
> -----------------------------------
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41818:
---------------------------------
Epic Link: SPARK-39375

> Support DataFrameWriter.saveAsTable
> -----------------------------------
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41817) SparkSession.read support reading with schema

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41817:
---------------------------------
Parent: (was: SPARK-41281)
Issue Type: Bug  (was: Sub-task)

> SparkSession.read support reading with schema
> ---------------------------------------------
>
> Key: SPARK-41817
> URL: https://issues.apache.org/jira/browse/SPARK-41817
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", line 122, in pyspark.sql.connect.readwriter.DataFrameReader.load
> Failed example:
>     with tempfile.TemporaryDirectory() as d:
>         # Write a DataFrame into a CSV file with a header
>         df = spark.createDataFrame([{"age": 100, "name": "Hyukjin Kwon"}])
>         df.write.option("header", True).mode("overwrite").format("csv").save(d)
>         # Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon',
>         # and 'header' option set to `True`.
>         df = spark.read.load(d, schema=df.schema, format="csv",
>                              nullValue="Hyukjin Kwon", header=True)
>         df.printSchema()
>         df.show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.readwriter.DataFrameReader.load[1]>", line 10, in <module>
>         df.printSchema()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1039, in printSchema
>         print(self._tree_string())
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1035, in _tree_string
>         query = self._plan.to_proto(self._session.client)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 92, in to_proto
>         plan.root.CopyFrom(self.plan(session))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/plan.py", line 245, in plan
>         plan.read.data_source.schema = self.schema
>     TypeError: bad argument type for built-in operation {code}
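
The failure happens when a StructType is assigned to a string field of the Read proto (plan.read.data_source.schema), so passing the schema as a DDL-formatted string may sidestep it. A hedged sketch; the Connect address and the BIGINT column type are assumptions:

{code:python}
# Workaround sketch (unverified): hand DataFrameReader.load a DDL string
# instead of a StructType. Assumes a Connect server at sc://localhost:15002.
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
with tempfile.TemporaryDirectory() as d:
    df = spark.createDataFrame([{"age": 100, "name": "Hyukjin Kwon"}])
    df.write.option("header", True).mode("overwrite").format("csv").save(d)
    df = spark.read.load(d, format="csv", schema="age BIGINT, name STRING",
                         nullValue="Hyukjin Kwon", header=True)
    df.printSchema()
    df.show()
{code}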






[jira] [Updated] (SPARK-41817) SparkSession.read support reading with schema

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41817:
---------------------------------
Epic Link: SPARK-39375

> SparkSession.read support reading with schema
> ---------------------------------------------
>
> Key: SPARK-41817
> URL: https://issues.apache.org/jira/browse/SPARK-41817
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41818:
---------------------------------
Epic Link: (was: SPARK-39375)

> Support DataFrameWriter.saveAsTable
> -----------------------------------
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41818:
---------------------------------
Parent: SPARK-41284
Issue Type: Sub-task  (was: Bug)

> Support DataFrameWriter.saveAsTable
> -----------------------------------
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41817) SparkSession.read support reading with schema

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41817:
---------------------------------
Epic Link: (was: SPARK-39375)

> SparkSession.read support reading with schema
> ---------------------------------------------
>
> Key: SPARK-41817
> URL: https://issues.apache.org/jira/browse/SPARK-41817
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41817) SparkSession.read support reading with schema

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41817:
---------------------------------
Parent: SPARK-41284
Issue Type: Sub-task  (was: Bug)

> SparkSession.read support reading with schema
> ---------------------------------------------
>
> Key: SPARK-41817
> URL: https://issues.apache.org/jira/browse/SPARK-41817
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Resolved] (SPARK-41659) Enable doctests in pyspark.sql.connect.readwriter

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41659.
----------------------------------
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39331
[https://github.com/apache/spark/pull/39331]

> Enable doctests in pyspark.sql.connect.readwriter
> -------------------------------------------------
>
> Key: SPARK-41659
> URL: https://issues.apache.org/jira/browse/SPARK-41659
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-41819) Implement Dataframe.rdd getNumPartitions

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41819:
---------------------------------
Epic Link: SPARK-39375

> Implement Dataframe.rdd getNumPartitions
> 
>
> Key: SPARK-41819
> URL: https://issues.apache.org/jira/browse/SPARK-41819
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 243, in pyspark.sql.connect.dataframe.DataFrame.coalesce
> Failed example:
>     df.coalesce(1).rdd.getNumPartitions()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.coalesce[...]>", line 1, in <module>
>         df.coalesce(1).rdd.getNumPartitions()
>     AttributeError: 'function' object has no attribute 'getNumPartitions'{code}
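
The AttributeError ('function' object) suggests that DataFrame.rdd is currently a plain method on the Connect client rather than a property returning an RDD. On classic PySpark the doctest passes; a minimal sketch:

{code:python}
# Classic-API sketch: DataFrame.rdd is a property backed by a real RDD,
# so getNumPartitions() is available after coalesce(1).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(10)
print(df.coalesce(1).rdd.getNumPartitions())  # 1
spark.stop()
{code}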






[jira] [Assigned] (SPARK-41659) Enable doctests in pyspark.sql.connect.readwriter

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41659:


Assignee: Hyukjin Kwon

> Enable doctests in pyspark.sql.connect.readwriter
> -------------------------------------------------
>
> Key: SPARK-41659
> URL: https://issues.apache.org/jira/browse/SPARK-41659
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>







[jira] [Updated] (SPARK-41819) Implement Dataframe.rdd getNumPartitions

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41819:
---------------------------------
Parent: (was: SPARK-41281)
Issue Type: Bug  (was: Sub-task)

> Implement Dataframe.rdd getNumPartitions
> 
>
> Key: SPARK-41819
> URL: https://issues.apache.org/jira/browse/SPARK-41819
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41819) Implement Dataframe.rdd getNumPartitions

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41819:
---------------------------------
Parent: SPARK-41279
Issue Type: Sub-task  (was: Bug)

> Implement Dataframe.rdd getNumPartitions
> 
>
> Key: SPARK-41819
> URL: https://issues.apache.org/jira/browse/SPARK-41819
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41819) Implement Dataframe.rdd getNumPartitions

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41819:
---------------------------------
Epic Link: (was: SPARK-39375)

> Implement Dataframe.rdd getNumPartitions
> 
>
> Key: SPARK-41819
> URL: https://issues.apache.org/jira/browse/SPARK-41819
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41828) Implement creating empty Dataframe

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41828:
---------------------------------
Epic Link: SPARK-39375

> Implement creating empty Dataframe
> ----------------------------------
>
> Key: SPARK-41828
> URL: https://issues.apache.org/jira/browse/SPARK-41828
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 99, in pyspark.sql.connect.dataframe.DataFrame.isEmpty
> Failed example:
>     df_empty = spark.createDataFrame([], 'a STRING')
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.isEmpty[...]>", line 1, in <module>
>         df_empty = spark.createDataFrame([], 'a STRING')
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 186, in createDataFrame
>         raise ValueError("Input data cannot be empty")
>     ValueError: Input data cannot be empty{code}
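
Classic PySpark accepts an empty list whenever an explicit schema is supplied; the Connect client's createDataFrame rejects empty input before consulting the schema. A minimal classic-API sketch of the expected behavior:

{code:python}
# Classic-API sketch: an empty DataFrame created with an explicit DDL schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df_empty = spark.createDataFrame([], "a STRING")
print(df_empty.isEmpty())  # True
spark.stop()
{code}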






[jira] [Updated] (SPARK-41828) Implement creating empty Dataframe

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41828:
---------------------------------
Parent: (was: SPARK-41279)
Issue Type: Bug  (was: Sub-task)

> Implement creating empty Dataframe
> ----------------------------------
>
> Key: SPARK-41828
> URL: https://issues.apache.org/jira/browse/SPARK-41828
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41828) Implement creating empty Dataframe

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41828:
---------------------------------
Parent: SPARK-41281
Issue Type: Sub-task  (was: Bug)

> Implement creating empty Dataframe
> ----------------------------------
>
> Key: SPARK-41828
> URL: https://issues.apache.org/jira/browse/SPARK-41828
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Updated] (SPARK-41828) Implement creating empty Dataframe

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41828:
---------------------------------
Epic Link: (was: SPARK-39375)

> Implement creating empty Dataframe
> ----------------------------------
>
> Key: SPARK-41828
> URL: https://issues.apache.org/jira/browse/SPARK-41828
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>






[jira] [Assigned] (SPARK-41835) Implement `transform_keys` function

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41835:


Assignee: (was: Ruifeng Zheng)

> Implement `transform_keys` function
> -----------------------------------
>
> Key: SPARK-41835
> URL: https://issues.apache.org/jira/browse/SPARK-41835
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-41835) Implement `transform_keys` function

2023-01-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653713#comment-17653713
 ] 

Hyukjin Kwon commented on SPARK-41835:
--------------------------------------

test output?

> Implement `transform_keys` function
> -----------------------------------
>
> Key: SPARK-41835
> URL: https://issues.apache.org/jira/browse/SPARK-41835
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-41839) Implement SparkSession.sparkContext

2023-01-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653715#comment-17653715
 ] 

Hyukjin Kwon commented on SPARK-41839:
--------------------------------------

test output?

> Implement SparkSession.sparkContext
> -----------------------------------
>
> Key: SPARK-41839
> URL: https://issues.apache.org/jira/browse/SPARK-41839
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>







[jira] [Commented] (SPARK-41836) Implement `transform_values` function

2023-01-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653714#comment-17653714
 ] 

Hyukjin Kwon commented on SPARK-41836:
--------------------------------------

test output?

> Implement `transform_values` function
> -------------------------------------
>
> Key: SPARK-41836
> URL: https://issues.apache.org/jira/browse/SPARK-41836
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41803) log() function variations are missing

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41803:


Assignee: Martin Grund

> log() function variations are missing
> -------------------------------------
>
> Key: SPARK-41803
> URL: https://issues.apache.org/jira/browse/SPARK-41803
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>







[jira] [Assigned] (SPARK-41803) log() function variations are missing

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41803:


Assignee: Ruifeng Zheng  (was: Martin Grund)

> log() function variations are missing
> -------------------------------------
>
> Key: SPARK-41803
> URL: https://issues.apache.org/jira/browse/SPARK-41803
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-41803) log() function variations are missing

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41803.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39339
[https://github.com/apache/spark/pull/39339]

> log() function variations are missing
> -
>
> Key: SPARK-41803
> URL: https://issues.apache.org/jira/browse/SPARK-41803
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41659) Enable doctests in pyspark.sql.connect.readwriter

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41659:


Assignee: Sandeep Singh  (was: Hyukjin Kwon)

> Enable doctests in pyspark.sql.connect.readwriter
> -
>
> Key: SPARK-41659
> URL: https://issues.apache.org/jira/browse/SPARK-41659
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41655) Enable doctests in pyspark.sql.connect.column

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41655:


Assignee: Sandeep Singh  (was: Hyukjin Kwon)

> Enable doctests in pyspark.sql.connect.column
> -
>
> Key: SPARK-41655
> URL: https://issues.apache.org/jira/browse/SPARK-41655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41654) Enable doctests in pyspark.sql.connect.window

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41654:


Assignee: Sandeep Singh  (was: Hyukjin Kwon)

> Enable doctests in pyspark.sql.connect.window
> -
>
> Key: SPARK-41654
> URL: https://issues.apache.org/jira/browse/SPARK-41654
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41653) Test parity: enable doctests in Spark Connect

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41653:


Assignee: Sandeep Singh  (was: Hyukjin Kwon)

> Test parity: enable doctests in Spark Connect
> -
>
> Key: SPARK-41653
> URL: https://issues.apache.org/jira/browse/SPARK-41653
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
>
> We should actually run the doctests of Spark Connect.
> We should add something like 
> https://github.com/apache/spark/blob/master/python/pyspark/sql/column.py#L1227-L1247
>  to Spark Connect modules, and add the module into 
> https://github.com/apache/spark/blob/master/dev/sparktestsupport/modules.py#L507
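
For illustration, here is a rough sketch of that doctest-runner pattern adapted to a Connect module. The module chosen and the session setup are assumptions for the example, not the final implementation (a Connect module would likely need a remote session backed by a local Spark Connect server rather than a plain local master):

{code:python}
def _test() -> None:
    import doctest
    import sys

    import pyspark.sql.connect.column
    from pyspark.sql import SparkSession

    globs = pyspark.sql.connect.column.__dict__.copy()
    # The doctests expect a `spark` session to be present in their globals.
    globs["spark"] = (
        SparkSession.builder.master("local[4]")
        .appName("sql.connect.column tests")
        .getOrCreate()
    )
    (failure_count, test_count) = doctest.testmod(
        pyspark.sql.connect.column,
        globs=globs,
        optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE,
    )
    globs["spark"].stop()
    if failure_count:
        sys.exit(-1)


if __name__ == "__main__":
    _test()
{code}

The module would then also need to be registered in dev/sparktestsupport/modules.py so the test runner picks it up.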



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41804:


Assignee: Bruce Robbins

> InterpretedUnsafeProjection doesn't properly handle an array of UDTs
> 
>
> Key: SPARK-41804
> URL: https://issues.apache.org/jira/browse/SPARK-41804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> Reproduction steps:
> {noformat}
> // create a file of vector data
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> case class TestRow(varr: Array[Vector])
> val values = Array(0.1d, 0.2d, 0.3d)
> val dv = new DenseVector(values).asInstanceOf[Vector]
> val ds = Seq(TestRow(Array(dv, dv))).toDS
> ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data")
> // this works
> spark.read.format("parquet").load("vector_data").collect
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> // this will get an error
> spark.read.format("parquet").load("vector_data").collect
> {noformat}
> The error varies each time you run it, e.g.:
> {noformat}
> Sparse vectors require that the dimension of the indices match the dimension 
> of the values.
> You provided 2 indices and  6619240 values.
> {noformat}
> or
> {noformat}
> org.apache.spark.SparkRuntimeException: Error while decoding: 
> java.lang.NegativeArraySizeException
> {noformat}
> or
> {noformat}
> java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.toDoubleArray(UnsafeArrayData.java:414)
> {noformat}
> or
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0xa) at pc=0x0001120c9d30, pid=64213, tid=0x1003
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_311-b11) (build 
> 1.8.0_311-b11)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.311-b11 mixed mode bsd-amd64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.dylib+0xc9d30]  acl_CopyRight+0x29
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # //hs_err_pid64213.log
> Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x00011efa8890,0x00011efa8be8] = 856
>  relocation [0x00011efa89b8,0x00011efa89f8] = 64
>  main code  [0x00011efa8a00,0x00011efa8be8] = 488
> Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x00011efa8890,0x00011efa8be8] = 856
>  relocation [0x00011efa89b8,0x00011efa89f8] = 64
>  main code  [0x00011efa8a00,0x00011efa8be8] = 488
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41804.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39349
[https://github.com/apache/spark/pull/39349]

> InterpretedUnsafeProjection doesn't properly handle an array of UDTs
> 
>
> Key: SPARK-41804
> URL: https://issues.apache.org/jira/browse/SPARK-41804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.4.0
>
>
> Reproduction steps:
> {noformat}
> // create a file of vector data
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> case class TestRow(varr: Array[Vector])
> val values = Array(0.1d, 0.2d, 0.3d)
> val dv = new DenseVector(values).asInstanceOf[Vector]
> val ds = Seq(TestRow(Array(dv, dv))).toDS
> ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data")
> // this works
> spark.read.format("parquet").load("vector_data").collect
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
> // this will get an error
> spark.read.format("parquet").load("vector_data").collect
> {noformat}
> The error varies each time you run it, e.g.:
> {noformat}
> Sparse vectors require that the dimension of the indices match the dimension 
> of the values.
> You provided 2 indices and  6619240 values.
> {noformat}
> or
> {noformat}
> org.apache.spark.SparkRuntimeException: Error while decoding: 
> java.lang.NegativeArraySizeException
> {noformat}
> or
> {noformat}
> java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.toDoubleArray(UnsafeArrayData.java:414)
> {noformat}
> or
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0xa) at pc=0x0001120c9d30, pid=64213, tid=0x1003
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_311-b11) (build 
> 1.8.0_311-b11)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.311-b11 mixed mode bsd-amd64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.dylib+0xc9d30]  acl_CopyRight+0x29
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # //hs_err_pid64213.log
> Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x00011efa8890,0x00011efa8be8] = 856
>  relocation [0x00011efa89b8,0x00011efa89f8] = 64
>  main code  [0x00011efa8a00,0x00011efa8be8] = 488
> Compiled method (nm)  582142 11318 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x00011efa8890,0x00011efa8be8] = 856
>  relocation [0x00011efa89b8,0x00011efa89f8] = 64
>  main code  [0x00011efa8a00,0x00011efa8be8] = 488
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41841) Support PyPI packaging without JVM

2023-01-02 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41841:


 Summary: Support PyPI packaging without JVM
 Key: SPARK-41841
 URL: https://issues.apache.org/jira/browse/SPARK-41841
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Connect
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


We should support pip install pyspark without the JVM so that Spark Connect can 
be a truly lightweight library.
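
As a sketch of the intended user experience (the extra name and connection string below are assumptions for illustration, not a committed interface):

{code:sh}
# Hypothetical: install only the pure-Python client, with no JVM or bundled jars.
pip install pyspark[connect]

# The client would then talk to a running Spark Connect server over gRPC
# instead of launching a local JVM.
python -c "
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote('sc://localhost:15002').getOrCreate()
spark.range(3).show()
"
{code}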



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41835) Implement `transform_keys` function

2023-01-02 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653722#comment-17653722
 ] 

Ruifeng Zheng commented on SPARK-41835:
---

This function has already been added.

> Implement `transform_keys` function
> ---
>
> Key: SPARK-41835
> URL: https://issues.apache.org/jira/browse/SPARK-41835
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41656) Enable doctests in pyspark.sql.connect.dataframe

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41656:


Assignee: Sandeep Singh

> Enable doctests in pyspark.sql.connect.dataframe
> 
>
> Key: SPARK-41656
> URL: https://issues.apache.org/jira/browse/SPARK-41656
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41656) Enable doctests in pyspark.sql.connect.dataframe

2023-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41656.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39346
[https://github.com/apache/spark/pull/39346]

> Enable doctests in pyspark.sql.connect.dataframe
> 
>
> Key: SPARK-41656
> URL: https://issues.apache.org/jira/browse/SPARK-41656
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41842:
-

 Summary: Support data type Timestamp(NANOSECOND, null)
 Key: SPARK-41842
 URL: https://issues.apache.org/jira/browse/SPARK-41842
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 99, in pyspark.sql.connect.dataframe.DataFrame.isEmpty
Failed example:
    df_empty = spark.createDataFrame([], 'a STRING')
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df_empty = spark.createDataFrame([], 'a STRING')
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 
186, in createDataFrame
        raise ValueError("Input data cannot be empty")
    ValueError: Input data cannot be empty{code}
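
For comparison, a minimal sketch of the behavior this doctest expects from the classic (non-Connect) API, which accepts an empty list as long as an explicit schema is given:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Classic PySpark builds an empty DataFrame when a schema is supplied.
df_empty = spark.createDataFrame([], "a STRING")
print(df_empty.isEmpty())  # True
{code}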



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41842:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1966, in pyspark.sql.connect.functions.hour
Failed example:
    df.select(hour('ts').alias('hour')).collect()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(hour('ts').alias('hour')).collect()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1017, in collect
        pdf = self.toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
623, in _handle_error
        raise SparkConnectException(status.message, info.reason) from None
    pyspark.sql.connect.client.SparkConnectException: 
(org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
Timestamp(NANOSECOND, null){code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 99, in pyspark.sql.connect.dataframe.DataFrame.isEmpty
Failed example:
    df_empty = spark.createDataFrame([], 'a STRING')
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df_empty = spark.createDataFrame([], 'a STRING')
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 
186, in createDataFrame
        raise ValueError("Input data cannot be empty")
    ValueError: Input data cannot be empty{code}


> Support data type Timestamp(NANOSECOND, null)
> -
>
> Key: SPARK-41842
> URL: https://issues.apache.org/jira/browse/SPARK-41842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1966, in pyspark.sql.connect.functions.hour
> Failed example:
>     df.select(hour('ts').alias('hour')).collect()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.select(hour('ts').alias('hour')).collect()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1017, in collect
>         pdf = self.toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: 
> (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
> Timestamp(NANOSECOND, null){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)

2023-01-02 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653728#comment-17653728
 ] 

Sandeep Singh commented on SPARK-41842:
---

Not sure about the EPIC for this one.

> Support data type Timestamp(NANOSECOND, null)
> -
>
> Key: SPARK-41842
> URL: https://issues.apache.org/jira/browse/SPARK-41842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1966, in pyspark.sql.connect.functions.hour
> Failed example:
>     df.select(hour('ts').alias('hour')).collect()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.select(hour('ts').alias('hour')).collect()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1017, in collect
>         pdf = self.toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: 
> (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
> Timestamp(NANOSECOND, null){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41843) Implement SparkSession.udf

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41843:
-

 Summary: Implement SparkSession.udf
 Key: SPARK-41843
 URL: https://issues.apache.org/jira/browse/SPARK-41843
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1966, in pyspark.sql.connect.functions.hour
Failed example:
    df.select(hour('ts').alias('hour')).collect()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(hour('ts').alias('hour')).collect()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1017, in collect
        pdf = self.toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
623, in _handle_error
        raise SparkConnectException(status.message, info.reason) from None
    pyspark.sql.connect.client.SparkConnectException: 
(org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
Timestamp(NANOSECOND, null){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41843) Implement SparkSession.udf

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41843:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 2331, in pyspark.sql.connect.functions.call_udf
Failed example:
    _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
    AttributeError: 'SparkSession' object has no attribute 'udf'{code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1966, in pyspark.sql.connect.functions.hour
Failed example:
    df.select(hour('ts').alias('hour')).collect()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(hour('ts').alias('hour')).collect()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1017, in collect
        pdf = self.toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
623, in _handle_error
        raise SparkConnectException(status.message, info.reason) from None
    pyspark.sql.connect.client.SparkConnectException: 
(org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
Timestamp(NANOSECOND, null){code}


> Implement SparkSession.udf
> --
>
> Key: SPARK-41843
> URL: https://issues.apache.org/jira/browse/SPARK-41843
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 2331, in pyspark.sql.connect.functions.call_udf
> Failed example:
>     _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
>     AttributeError: 'SparkSession' object has no attribute 'udf'{code}
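
For reference, a minimal sketch of the classic API surface the failing doctest relies on, i.e. the UDFRegistration object exposed as spark.udf together with call_udf:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import call_udf, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

# `spark.udf` is the attribute that Spark Connect's SparkSession is missing.
_ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
df.select(call_udf("intX2", col("id"))).show()
{code}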



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41835) Implement `transform_keys` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41835:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1611, in pyspark.sql.connect.functions.transform_keys
Failed example:
    df.select(transform_keys(
        "data", lambda k, _: upper(k)).alias("data_upper")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, 
in 
        df.select(transform_keys(
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, 
lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 
1 requires the "MAP" type, however "data" has the type "STRUCT".
    Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 
'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
    +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
       +- LocalRelation [0#4488L, 1#4489] {code}

> Implement `transform_keys` function
> ---
>
> Key: SPARK-41835
> URL: https://issues.apache.org/jira/browse/SPARK-41835
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1611, in pyspark.sql.connect.functions.transform_keys
> Failed example:
>     df.select(transform_keys(
>         "data", lambda k, _: upper(k)).alias("data_upper")
>     ).show(truncate=False)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df.select(transform_keys(
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve 
> "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data 
> type mismatch: Parameter 1 requires the "MAP" type, however "data" has the 
> type "STRUCT".
>     Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 
> 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
>     +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
>        +- LocalRelation [0#4488L, 1#4489] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-41835) Implement `transform_keys` function

2023-01-02 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653731#comment-17653731
 ] 

Sandeep Singh commented on SPARK-41835:
---

My bad, the error is about the expected input types.
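
To make the input-type point concrete, a small sketch of transform_keys against a properly map-typed column on the classic API (the sample data is assumed for illustration):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import transform_keys, upper

spark = SparkSession.builder.getOrCreate()

# Parameter 1 must be a MAP column; a Python dict infers to MapType here,
# whereas the failing doctest ended up with a STRUCT column instead.
df = spark.createDataFrame([(1, {"foo": -2.0, "bar": 2.0})], ("id", "data"))
df.select(
    transform_keys("data", lambda k, _: upper(k)).alias("data_upper")
).show(truncate=False)
{code}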

> Implement `transform_keys` function
> ---
>
> Key: SPARK-41835
> URL: https://issues.apache.org/jira/browse/SPARK-41835
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1611, in pyspark.sql.connect.functions.transform_keys
> Failed example:
>     df.select(transform_keys(
>         "data", lambda k, _: upper(k)).alias("data_upper")
>     ).show(truncate=False)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df.select(transform_keys(
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve 
> "transform_keys(data, lambdafunction(upper(x_11), x_11, y_12))" due to data 
> type mismatch: Parameter 1 requires the "MAP" type, however "data" has the 
> type "STRUCT".
>     Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 
> 'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
>     +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
>        +- LocalRelation [0#4488L, 1#4489] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41844) Implement `intX2` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41844:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 2332, in pyspark.sql.connect.functions.call_udf
Failed example:
    df.select(call_udf("intX2", "id")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(call_udf("intX2", "id")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path 
[`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].
    Plan: {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1611, in pyspark.sql.connect.functions.transform_keys
Failed example:
    df.select(transform_keys(
        "data", lambda k, _: upper(k)).alias("data_upper")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, 
in 
        df.select(transform_keys(
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, 
lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 
1 requires the "MAP" type, however "data" has the type "STRUCT".
    Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 
'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
    +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
       +- LocalRelation [0#4488L, 1#4489] {code}


> Implement `intX2` function
> --
>
> Key: SPARK-41844
> URL: https://issues.apache.org/jira/browse/SPARK-41844
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 2332, in pyspark.sql.connect.functions.call_udf
> Failed example:
>     df.select(call_udf("intX2", "id")).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.select(call_udf("intX2", "id")).show()
>       File 
> "/Users/s

[jira] [Created] (SPARK-41844) Implement `intX2` function

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41844:
-

 Summary: Implement `intX2` function
 Key: SPARK-41844
 URL: https://issues.apache.org/jira/browse/SPARK-41844
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Sandeep Singh
 Fix For: 3.4.0


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1611, in pyspark.sql.connect.functions.transform_keys
Failed example:
    df.select(transform_keys(
        "data", lambda k, _: upper(k)).alias("data_upper")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, 
in 
        df.select(transform_keys(
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "transform_keys(data, 
lambdafunction(upper(x_11), x_11, y_12))" due to data type mismatch: Parameter 
1 requires the "MAP" type, however "data" has the type "STRUCT".
    Plan: 'Project [transform_keys(data#4493, lambdafunction('upper(lambda 
'x_11), lambda 'x_11, lambda 'y_12, false)) AS data_upper#4496]
    +- Project [0#4488L AS id#4492L, 1#4489 AS data#4493]
       +- LocalRelation [0#4488L, 1#4489] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41844) Implement `intX2` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh resolved SPARK-41844.
---
Resolution: Invalid

> Implement `intX2` function
> --
>
> Key: SPARK-41844
> URL: https://issues.apache.org/jira/browse/SPARK-41844
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 2332, in pyspark.sql.connect.functions.call_udf
> Failed example:
>     df.select(call_udf("intX2", "id")).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.select(call_udf("intX2", "id")).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path 
> [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].
>     Plan: {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41845) Fix `count(expr("*"))` function

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41845:
-

 Summary: Fix `count(expr("*"))` function
 Key: SPARK-41845
 URL: https://issues.apache.org/jira/browse/SPARK-41845
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Sandeep Singh
 Fix For: 3.4.0


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 2332, in pyspark.sql.connect.functions.call_udf
Failed example:
    df.select(call_udf("intX2", "id")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(call_udf("intX2", "id")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path 
[`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].
    Plan: {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41845) Fix `count(expr("*"))` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41845:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 801, in pyspark.sql.connect.functions.count
Failed example:
    df.select(count(expr("*")), count(df.alphabets)).show()
Expected:
    +++
    |count(1)|count(alphabets)|
    +++
    |       4|               3|
    +++
Got:
    +++
    |count(alphabets)|count(alphabets)|
    +++
    |               3|               3|
    +++
     {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 2332, in pyspark.sql.connect.functions.call_udf
Failed example:
    df.select(call_udf("intX2", "id")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(call_udf("intX2", "id")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_ROUTINE] Cannot resolve function `intX2` on search path 
[`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].
    Plan: {code}


> Fix `count(expr("*"))` function
> ---
>
> Key: SPARK-41845
> URL: https://issues.apache.org/jira/browse/SPARK-41845
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 801, in pyspark.sql.connect.functions.count
> Failed example:
>     df.select(count(expr("*")), count(df.alphabets)).show()
> Expected:
>     +++
>     |count(1)|count(alphabets)|
>     +++
>     |       4|               3|
>     +++
> Got:
>     +++
>     |count(alphabets)|count(alphabets)|
>     +++
>     |               3|               3|
>     +++
>      {code}
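
The expected block follows the usual aggregate semantics: count(*) counts every row, while count(col) skips NULLs. A short sketch against the classic API, with input data assumed from the doctest:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",), (None,)], ["alphabets"])

# count(1) sees all 4 rows; count(alphabets) ignores the NULL, giving 3.
df.select(count(expr("*")), count(df.alphabets)).show()
{code}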



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41823) DataFrame.join creating ambiguous column names

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh resolved SPARK-41823.
---
Resolution: Duplicate

> DataFrame.join creating ambiguous column names
> --
>
> Key: SPARK-41823
> URL: https://issues.apache.org/jira/browse/SPARK-41823
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 254, in pyspark.sql.connect.dataframe.DataFrame.drop
> Failed example:
>     df.join(df2, df.name == df2.name, 'inner').drop('name').show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df.join(df2, df.name == df2.name, 'inner').drop('name').show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, 
> `name`].
>     Plan: {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37521) insert overwrite table but the partition information stored in Metastore was not changed

2023-01-02 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong resolved SPARK-37521.
-
Resolution: Won't Fix

> insert overwrite table but the partition information stored in Metastore was 
> not changed
> 
>
> Key: SPARK-37521
> URL: https://issues.apache.org/jira/browse/SPARK-37521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> I created a partitioned table in Spark SQL, inserted a row, added a regular 
> column, and finally inserted a new row into a partition. Querying works in 
> Spark SQL, but the value of the newly added column comes back as NULL in 
> Hive 2.3.9.
> For example:
> create table updata_col_test1(a int) partitioned by (dt string);
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, 
> '20200101'); -- fails
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 
> 2); -- fails
> insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 
> 2); -- succeeds



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2023-01-02 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653733#comment-17653733
 ] 

jingxiong zhong commented on SPARK-37677:
-

I have fixed this in Hadoop (targeted for the 3.3.5 release), but that version 
has not been released yet. Once it ships, Spark will need to upgrade its Hadoop 
dependency to pick up the fix. [~valux]

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode, there is another problem: after python3.6.6.zip is unpacked in
> the pod, the Python binary has no permission to execute. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41846) DataFrame aggregation functions : unresolved columns

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41846:
-

 Summary: DataFrame aggregation functions : unresolved columns
 Key: SPARK-41846
 URL: https://issues.apache.org/jira/browse/SPARK-41846
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code}
File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in 
pyspark.sql.connect.column.Column.eqNullSafe
Failed example:
df1.join(df2, df1["value"] == df2["value"]).count()
Exception raised:
Traceback (most recent call last):
  File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 
1336, in __run
exec(compile(example.source, filename, "single",
  File "", line 1, 
in 
df1.join(df2, df1["value"] == df2["value"]).count()
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in 
count
pdd = self.agg(_invoke_function("count", lit(1))).toPandas()
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, in 
toPandas
return self._session.client.to_pandas(query)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in 
to_pandas
return self._execute_and_fetch(req)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in 
_execute_and_fetch
self._handle_error(rpc_error)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in 
_handle_error
raise SparkConnectAnalysisException(
pyspark.sql.connect.client.SparkConnectAnalysisException: 
[AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, 
`value`].
{code}
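
For reference, a minimal sketch of what the failing doctest does (the data and 
column names here follow the Column.eqNullSafe doctest and are illustrative, not 
taken from this traceback):
{code:python}
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([Row(id=1, value="foo"), Row(id=2, value=None)])
df2 = spark.createDataFrame([Row(value="bar"), Row(value=None)])

# Classic PySpark disambiguates df1["value"] vs df2["value"] by plan lineage;
# the Connect client loses that lineage, so the reference becomes ambiguous.
print(df1.join(df2, df1["value"] == df2["value"]).count())
{code}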



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39853) Support stage level schedule for standalone cluster when dynamic allocation is disabled

2023-01-02 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-39853:
-
Fix Version/s: 3.4.0

> Support stage level schedule for standalone cluster when dynamic allocation 
> is disabled
> ---
>
> Key: SPARK-39853
> URL: https://issues.apache.org/jira/browse/SPARK-39853
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: huangtengfei
>Assignee: huangtengfei
>Priority: Major
> Fix For: 3.4.0
>
>
> [SPARK-39062|https://issues.apache.org/jira/browse/SPARK-39062] added stage-level
> scheduling support for standalone clusters when dynamic allocation is enabled:
> Spark requests executors for each distinct resource profile.
> When dynamic allocation is disabled, we can still leverage stage-level scheduling
> by matching tasks, based on their resource profiles (task resource requests), to
> executors with the default resource profile. A sketch of the API is shown below.
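
A sketch of the user-facing API this enables (assuming an existing SparkContext 
`sc`; the resource amounts and RDD contents are illustrative):
{code:python}
from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

# Require 3 CPUs per task for this stage only; executors keep the default
# resource profile since dynamic allocation is off.
reqs = TaskResourceRequests().cpus(3)
profile = ResourceProfileBuilder().require(reqs).build  # `build` is a property in PySpark

rdd = sc.parallelize(range(8), numSlices=4).withResources(profile)
print(rdd.map(lambda x: x * 2).collect())
{code}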



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41846) DataFrame aggregation functions : unresolved columns

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41846:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
drank#4003]
    +- Project [0#3998L AS _1#4000L]
       +- LocalRelation [0#3998L] {code}

  was:
{code}
File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in 
pyspark.sql.connect.column.Column.eqNullSafe
Failed example:
df1.join(df2, df1["value"] == df2["value"]).count()
Exception raised:
Traceback (most recent call last):
  File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 
1336, in __run
exec(compile(example.source, filename, "single",
  File "", line 1, 
in 
df1.join(df2, df1["value"] == df2["value"]).count()
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in 
count
pdd = self.agg(_invoke_function("count", lit(1))).toPandas()
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, in 
toPandas
return self._session.client.to_pandas(query)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in 
to_pandas
return self._execute_and_fetch(req)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in 
_execute_and_fetch
self._handle_error(rpc_error)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in 
_handle_error
raise SparkConnectAnalysisException(
pyspark.sql.connect.client.SparkConnectAnalysisException: 
[AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, 
`value`].
{code}


> DataFrame aggregation functions : unresolved columns
> 
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 

[jira] [Updated] (SPARK-41846) DataFrame aggregation functions : unresolved columns

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41846:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
drank#4003]
    +- Project [0#3998L AS _1#4000L]
       +- LocalRelation [0#3998L] {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1032, in pyspark.sql.connect.functions.cume_dist
Failed example:
    df.withColumn("cd", cume_dist().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("cd", cume_dist().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) 
AS cd#2205]
    +- Project [0#2200L AS _1#2202L]
       +- LocalRelation [0#2200L] {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.sin

[jira] [Updated] (SPARK-41846) DataFrame windowspec functions : unresolved columns

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41846:
--
Summary: DataFrame windowspec functions : unresolved columns  (was: 
DataFrame aggregation functions : unresolved columns)

> DataFrame windowspec functions : unresolved columns
> ---
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L] {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L] {code}
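
For reference, a sketch of what the failing doctests do (the setup follows the 
pyspark.sql.functions.rank doctest; the `_1` vs `value` naming is the point of 
the failure):
{code:python}
from pyspark.sql import SparkSession, Window, types
from pyspark.sql.functions import rank

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([1, 1, 2, 3, 3, 4], types.IntegerType())
# Classic PySpark names the single column `value`; the Connect client
# materializes it as `_1` (see the plan above), so `value` cannot resolve.
w = Window.orderBy("value")
df.withColumn("drank", rank().over(w)).show()
{code}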



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...

[jira] [Created] (SPARK-41847) DataFrame mapfield invalid type

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41847:
-

 Summary: DataFrame mapfield invalid type
 Key: SPARK-41847
 URL: https://issues.apache.org/jira/browse/SPARK-41847
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
drank#4003]
    +- Project [0#3998L AS _1#4000L]
       +- LocalRelation [0#3998L] {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1032, in pyspark.sql.connect.functions.cume_dist
Failed example:
    df.withColumn("cd", cume_dist().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("cd", cume_dist().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) 
AS cd#2205]
    +- Project [0#2200L AS _1#2202L]
       +- LocalRelation [0#2200L] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41847) DataFrame mapfield invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
drank#4003]
    +- Project [0#3998L AS _1#4000L]
       +- LocalRelation [0#3998L] {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1032, in pyspark.sql.connect.functions.cume_dist
Failed example:
    df.withColumn("cd", cume_dist().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("cd", cume_dist().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 

[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Summary: DataFrame mapfield,structlist invalid type  (was: DataFrame 
mapfield invalid type)

> DataFrame mapfield,structlist invalid type
> --
>
> Key: SPARK-41847
> URL: https://issues.apache.org/jira/browse/SPARK-41847
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1270, in pyspark.sql.connect.functions.explode
> Failed example:
>     eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
> "STRUCT" while it's required to be "MAP".
>     Plan:  {code}
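
For reference, a sketch of the failing doctest's setup (from 
pyspark.sql.functions.explode; the row contents are illustrative):
{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

eDF = spark.createDataFrame([Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})])
# The Connect client infers the Python dict as a STRUCT instead of a MAP,
# which is exactly what the INVALID_COLUMN_OR_FIELD_DATA_TYPE error reports.
eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
{code}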



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
   

[jira] [Created] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile

2023-01-02 Thread wuyi (Jira)
wuyi created SPARK-41848:


 Summary: Tasks are over-scheduled with TaskResourceProfile
 Key: SPARK-41848
 URL: https://issues.apache.org/jira/browse/SPARK-41848
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: wuyi


{code:java}
test("SPARK-XXX") {
  val conf = new 
SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]")
  sc = new SparkContext(conf)
  val req = new TaskResourceRequests().cpus(3)
  val rp = new ResourceProfileBuilder().require(req).build()

  val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x =>
Thread.sleep(5000)
x * 2
  }.collect()
  assert(res === Array(0, 2))
} {code}
In this test, tasks are supposed to be scheduled one at a time, since each task 
requires 3 cores and the executor only has 4. However, the logs show that 2 
tasks are launched concurrently.

It turns out that task scheduling uses the taskset's TaskResourceProfile 
(taskCpus=3):
{code:java}
val rpId = taskSet.taskSet.resourceProfileId
val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId)
val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, 
conf) {code}
while updating the free cores in ExecutorData uses the executor's 
ResourceProfile (taskCpus=1):
{code:java}
val rpId = executorData.resourceProfileId
val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId)
val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf)
executorData.freeCores -= taskCpus {code}
which makes the scheduler's and the executor's views of the available cores 
inconsistent. A toy model of the mismatch follows.
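
A toy model of the bookkeeping mismatch (plain Python, purely illustrative):
{code:python}
# Executor with 4 cores; the scheduler checks 3 cpus/task (taskset profile)
# but the bookkeeping subtracts 1 cpu/task (executor's default profile).
free_cores = 4
task_cpus_for_scheduling = 3
task_cpus_for_accounting = 1

launched = 0
while free_cores >= task_cpus_for_scheduling:
    free_cores -= task_cpus_for_accounting  # should subtract 3, not 1
    launched += 1

print(launched)  # 2 -- two 3-core tasks "fit" on a 4-core executor
{code}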



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41849) Implement DataFrameReader.text

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41849:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}


> Implement DataFrameReader.text
> --
>
> Key: SPARK-41849
> URL: https://issues.apache.org/jira/browse/SPARK-41849
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  li

[jira] [Created] (SPARK-41849) Implement DataFrameReader.text

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41849:
-

 Summary: Implement DataFrameReader.text
 Key: SPARK-41849
 URL: https://issues.apache.org/jira/browse/SPARK-41849
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41814) Column.eqNullSafe fails on NaN comparison

2023-01-02 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653736#comment-17653736
 ] 

Ruifeng Zheng commented on SPARK-41814:
---

This issue is due to:
1. the conversion from rows to pd.DataFrame, which automatically converts null to NaN;
2. the subsequent conversion from pd.DataFrame to pa.Table, which converts NaN back to null.
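
A minimal sketch of the two conversions (assuming default pandas/pyarrow 
behavior; the values are illustrative):
{code:python}
import pandas as pd
import pyarrow as pa

# 1) rows -> pd.DataFrame: float columns have no distinct null, so null becomes NaN.
pdf = pd.DataFrame({"value": [None, float("nan"), 42.0]})
print(pdf["value"].tolist())   # [nan, nan, 42.0] -- null and NaN collapse

# 2) pd.DataFrame -> pa.Table: NaN is converted back to null by default.
tbl = pa.Table.from_pandas(pdf)
print(tbl.column("value"))     # [null, null, 42.0]
{code}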

> Column.eqNullSafe fails on NaN comparison
> -
>
> Key: SPARK-41814
> URL: https://issues.apache.org/jira/browse/SPARK-41814
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in 
> pyspark.sql.connect.column.Column.eqNullSafe
> Failed example:
> df2.select(
> df2['value'].eqNullSafe(None),
> df2['value'].eqNullSafe(float('NaN')),
> df2['value'].eqNullSafe(42.0)
> ).show()
> Expected:
> ++---++
> |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
> ++---++
> |   false|   true|   false|
> |   false|  false|true|
> |true|  false|   false|
> ++---++
> Got:
> ++---++
> |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
> ++---++
> |true|  false|   false|
> |   false|  false|true|
> |true|  false|   false|
> ++---++
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41850) Fix DataFrameReader.isnan

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41850:
-

 Summary: Fix DataFrameReader.isnan
 Key: SPARK-41850
 URL: https://issues.apache.org/jira/browse/SPARK-41850
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}
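
For reference, the classic PySpark reader API that the doctest expects (a 
sketch; `/tmp/sample.txt` is an illustrative path, and the Connect reader 
lacked this method at the time):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrameReader.text reads each line of the file(s) into a single
# string column named `value`.
df = spark.read.text("/tmp/sample.txt")
df.show()
{code}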



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41850) Fix `isnan` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41850:
--
Summary: Fix `isnan` function  (was: Fix DataFrameReader.isnan)

> Fix `isnan` function
> 
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41850) Fix `isnan` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41850:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 288, in pyspark.sql.connect.functions.isnan
Failed example:
    df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
Expected:
    +---+---+-+-+
    |  a|  b|   r1|   r2|
    +---+---+-+-+
    |1.0|NaN|false| true|
    |NaN|2.0| true|false|
    +---+---+-+-+
Got:
    +++-+-+
    |   a|   b|   r1|   r2|
    +++-+-+
    | 1.0|null|false|false|
    |null| 2.0|false|false|
    +++-+-+
    {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}


> Fix `isnan` function
> 
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 288, in pyspark.sql.connect.functions.isnan
> Failed example:
>     df.select("a", "b", isnan("a").alias("r1"), 
> isnan(df.b).alias("r2")).show()
> Expected:
>     +---+---+-+-+
>     |  a|  b|   r1|   r2|
>     +---+---+-+-+
>     |1.0|NaN|false| true|
>     |NaN|2.0| true|false|
>     +---+---+-+-+
> Got:
>     +++-+-+
>     |   a|   b|   r1|   r2|
>     +++-+-+
>     | 1.0|null|false|false|
>     |null| 2.0|false|false|
>     +++-+-+
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41850) Fix `isnan` function

2023-01-02 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653738#comment-17653738
 ] 

Sandeep Singh commented on SPARK-41850:
---

This should be moved under SPARK-41283

> Fix `isnan` function
> 
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 288, in pyspark.sql.connect.functions.isnan
> Failed example:
>     df.select("a", "b", isnan("a").alias("r1"), 
> isnan(df.b).alias("r2")).show()
> Expected:
>     +---+---+-+-+
>     |  a|  b|   r1|   r2|
>     +---+---+-+-+
>     |1.0|NaN|false| true|
>     |NaN|2.0| true|false|
>     +---+---+-+-+
> Got:
>     +++-+-+
>     |   a|   b|   r1|   r2|
>     +++-+-+
>     | 1.0|null|false|false|
>     |null| 2.0|false|false|
>     +++-+-+
>     {code}
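
For reference, a sketch of the failing doctest's setup (the data matches the 
expected table above; the constructor form follows the pyspark.sql.functions.isnan 
doctest):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1.0, float("nan")), (float("nan"), 2.0)], ("a", "b"))
# Under Connect the NaNs arrive as nulls (see SPARK-41814), so isnan
# returns false everywhere instead of flagging them.
df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
{code}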



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
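The doctest DataFrame creation is not shown above; a minimal sketch, assuming 
the standard `explode` doctest data (here `mapfield` appears to be inferred as 
a struct rather than a map):
{code:java}
from pyspark.sql import Row
from pyspark.sql.functions import explode

# assumed: the standard explode doctest data; `spark` is the active session
eDF = spark.createDataFrame([Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})])
eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
{code}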
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1411, in pyspark.sql.connect.functions.map_filter
Failed example:
    df.select(map_filter(
        "data", lambda _, v: v > 30.0).alias("data_filtered")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(map_filter(
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
 

[jira] [Updated] (SPARK-41851) Fix `nanvl` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41851:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 801, in pyspark.sql.connect.functions.count
Failed example:
    df.select(count(expr("*")), count(df.alphabets)).show()
Expected:
    +--------+----------------+
    |count(1)|count(alphabets)|
    +--------+----------------+
    |       4|               3|
    +--------+----------------+
Got:
    +----------------+----------------+
    |count(alphabets)|count(alphabets)|
    +----------------+----------------+
    |               3|               3|
    +----------------+----------------+
     {code}


> Fix `nanvl` function
> 
>
> Key: SPARK-41851
> URL: https://issues.apache.org/jira/browse/SPARK-41851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 313, in pyspark.sql.connect.functions.nanvl
> Failed example:
>     df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
> df.b).alias("r2")).collect()
> Expected:
>     [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
> Got:
>     [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41851) Fix `nanvl` function

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41851:
-

 Summary: Fix `nanvl` function
 Key: SPARK-41851
 URL: https://issues.apache.org/jira/browse/SPARK-41851
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Sandeep Singh
 Fix For: 3.4.0


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 801, in pyspark.sql.connect.functions.count
Failed example:
    df.select(count(expr("*")), count(df.alphabets)).show()
Expected:
    +--------+----------------+
    |count(1)|count(alphabets)|
    +--------+----------------+
    |       4|               3|
    +--------+----------------+
Got:
    +----------------+----------------+
    |count(alphabets)|count(alphabets)|
    +----------------+----------------+
    |               3|               3|
    +----------------+----------------+
     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile

2023-01-02 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653739#comment-17653739
 ] 

wuyi commented on SPARK-41848:
--

cc [~ivoson] 

> Tasks are over-scheduled with TaskResourceProfile
> -
>
> Key: SPARK-41848
> URL: https://issues.apache.org/jira/browse/SPARK-41848
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: wuyi
>Priority: Major
>
> {code:java}
> test("SPARK-XXX") {
>   val conf = new 
> SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]")
>   sc = new SparkContext(conf)
>   val req = new TaskResourceRequests().cpus(3)
>   val rp = new ResourceProfileBuilder().require(req).build()
>   val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x =>
> Thread.sleep(5000)
> x * 2
>   }.collect()
>   assert(res === Array(0, 2))
> } {code}
> In this test, tasks are supposed to be scheduled sequentially, since each task 
> requires 3 cores and the executor only has 4 cores. However, the logs show 
> that 2 tasks are launched concurrently.
> It turns out that task scheduling uses the TaskResourceProfile (taskCpus=3) of 
> the taskset:
> {code:java}
> val rpId = taskSet.taskSet.resourceProfileId
> val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId)
> val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, 
> conf) {code}
> but updating the free cores in ExecutorData uses the ResourceProfile 
> (taskCpus=1) of the executor:
> {code:java}
> val rpId = executorData.resourceProfileId
> val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId)
> val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf)
> executorData.freeCores -= taskCpus {code}
> which makes the accounting of available cores inconsistent.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile

2023-01-02 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-41848:
-
Priority: Blocker  (was: Major)

> Tasks are over-scheduled with TaskResourceProfile
> -
>
> Key: SPARK-41848
> URL: https://issues.apache.org/jira/browse/SPARK-41848
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: wuyi
>Priority: Blocker
>
> {code:java}
> test("SPARK-XXX") {
>   val conf = new 
> SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]")
>   sc = new SparkContext(conf)
>   val req = new TaskResourceRequests().cpus(3)
>   val rp = new ResourceProfileBuilder().require(req).build()
>   val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x =>
> Thread.sleep(5000)
> x * 2
>   }.collect()
>   assert(res === Array(0, 2))
> } {code}
> In this test, tasks are supposed to be scheduled sequentially, since each task 
> requires 3 cores and the executor only has 4 cores. However, the logs show 
> that 2 tasks are launched concurrently.
> It turns out that task scheduling uses the TaskResourceProfile (taskCpus=3) of 
> the taskset:
> {code:java}
> val rpId = taskSet.taskSet.resourceProfileId
> val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId)
> val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, 
> conf) {code}
> but updating the free cores in ExecutorData uses the ResourceProfile 
> (taskCpus=1) of the executor:
> {code:java}
> val rpId = executorData.resourceProfileId
> val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId)
> val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf)
> executorData.freeCores -= taskCpus {code}
> which makes the accounting of available cores inconsistent.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41852) Fix `pmod` function

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41852:
-

 Summary: Fix `pmod` function
 Key: SPARK-41852
 URL: https://issues.apache.org/jira/browse/SPARK-41852
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Sandeep Singh
 Fix For: 3.4.0


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41852) Fix `pmod` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41852:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 622, in pyspark.sql.connect.functions.pmod
Failed example:
    df.select(pmod("a", "b")).show()
Expected:
    +----------+
    |pmod(a, b)|
    +----------+
    |       NaN|
    |       NaN|
    |       1.0|
    |       NaN|
    |       1.0|
    |       2.0|
    |      -5.0|
    |       7.0|
    |       1.0|
    +----------+
Got:
    +----------+
    |pmod(a, b)|
    +----------+
    |      null|
    |      null|
    |       1.0|
    |      null|
    |       1.0|
    |       2.0|
    |      -5.0|
    |       7.0|
    |       1.0|
    +----------+
    {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}


> Fix `pmod` function
> ---
>
> Key: SPARK-41852
> URL: https://issues.apache.org/jira/browse/SPARK-41852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 622, in pyspark.sql.connect.functions.pmod
> Failed example:
>     df.select(pmod("a", "b")).show()
> Expected:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |       NaN|
>     |       NaN|
>     |       1.0|
>     |       NaN|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
> Got:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |      null|
>     |      null|
>     |       1.0|
>     |      null|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader

2023-01-02 Thread Ted Yu (Jira)
Ted Yu created SPARK-41853:
--

 Summary: Use Map in place of SortedMap for ErrorClassesJsonReader
 Key: SPARK-41853
 URL: https://issues.apache.org/jira/browse/SPARK-41853
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.2.3
Reporter: Ted Yu


The use of SortedMap in ErrorClassesJsonReader was mostly for making tests 
easier to write.

This PR replaces SortedMap with Map, since SortedMap is slower than Map.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader

2023-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41853:


Assignee: (was: Apache Spark)

> Use Map in place of SortedMap for ErrorClassesJsonReader
> 
>
> Key: SPARK-41853
> URL: https://issues.apache.org/jira/browse/SPARK-41853
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.2.3
>Reporter: Ted Yu
>Priority: Minor
>
> The use of SortedMap in ErrorClassesJsonReader was mostly for making tests 
> easier to write.
> This PR replaces SortedMap with Map, since SortedMap is slower than Map.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader

2023-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653740#comment-17653740
 ] 

Apache Spark commented on SPARK-41853:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/39351

> Use Map in place of SortedMap for ErrorClassesJsonReader
> 
>
> Key: SPARK-41853
> URL: https://issues.apache.org/jira/browse/SPARK-41853
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.2.3
>Reporter: Ted Yu
>Priority: Minor
>
> The use of SortedMap in ErrorClassesJsonReader was mostly for making tests 
> easier to write.
> This PR replaces SortedMap with Map, since SortedMap is slower than Map.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader

2023-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41853:


Assignee: Apache Spark

> Use Map in place of SortedMap for ErrorClassesJsonReader
> 
>
> Key: SPARK-41853
> URL: https://issues.apache.org/jira/browse/SPARK-41853
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.2.3
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
>
> The use of SortedMap in ErrorClassesJsonReader was mostly for making tests 
> easier to write.
> This PR replaces SortedMap with Map, since SortedMap is slower than Map.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41853) Use Map in place of SortedMap for ErrorClassesJsonReader

2023-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653741#comment-17653741
 ] 

Apache Spark commented on SPARK-41853:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/39351

> Use Map in place of SortedMap for ErrorClassesJsonReader
> 
>
> Key: SPARK-41853
> URL: https://issues.apache.org/jira/browse/SPARK-41853
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.2.3
>Reporter: Ted Yu
>Priority: Minor
>
> The use of SortedMap in ErrorClassesJsonReader was mostly for making tests 
> easier to write.
> This PR replaces SortedMap with Map, since SortedMap is slower than Map.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41852) Fix `pmod` function

2023-01-02 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653743#comment-17653743
 ] 

Ruifeng Zheng commented on SPARK-41852:
---

Could you please also provide the code used to create the DataFrame?

A known issue is that `session.createDataFrame` doesn't handle NaN/None 
correctly.

> Fix `pmod` function
> ---
>
> Key: SPARK-41852
> URL: https://issues.apache.org/jira/browse/SPARK-41852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 622, in pyspark.sql.connect.functions.pmod
> Failed example:
>     df.select(pmod("a", "b")).show()
> Expected:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |       NaN|
>     |       NaN|
>     |       1.0|
>     |       NaN|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
> Got:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |      null|
>     |      null|
>     |       1.0|
>     |      null|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41814) Column.eqNullSafe fails on NaN comparison

2023-01-02 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653736#comment-17653736
 ] 

Ruifeng Zheng edited comment on SPARK-41814 at 1/3/23 3:06 AM:
---

This issue is due to `createDataFrame` not handling NaN/None properly:
1. the conversion from rows to pd.DataFrame, which automatically converts null 
to NaN;
2. then the conversion from pd.DataFrame to pa.Table, which converts NaN to null.


was (Author: podongfeng):
This issue is due to:
1. the conversion from rows to pd.DataFrame, which automatically converts null 
to NaN;
2. then the conversion from pd.DataFrame to pa.Table, which converts NaN to null.

> Column.eqNullSafe fails on NaN comparison
> -
>
> Key: SPARK-41814
> URL: https://issues.apache.org/jira/browse/SPARK-41814
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in 
> pyspark.sql.connect.column.Column.eqNullSafe
> Failed example:
> df2.select(
>     df2['value'].eqNullSafe(None),
>     df2['value'].eqNullSafe(float('NaN')),
>     df2['value'].eqNullSafe(42.0)
> ).show()
> Expected:
> +----------------+---------------+----------------+
> |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
> +----------------+---------------+----------------+
> |           false|           true|           false|
> |           false|          false|            true|
> |            true|          false|           false|
> +----------------+---------------+----------------+
> Got:
> +----------------+---------------+----------------+
> |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
> +----------------+---------------+----------------+
> |            true|          false|           false|
> |           false|          false|            true|
> |            true|          false|           false|
> +----------------+---------------+----------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41851) Fix `nanvl` function

2023-01-02 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653746#comment-17653746
 ] 

Ruifeng Zheng commented on SPARK-41851:
---

Could you please also provide the code used to create the DataFrame?

A known issue is that `session.createDataFrame` doesn't handle NaN/None 
correctly.

https://issues.apache.org/jira/browse/SPARK-41814

> Fix `nanvl` function
> 
>
> Key: SPARK-41851
> URL: https://issues.apache.org/jira/browse/SPARK-41851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 313, in pyspark.sql.connect.functions.nanvl
> Failed example:
>     df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
> df.b).alias("r2")).collect()
> Expected:
>     [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
> Got:
>     [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41814) Column.eqNullSafe fails on NaN comparison

2023-01-02 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653736#comment-17653736
 ] 

Ruifeng Zheng edited comment on SPARK-41814 at 1/3/23 3:12 AM:
---

This issue is due to `createDataFrame` not handling NaN/None properly:
1. the conversion from rows to pd.DataFrame, which automatically converts None 
to NaN;
2. then the conversion from pd.DataFrame to pa.Table, which converts NaN to null.
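
A minimal sketch of the two conversions using plain pandas/pyarrow (the column 
name `value` is illustrative):
{code:java}
import pandas as pd
import pyarrow as pa

# step 1: rows -> pd.DataFrame; None becomes NaN in a float64 column
pdf = pd.DataFrame({"value": [42.0, None]})
print(pdf["value"].tolist())             # [42.0, nan]

# step 2: pd.DataFrame -> pa.Table; NaN is treated as missing by default
table = pa.Table.from_pandas(pdf)
print(table.column("value").null_count)  # 1
{code}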


was (Author: podongfeng):
This issue is due to `createDataFrame` not handling NaN/None properly:
1. the conversion from rows to pd.DataFrame, which automatically converts null 
to NaN;
2. then the conversion from pd.DataFrame to pa.Table, which converts NaN to null.

> Column.eqNullSafe fails on NaN comparison
> -
>
> Key: SPARK-41814
> URL: https://issues.apache.org/jira/browse/SPARK-41814
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/column.py", line 115, in 
> pyspark.sql.connect.column.Column.eqNullSafe
> Failed example:
> df2.select(
>     df2['value'].eqNullSafe(None),
>     df2['value'].eqNullSafe(float('NaN')),
>     df2['value'].eqNullSafe(42.0)
> ).show()
> Expected:
> +----------------+---------------+----------------+
> |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
> +----------------+---------------+----------------+
> |           false|           true|           false|
> |           false|          false|            true|
> |            true|          false|           false|
> +----------------+---------------+----------------+
> Got:
> +----------------+---------------+----------------+
> |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
> +----------------+---------------+----------------+
> |            true|          false|           false|
> |           false|          false|            true|
> |            true|          false|           false|
> +----------------+---------------+----------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41815) Column.isNull returns nan instead of None

2023-01-02 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653748#comment-17653748
 ] 

Ruifeng Zheng commented on SPARK-41815:
---

Similar to the issue in `createDataFrame`:
https://issues.apache.org/jira/browse/SPARK-41814
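
The failing doctest presumably uses the standard `Column.isNull` example data; 
a minimal sketch:
{code:java}
from pyspark.sql import Row

# assumed: the standard isNull doctest data; `spark` is the active session
df = spark.createDataFrame([Row(name='Tom', height=80),
                            Row(name='Alice', height=None)])
df.filter(df.height.isNull()).collect()
# plain PySpark returns [Row(name='Alice', height=None)]; over Spark Connect
# the pandas/arrow round trip in createDataFrame yields height=nan instead
{code}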



> Column.isNull returns nan instead of None
> -
>
> Key: SPARK-41815
> URL: https://issues.apache.org/jira/browse/SPARK-41815
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/column.py", line 99, in 
> pyspark.sql.connect.column.Column.isNull
> Failed example:
> df.filter(df.height.isNull()).collect()
> Expected:
> [Row(name='Alice', height=None)]
> Got:
> [Row(name='Alice', height=nan)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41852) Fix `pmod` function

2023-01-02 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653750#comment-17653750
 ] 

Sandeep Singh commented on SPARK-41852:
---

[~podongfeng] These are from the doctests:
{code:java}
>>> from pyspark.sql.functions import pmod
>>> df = spark.createDataFrame([
... (1.0, float('nan')), (float('nan'), 2.0), (10.0, 3.0),
... (float('nan'), float('nan')), (-3.0, 4.0), (-10.0, 3.0),
... (-5.0, -6.0), (7.0, -8.0), (1.0, 2.0)],
... ("a", "b"))
>>> df.select(pmod("a", "b")).show() {code}

> Fix `pmod` function
> ---
>
> Key: SPARK-41852
> URL: https://issues.apache.org/jira/browse/SPARK-41852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 622, in pyspark.sql.connect.functions.pmod
> Failed example:
>     df.select(pmod("a", "b")).show()
> Expected:
>     +--+
>     |pmod(a, b)|
>     +--+
>     |       NaN|
>     |       NaN|
>     |       1.0|
>     |       NaN|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +--+
> Got:
>     +--+
>     |pmod(a, b)|
>     +--+
>     |      null|
>     |      null|
>     |       1.0|
>     |      null|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +--+
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41851) Fix `nanvl` function

2023-01-02 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653751#comment-17653751
 ] 

Sandeep Singh commented on SPARK-41851:
---

[~podongfeng] 
{code:java}
>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)],
...     ("a", "b"))
>>> df.select(nanvl("a", "b").alias("r1"),
...     nanvl(df.a, df.b).alias("r2")).collect() {code}

> Fix `nanvl` function
> 
>
> Key: SPARK-41851
> URL: https://issues.apache.org/jira/browse/SPARK-41851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 313, in pyspark.sql.connect.functions.nanvl
> Failed example:
>     df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
> df.b).alias("r2")).collect()
> Expected:
>     [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
> Got:
>     [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries

2023-01-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653752#comment-17653752
 ] 

Hyukjin Kwon commented on SPARK-39995:
--

Regarding:

{quote}
Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI 
but not possible with package managers like Poetry.
{quote}

We can't do this because of an issue in pip itself; see SPARK-32837.

> PySpark installation doesn't support Scala 2.13 binaries
> 
>
> Key: SPARK-39995
> URL: https://issues.apache.org/jira/browse/SPARK-39995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Oleksandr Shevchenko
>Priority: Major
>
> [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary 
> [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi]
>  for Scala 2.13.
> Currently, the setup 
> [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py]
>  allows setting the Spark version, the Hadoop version (PYSPARK_HADOOP_VERSION), 
> and the mirror (PYSPARK_RELEASE_MIRROR) used to download the Spark binaries, 
> but it always downloads Scala 2.12 compatible binaries. There is no parameter 
> to download "spark-3.3.0-bin-hadoop3-scala2.13.tgz".
> It's possible to download Spark manually and set SPARK_HOME accordingly, but 
> that is hard to use with pip or Poetry.
> Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI 
> but not possible with package managers like Poetry.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries

2023-01-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653753#comment-17653753
 ] 

Hyukjin Kwon commented on SPARK-39995:
--

I think I will be able to pick this up before Spark 3.4.

> PySpark installation doesn't support Scala 2.13 binaries
> 
>
> Key: SPARK-39995
> URL: https://issues.apache.org/jira/browse/SPARK-39995
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Oleksandr Shevchenko
>Priority: Major
>
> [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary 
> [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi]
>  for Scala 2.13.
> Currently, the setup 
> [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py]
>  allows setting the Spark version, the Hadoop version (PYSPARK_HADOOP_VERSION), 
> and the mirror (PYSPARK_RELEASE_MIRROR) used to download the Spark binaries, 
> but it always downloads Scala 2.12 compatible binaries. There is no parameter 
> to download "spark-3.3.0-bin-hadoop3-scala2.13.tgz".
> It's possible to download Spark manually and set SPARK_HOME accordingly, but 
> that is hard to use with pip or Poetry.
> Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and CLI 
> but not possible with package managers like Poetry.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41854) Automatic reformat/check python/setup.py

2023-01-02 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41854:


 Summary: Automatic reformat/check python/setup.py 
 Key: SPARK-41854
 URL: https://issues.apache.org/jira/browse/SPARK-41854
 Project: Spark
  Issue Type: Test
  Components: Build, PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


python/setup.py should also be reformatted via ./dev/reformat-python



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41854) Automatic reformat/check python/setup.py

2023-01-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41854:


Assignee: (was: Apache Spark)

> Automatic reformat/check python/setup.py 
> -
>
> Key: SPARK-41854
> URL: https://issues.apache.org/jira/browse/SPARK-41854
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> python/setup.py should also be reformatted via ./dev/reformat-python



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41854) Automatic reformat/check python/setup.py

2023-01-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653756#comment-17653756
 ] 

Apache Spark commented on SPARK-41854:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39352

> Automatic reformat/check python/setup.py 
> -
>
> Key: SPARK-41854
> URL: https://issues.apache.org/jira/browse/SPARK-41854
> Project: Spark
>  Issue Type: Test
>  Components: Build, PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> python/setup.py should also be reformatted via ./dev/reformat-python



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


