[jira] [Commented] (SPARK-42002) Implement DataFrameWriterV2 (ReadwriterV2Tests)

2023-01-15 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677094#comment-17677094
 ] 

Sandeep Singh commented on SPARK-42002:
---

I'm working on this

> Implement DataFrameWriterV2 (ReadwriterV2Tests)
> ---
>
> Key: SPARK-42002
> URL: https://issues.apache.org/jira/browse/SPARK-42002
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> pyspark/sql/tests/test_readwriter.py:182 (ReadwriterV2ParityTests.test_api)
> self = <ReadwriterV2ParityTests testMethod=test_api>
> def test_api(self):
> df = self.df
> >   writer = df.writeTo("testcat.t")
> ../test_readwriter.py:185: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = DataFrame[key: bigint, value: string], args = ('testcat.t',), kwargs = 
> {}
> def writeTo(self, *args: Any, **kwargs: Any) -> None:
> >   raise NotImplementedError("writeTo() is not implemented.")
> E   NotImplementedError: writeTo() is not implemented.
> ../../connect/dataframe.py:1529: NotImplementedError
> {code}
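
For context, a minimal sketch (not taken from the issue) of the classic PySpark DataFrameWriterV2 surface that df.writeTo() has to return; the catalog name "testcat" is assumed to be registered beforehand, as the parity test does:

{code:python}
from pyspark.sql import SparkSession

# Assumes a V2 catalog named "testcat" has already been configured via
# spark.sql.catalog.testcat; without it, create()/append() below would fail.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "v1"), (2, "v2")], ["key", "value"])

writer = df.writeTo("testcat.t")              # returns a DataFrameWriterV2
writer.using("parquet").create()              # CREATE TABLE testcat.t
df.writeTo("testcat.t").append()              # append rows to the table
df.writeTo("testcat.t").overwritePartitions() # dynamic partition overwrite
{code}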






[jira] [Created] (SPARK-42073) Enable pyspark.sql.tests.test_types 2 test cases

2023-01-15 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-42073:
-

 Summary: Enable pyspark.sql.tests.test_types 2 test cases
 Key: SPARK-42073
 URL: https://issues.apache.org/jira/browse/SPARK-42073
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Tests
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Hyukjin Kwon
 Fix For: 3.4.0









[jira] [Commented] (SPARK-42012) Implement DataFrameReader.orc

2023-01-13 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676751#comment-17676751
 ] 

Sandeep Singh commented on SPARK-42012:
---

Working on this.

> Implement DataFrameReader.orc
> -
>
> Key: SPARK-42012
> URL: https://issues.apache.org/jira/browse/SPARK-42012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> pyspark/sql/tests/test_datasources.py:114 
> (DataSourcesParityTests.test_read_multiple_orc_file)
> self = <DataSourcesParityTests testMethod=test_read_multiple_orc_file>
> def test_read_multiple_orc_file(self):
> >   df = self.spark.read.orc(
> [
> "python/test_support/sql/orc_partitioned/b=0/c=0",
> "python/test_support/sql/orc_partitioned/b=1/c=1",
> ]
> )
> ../test_datasources.py:116: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = <pyspark.sql.connect.readwriter.DataFrameReader object at 0x7fb170946b50>
> args = (['python/test_support/sql/orc_partitioned/b=0/c=0', 
> 'python/test_support/sql/orc_partitioned/b=1/c=1'],)
> kwargs = {}
> def orc(self, *args: Any, **kwargs: Any) -> None:
> >   raise NotImplementedError("orc() is not implemented.")
> E   NotImplementedError: orc() is not implemented.
> ../../connect/readwriter.py:228: NotImplementedError
> {code}
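
For context, a minimal sketch (not taken from the issue) of the classic DataFrameReader.orc surface the parity test exercises; the fixture paths are the ones shown in the failing test:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Classic PySpark accepts a single path, a list of paths, or several
# positional path arguments.
df = spark.read.orc(
    [
        "python/test_support/sql/orc_partitioned/b=0/c=0",
        "python/test_support/sql/orc_partitioned/b=1/c=1",
    ]
)
df.printSchema()

# Reader options are part of the same surface, e.g. schema merging:
merged = spark.read.option("mergeSchema", "true").orc(
    "python/test_support/sql/orc_partitioned"
)
{code}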






[jira] [Updated] (SPARK-41820) DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed

2023-01-06 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41820:
--
Description: 
{code:java}
>>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", 
>>> "name"])
>>> df.createOrReplaceGlobalTempView("people") {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1292, in 
pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView
Failed example:
    df2.createOrReplaceGlobalTempView("people")
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", 
line 1, in 
        df2.createOrReplaceGlobalTempView("people")
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1192, in createOrReplaceGlobalTempView
        self._session.client.execute_command(command)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
459, in execute_command
        self._execute(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
547, in _execute
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
625, in _handle_error
        raise SparkConnectException(status.message) from None
    pyspark.sql.connect.client.SparkConnectException: requirement failed 

{code}
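
For reference, a minimal sketch of the behaviour the classic API provides and the Connect client needs to match (the SQL query is illustrative, not from the report):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])

# The view is registered in the reserved `global_temp` database and is
# shared across sessions of the same application.
df.createOrReplaceGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()

# Re-registering under the same name replaces the previous definition.
df.filter(df.age > 3).createOrReplaceGlobalTempView("people")
{code}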

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1292, in 
pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView
Failed example:
    df2.createOrReplaceGlobalTempView("people")
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", 
line 1, in 
        df2.createOrReplaceGlobalTempView("people")
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1192, in createOrReplaceGlobalTempView
        self._session.client.execute_command(command)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
459, in execute_command
        self._execute(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
547, in _execute
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
625, in _handle_error
        raise SparkConnectException(status.message) from None
    pyspark.sql.connect.client.SparkConnectException: requirement failed {code}


> DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement 
> failed
> ---
>
> Key: SPARK-41820
> URL: https://issues.apache.org/jira/browse/SPARK-41820
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", 
> >>> "name"])
> >>> df.createOrReplaceGlobalTempView("people") {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1292, in 
> pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView
> Failed example:
>     df2.createOrReplaceGlobalTempView("people")
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView[3]>", line 1, in <module>
>         df2.createOrReplaceGlobalTempView("people")
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1192, in createOrReplaceGlobalTempView
>         self._session.client.execute_command(command)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 459, in execute_command
>         self._execute(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 547, in _execute
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 625, in _handle_error
>         raise SparkConnectException(statu

[jira] [Created] (SPARK-41922) Implement DataFrame `semanticHash`

2023-01-06 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41922:
-

 Summary: Implement DataFrame `semanticHash`
 Key: SPARK-41922
 URL: https://issues.apache.org/jira/browse/SPARK-41922
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh









[jira] [Commented] (SPARK-41874) Implement DataFrame `sameSemantics`

2023-01-06 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655329#comment-17655329
 ] 

Sandeep Singh commented on SPARK-41874:
---

Working on this

> Implement DataFrame `sameSemantics`
> ---
>
> Key: SPARK-41874
> URL: https://issues.apache.org/jira/browse/SPARK-41874
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>







[jira] [Updated] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark

2023-01-06 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41824:
--
Description: 
{code:java}
df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", 
"name"]) 
df.explain()
df.explain(True)
df.explain(mode="formatted")
df.explain("cost"){code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
Failed example:
    df.explain()
Expected:
    == Physical Plan ==
    *(1) Scan ExistingRDD[age...,name...]
Got:
    == Physical Plan ==
    LocalTableScan [age#1148L, name#1149]
    
    
**
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
Failed example:
    df.explain(mode="formatted")
Expected:
    == Physical Plan ==
    * Scan ExistingRDD (...)
    (1) Scan ExistingRDD [codegen id : ...]
    Output [2]: [age..., name...]
    ...
Got:
    == Physical Plan ==
    LocalTableScan (1)
    
    
    (1) LocalTableScan
    Output [2]: [age#1170L, name#1171]
    Arguments: [age#1170L, name#1171]
    
    {code}
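
For reference, the call forms covered by the doctest, as classic PySpark accepts them (a sketch, not from the report):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])

df.explain()                  # simple physical plan
df.explain(True)              # extended: parsed, analyzed, optimized, physical
df.explain(mode="formatted")  # numbered operators followed by a details section
df.explain("cost")            # plan annotated with statistics where available
# Classic PySpark accepts the modes "simple", "extended", "codegen",
# "cost" and "formatted"; Connect should render the same shapes.
{code}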

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
Failed example:
    df.explain()
Expected:
    == Physical Plan ==
    *(1) Scan ExistingRDD[age...,name...]
Got:
    == Physical Plan ==
    LocalTableScan [age#1148L, name#1149]
    
    
**
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
Failed example:
    df.explain(mode="formatted")
Expected:
    == Physical Plan ==
    * Scan ExistingRDD (...)
    (1) Scan ExistingRDD [codegen id : ...]
    Output [2]: [age..., name...]
    ...
Got:
    == Physical Plan ==
    LocalTableScan (1)
    
    
    (1) LocalTableScan
    Output [2]: [age#1170L, name#1171]
    Arguments: [age#1170L, name#1171]
    
    {code}


> Implement DataFrame.explain format to be similar to PySpark
> ---
>
> Key: SPARK-41824
> URL: https://issues.apache.org/jira/browse/SPARK-41824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", 
> "name"]) 
> df.explain()
> df.explain(True)
> df.explain(mode="formatted")
> df.explain("cost"){code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
>     
>     
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>     
>     
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
>     
>     {code}






[jira] [Commented] (SPARK-41824) Implement DataFrame.explain format to be similar to PySpark

2023-01-06 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655293#comment-17655293
 ] 

Sandeep Singh commented on SPARK-41824:
---

This is from the doctests:

`./python/run-tests --testnames 'pyspark.sql.connect.dataframe'`

> Implement DataFrame.explain format to be similar to PySpark
> ---
>
> Key: SPARK-41824
> URL: https://issues.apache.org/jira/browse/SPARK-41824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1296, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain()
> Expected:
>     == Physical Plan ==
>     *(1) Scan ExistingRDD[age...,name...]
> Got:
>     == Physical Plan ==
>     LocalTableScan [age#1148L, name#1149]
>     
>     
> **
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1314, in pyspark.sql.connect.dataframe.DataFrame.explain
> Failed example:
>     df.explain(mode="formatted")
> Expected:
>     == Physical Plan ==
>     * Scan ExistingRDD (...)
>     (1) Scan ExistingRDD [codegen id : ...]
>     Output [2]: [age..., name...]
>     ...
> Got:
>     == Physical Plan ==
>     LocalTableScan (1)
>     
>     
>     (1) LocalTableScan
>     Output [2]: [age#1170L, name#1171]
>     Arguments: [age#1170L, name#1171]
>     
>     {code}






[jira] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-01-05 Thread Sandeep Singh (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41818 ]


Sandeep Singh deleted comment on SPARK-41818:
---

was (Author: techaddict):
Could be moved under https://issues.apache.org/jira/browse/SPARK-41279 

> Support DataFrameWriter.saveAsTable
> ---
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", 
> line 369, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
>     df.write.saveAsTable("tblA")
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[2]>", line 1, in <module>
>         df.write.saveAsTable("tblA")
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", 
> line 350, in saveAsTable
>         
> self._spark.client.execute_command(self._write.command(self._spark.client))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 459, in execute_command
>         self._execute(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 547, in _execute
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: 
> (java.lang.ClassNotFoundException) .DefaultSource{code}
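
For context, a minimal sketch (not taken from the issue) of the classic saveAsTable behaviour the doctest relies on; the explicit "parquet" format is only an example:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).withColumnRenamed("id", "value")

# With no explicit format, classic PySpark falls back to
# spark.sql.sources.default (parquet by default).
df.write.saveAsTable("tblA")

# Equivalent with the source spelled out.
df.write.format("parquet").mode("overwrite").saveAsTable("tblA")

spark.sql("DROP TABLE IF EXISTS tblA")
{code}

The `.DefaultSource` in the ClassNotFoundException above suggests the Connect write command reached the server with an empty data source name rather than the session default.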






[jira] [Created] (SPARK-41921) Enable doctests in connect.column and connect.functions

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41921:
-

 Summary: Enable doctests in connect.column and connect.functions
 Key: SPARK-41921
 URL: https://issues.apache.org/jira/browse/SPARK-41921
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Sandeep Singh
 Fix For: 3.4.0









[jira] [Created] (SPARK-41907) Function `sampleby` return parity

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41907:
-

 Summary: Function `sampleby` return parity
 Key: SPARK-41907
 URL: https://issues.apache.org/jira/browse/SPARK-41907
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.df
from pyspark.sql import functions

rnd = df.select("key", functions.rand()).collect()
for row in rnd:
assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
rndn = df.select("key", functions.randn(5)).collect()
for row in rndn:
assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]

# If the specified seed is 0, we should use it.
# https://issues.apache.org/jira/browse/SPARK-9691
rnd1 = df.select("key", functions.rand(0)).collect()
rnd2 = df.select("key", functions.rand(0)).collect()
self.assertEqual(sorted(rnd1), sorted(rnd2))

rndn1 = df.select("key", functions.randn(0)).collect()
rndn2 = df.select("key", functions.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 299, in test_rand_functions
rnd = df.select("key", functions.rand()).collect()
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2917, in select
jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2537, in _jcols
return self._jseq(cols, _to_java_column)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2524, in _jseq
return _to_seq(self.sparkSession._sc, cols, converter)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in _to_seq
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in 
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
65, in _to_java_column
raise TypeError(
TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
{code}






[jira] [Updated] (SPARK-41907) Function `sampleby` return parity

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41907:
--
Description: 
{code:java}
df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
self.assertTrue(sampled.count() == 35){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 202, in test_sampleby
self.assertTrue(sampled.count() == 35)
AssertionError: False is not true {code}
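
For reference, a sketch of the stratified-sampling semantics being checked (the expected count of 35 is what the classic test asserts for this seed):

{code:python}
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])

# Keep roughly 50% of the rows in strata b == 0 and b == 1; rows with
# b == 2 have no fraction and are dropped entirely.
sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
print(sampled.count())  # classic PySpark yields 35 here; parity requires
                        # the Connect client to sample identically.
{code}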

  was:
{code:java}
df = self.df
from pyspark.sql import functions

rnd = df.select("key", functions.rand()).collect()
for row in rnd:
assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
rndn = df.select("key", functions.randn(5)).collect()
for row in rndn:
assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]

# If the specified seed is 0, we should use it.
# https://issues.apache.org/jira/browse/SPARK-9691
rnd1 = df.select("key", functions.rand(0)).collect()
rnd2 = df.select("key", functions.rand(0)).collect()
self.assertEqual(sorted(rnd1), sorted(rnd2))

rndn1 = df.select("key", functions.randn(0)).collect()
rndn2 = df.select("key", functions.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 299, in test_rand_functions
rnd = df.select("key", functions.rand()).collect()
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2917, in select
jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2537, in _jcols
return self._jseq(cols, _to_java_column)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2524, in _jseq
return _to_seq(self.sparkSession._sc, cols, converter)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in _to_seq
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in 
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
65, in _to_java_column
raise TypeError(
TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
{code}


> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}






[jira] [Updated] (SPARK-41906) Handle Function `rand() `

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41906:
--
Description: 
{code:java}
df = self.df
from pyspark.sql import functions

rnd = df.select("key", functions.rand()).collect()
for row in rnd:
assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
rndn = df.select("key", functions.randn(5)).collect()
for row in rndn:
assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]

# If the specified seed is 0, we should use it.
# https://issues.apache.org/jira/browse/SPARK-9691
rnd1 = df.select("key", functions.rand(0)).collect()
rnd2 = df.select("key", functions.rand(0)).collect()
self.assertEqual(sorted(rnd1), sorted(rnd2))

rndn1 = df.select("key", functions.randn(0)).collect()
rndn2 = df.select("key", functions.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 299, in test_rand_functions
rnd = df.select("key", functions.rand()).collect()
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2917, in select
jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2537, in _jcols
return self._jseq(cols, _to_java_column)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/dataframe.py", 
line 2524, in _jseq
return _to_seq(self.sparkSession._sc, cols, converter)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in _to_seq
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
86, in 
cols = [converter(c) for c in cols]
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/column.py", line 
65, in _to_java_column
raise TypeError(
TypeError: Invalid argument, not a string or column: Column<'rand()'> of type <class 'pyspark.sql.connect.column.Column'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
{code}
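
For reference, a sketch of the rand()/randn() contract the test checks (the column name "key" mirrors the test fixture):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "key")

# rand() is uniform on [0.0, 1.0); randn() follows the standard normal.
rows = df.select("key", F.rand()).collect()
assert all(0.0 <= r[1] < 1.0 for r in rows)

# An explicit seed, including 0, must be honoured so repeated runs match.
rnd1 = sorted(df.select("key", F.rand(0)).collect())
rnd2 = sorted(df.select("key", F.rand(0)).collect())
assert rnd1 == rnd2
{code}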

  was:
{code:java}
df = self.spark.createDataFrame(
[
(
[1, 2, 3],
2,
2,
),
(
[4, 5],
2,
2,
),
],
["x", "index", "len"],
)

expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
self.assertTrue(
all(
[
df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected,
df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == 
expected,
df.select(slice("x", "index", "len").alias("sliced")).collect() == 
expected,
]
)
)

self.assertEqual(
df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
[Row(sliced=[2]), Row(sliced=[4])],
)
self.assertEqual(
df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
[Row(sliced=[1, 2]), Row(sliced=[4])],
){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 596, in test_slice
df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
332, in wrapped
return getattr(functions, f.__name__)(*args, **kwargs)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1525, in slice
raise TypeError(f"start should be a Column or int, but got 
{type(start).__name__}")
TypeError: start should be a Column or int, but got str{code}


> Handle Function `rand() `
> -
>
> Key: SPARK-41906
> URL: https://issues.apache.org/jira/browse/SPARK-41906
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.df
> from pyspark.sql import functions
> rnd = df.select("key", functions.rand()).collect()
> for row in rnd:
> assert row[1] >= 0.0 and row[1] <= 1.0, "got: %s" % row[1]
> rndn = df.select("key", functions.randn(5)).collect()
> for row in rndn:
> assert row[1] >= -4.0 and row[1] <= 4.0, "got: %s" % row[1]
> # If the specified seed is 0, we should use it.
> # https://issues.apache.org/jira/browse/SPARK-9691
> rnd1 = df.select("key", functions.rand(0)).collect()
> rnd2 = df.select("key", functions.rand(0)).collect()
> self.assertEqual(sorted(rnd1), sorted(rnd2))
> rndn1 = df.select("key", functions.randn(0)).collect()
> rndn2 = df.select("key", functions.randn(0)).collect()
> self.assertEqual(sorted(rndn1), sorted(rndn2)){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 29

[jira] [Created] (SPARK-41906) Handle Function `rand() `

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41906:
-

 Summary: Handle Function `rand() `
 Key: SPARK-41906
 URL: https://issues.apache.org/jira/browse/SPARK-41906
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.spark.createDataFrame(
[
(
[1, 2, 3],
2,
2,
),
(
[4, 5],
2,
2,
),
],
["x", "index", "len"],
)

expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
self.assertTrue(
all(
[
df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected,
df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == 
expected,
df.select(slice("x", "index", "len").alias("sliced")).collect() == 
expected,
]
)
)

self.assertEqual(
df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
[Row(sliced=[2]), Row(sliced=[4])],
)
self.assertEqual(
df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
[Row(sliced=[1, 2]), Row(sliced=[4])],
){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 596, in test_slice
df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
332, in wrapped
return getattr(functions, f.__name__)(*args, **kwargs)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1525, in slice
raise TypeError(f"start should be a Column or int, but got 
{type(start).__name__}")
TypeError: start should be a Column or int, but got str{code}






[jira] [Updated] (SPARK-41905) Function `slice` should handle string in params

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41905:
--
Summary: Function `slice` should handle string in params  (was: Function 
`slice` should expect string in params)

> Function `slice` should handle string in params
> ---
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> from pyspark.sql import Window
> from pyspark.sql.functions import nth_value
> df = self.spark.createDataFrame(
> [
> ("a", 0, None),
> ("a", 1, "x"),
> ("a", 2, "y"),
> ("a", 3, "z"),
> ("a", 4, None),
> ("b", 1, None),
> ("b", 2, None),
> ],
> schema=("key", "order", "value"),
> )
> w = Window.partitionBy("key").orderBy("order")
> rs = df.select(
> df.key,
> df.order,
> nth_value("value", 2).over(w),
> nth_value("value", 2, False).over(w),
> nth_value("value", 2, True).over(w),
> ).collect()
> expected = [
> ("a", 0, None, None, None),
> ("a", 1, "x", "x", None),
> ("a", 2, "x", "x", "y"),
> ("a", 3, "x", "x", "y"),
> ("a", 4, "x", "x", "y"),
> ("b", 1, None, None, None),
> ("b", 2, None, None, None),
> ]
> for r, ex in zip(sorted(rs), sorted(expected)):
> self.assertEqual(tuple(r), ex[: len(r)]){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 755, in test_nth_value
> self.assertEqual(tuple(r), ex[: len(r)])
> AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')
> First differing element 3:
> None
> 'x'
> - ('a', 1, 'x', None)
> ?   
> + ('a', 1, 'x', 'x')
> ?   ^^^
>  {code}






[jira] [Updated] (SPARK-41905) Function `slice` should handle string in params

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41905:
--
Description: 
{code:java}
df = self.spark.createDataFrame(
[
(
[1, 2, 3],
2,
2,
),
(
[4, 5],
2,
2,
),
],
["x", "index", "len"],
)

expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
self.assertTrue(
all(
[
df.select(slice(df.x, 2, 2).alias("sliced")).collect() == expected,
df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() == 
expected,
df.select(slice("x", "index", "len").alias("sliced")).collect() == 
expected,
]
)
)

self.assertEqual(
df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
[Row(sliced=[2]), Row(sliced=[4])],
)
self.assertEqual(
df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
[Row(sliced=[1, 2]), Row(sliced=[4])],
){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 596, in test_slice
df.select(slice("x", "index", "len").alias("sliced")).collect() == expected,
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
332, in wrapped
return getattr(functions, f.__name__)(*args, **kwargs)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1525, in slice
raise TypeError(f"start should be a Column or int, but got 
{type(start).__name__}")
TypeError: start should be a Column or int, but got str{code}
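
For reference, a sketch of the classic behaviour the title refers to: string arguments to slice() are resolved as column names, so all three spellings below are interchangeable (the last one is the case the Connect client currently rejects):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, slice

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 3], 2, 2), ([4, 5], 2, 2)], ["x", "index", "len"])

df.select(slice(df.x, 2, 2).alias("sliced")).show()            # int literals
df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).show()  # Column literals
df.select(slice("x", "index", "len").alias("sliced")).show()   # column names
{code}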

  was:
{code:java}
from pyspark.sql import Window
from pyspark.sql.functions import nth_value

df = self.spark.createDataFrame(
[
("a", 0, None),
("a", 1, "x"),
("a", 2, "y"),
("a", 3, "z"),
("a", 4, None),
("b", 1, None),
("b", 2, None),
],
schema=("key", "order", "value"),
)
w = Window.partitionBy("key").orderBy("order")

rs = df.select(
df.key,
df.order,
nth_value("value", 2).over(w),
nth_value("value", 2, False).over(w),
nth_value("value", 2, True).over(w),
).collect()

expected = [
("a", 0, None, None, None),
("a", 1, "x", "x", None),
("a", 2, "x", "x", "y"),
("a", 3, "x", "x", "y"),
("a", 4, "x", "x", "y"),
("b", 1, None, None, None),
("b", 2, None, None, None),
]

for r, ex in zip(sorted(rs), sorted(expected)):
self.assertEqual(tuple(r), ex[: len(r)]){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 755, in test_nth_value
self.assertEqual(tuple(r), ex[: len(r)])
AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')

First differing element 3:
None
'x'

- ('a', 1, 'x', None)
?   

+ ('a', 1, 'x', 'x')
?   ^^^
 {code}


> Function `slice` should handle string in params
> ---
>
> Key: SPARK-41905
> URL: https://issues.apache.org/jira/browse/SPARK-41905
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (
> [1, 2, 3],
> 2,
> 2,
> ),
> (
> [4, 5],
> 2,
> 2,
> ),
> ],
> ["x", "index", "len"],
> )
> expected = [Row(sliced=[2, 3]), Row(sliced=[5])]
> self.assertTrue(
> all(
> [
> df.select(slice(df.x, 2, 2).alias("sliced")).collect() == 
> expected,
> df.select(slice(df.x, lit(2), lit(2)).alias("sliced")).collect() 
> == expected,
> df.select(slice("x", "index", "len").alias("sliced")).collect() 
> == expected,
> ]
> )
> )
> self.assertEqual(
> df.select(slice(df.x, size(df.x) - 1, lit(1)).alias("sliced")).collect(),
> [Row(sliced=[2]), Row(sliced=[4])],
> )
> self.assertEqual(
> df.select(slice(df.x, lit(1), size(df.x) - 1).alias("sliced")).collect(),
> [Row(sliced=[1, 2]), Row(sliced=[4])],
> ){code}
> {code:java}
>  Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 596, in test_slice
> df.select(slice("x", "index", "len").alias("sliced")).collect() == 
> expected,
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1525, in slice
> raise TypeError(f"start should be a Column or 

[jira] [Created] (SPARK-41905) Function `slice` should expect string in params

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41905:
-

 Summary: Function `slice` should expect string in params
 Key: SPARK-41905
 URL: https://issues.apache.org/jira/browse/SPARK-41905
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
from pyspark.sql import Window
from pyspark.sql.functions import nth_value

df = self.spark.createDataFrame(
[
("a", 0, None),
("a", 1, "x"),
("a", 2, "y"),
("a", 3, "z"),
("a", 4, None),
("b", 1, None),
("b", 2, None),
],
schema=("key", "order", "value"),
)
w = Window.partitionBy("key").orderBy("order")

rs = df.select(
df.key,
df.order,
nth_value("value", 2).over(w),
nth_value("value", 2, False).over(w),
nth_value("value", 2, True).over(w),
).collect()

expected = [
("a", 0, None, None, None),
("a", 1, "x", "x", None),
("a", 2, "x", "x", "y"),
("a", 3, "x", "x", "y"),
("a", 4, "x", "x", "y"),
("b", 1, None, None, None),
("b", 2, None, None, None),
]

for r, ex in zip(sorted(rs), sorted(expected)):
self.assertEqual(tuple(r), ex[: len(r)]){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 755, in test_nth_value
self.assertEqual(tuple(r), ex[: len(r)])
AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')

First differing element 3:
None
'x'

- ('a', 1, 'x', None)
?   

+ ('a', 1, 'x', 'x')
?   ^^^
 {code}






[jira] [Updated] (SPARK-41904) Fix Function `nth_value` functions output

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41904:
--
Summary: Fix Function `nth_value` functions output  (was: Fix `nth_value` 
functions output)

> Fix Function `nth_value` functions output
> -
>
> Key: SPARK-41904
> URL: https://issues.apache.org/jira/browse/SPARK-41904
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> from pyspark.sql import Window
> from pyspark.sql.functions import nth_value
> df = self.spark.createDataFrame(
> [
> ("a", 0, None),
> ("a", 1, "x"),
> ("a", 2, "y"),
> ("a", 3, "z"),
> ("a", 4, None),
> ("b", 1, None),
> ("b", 2, None),
> ],
> schema=("key", "order", "value"),
> )
> w = Window.partitionBy("key").orderBy("order")
> rs = df.select(
> df.key,
> df.order,
> nth_value("value", 2).over(w),
> nth_value("value", 2, False).over(w),
> nth_value("value", 2, True).over(w),
> ).collect()
> expected = [
> ("a", 0, None, None, None),
> ("a", 1, "x", "x", None),
> ("a", 2, "x", "x", "y"),
> ("a", 3, "x", "x", "y"),
> ("a", 4, "x", "x", "y"),
> ("b", 1, None, None, None),
> ("b", 2, None, None, None),
> ]
> for r, ex in zip(sorted(rs), sorted(expected)):
> self.assertEqual(tuple(r), ex[: len(r)]){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 755, in test_nth_value
> self.assertEqual(tuple(r), ex[: len(r)])
> AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')
> First differing element 3:
> None
> 'x'
> - ('a', 1, 'x', None)
> ?   
> + ('a', 1, 'x', 'x')
> ?   ^^^
>  {code}






[jira] [Updated] (SPARK-41904) Fix `nth_value` functions output

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41904:
--
Description: 
{code:java}
from pyspark.sql import Window
from pyspark.sql.functions import nth_value

df = self.spark.createDataFrame(
[
("a", 0, None),
("a", 1, "x"),
("a", 2, "y"),
("a", 3, "z"),
("a", 4, None),
("b", 1, None),
("b", 2, None),
],
schema=("key", "order", "value"),
)
w = Window.partitionBy("key").orderBy("order")

rs = df.select(
df.key,
df.order,
nth_value("value", 2).over(w),
nth_value("value", 2, False).over(w),
nth_value("value", 2, True).over(w),
).collect()

expected = [
("a", 0, None, None, None),
("a", 1, "x", "x", None),
("a", 2, "x", "x", "y"),
("a", 3, "x", "x", "y"),
("a", 4, "x", "x", "y"),
("b", 1, None, None, None),
("b", 2, None, None, None),
]

for r, ex in zip(sorted(rs), sorted(expected)):
self.assertEqual(tuple(r), ex[: len(r)]){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 755, in test_nth_value
self.assertEqual(tuple(r), ex[: len(r)])
AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')

First differing element 3:
None
'x'

- ('a', 1, 'x', None)
?   

+ ('a', 1, 'x', 'x')
?   ^^^
 {code}
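
For reference, a sketch of the nth_value() surface involved; the optional third argument is the ignoreNulls flag, which is where the Connect output diverges from classic PySpark in the failure above:

{code:python}
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import nth_value

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 0, None), ("a", 1, "x"), ("a", 2, "y")],
    schema=("key", "order", "value"),
)
w = Window.partitionBy("key").orderBy("order")

df.select(
    "key",
    "order",
    nth_value("value", 2).over(w).alias("second"),           # ignoreNulls defaults to False
    nth_value("value", 2, True).over(w).alias("second_nn"),  # skip nulls when counting
).show()
{code}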

  was:
{code:java}
from pyspark.sql.functions import flatten, struct, transform

df = self.spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as 
letters")

actual = df.select(
flatten(
transform(
"numbers",
lambda number: transform(
"letters", lambda letter: struct(number.alias("n"), 
letter.alias("l"))
),
)
)
).first()[0]

expected = [
(1, "a"),
(1, "b"),
(1, "c"),
(2, "a"),
(2, "b"),
(2, "c"),
(3, "a"),
(3, "b"),
(3, "c"),
]

self.assertEquals(actual, expected){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 809, in test_nested_higher_order_function
self.assertEquals(actual, expected)
AssertionError: Lists differ: [{'n': 'a', 'l': 'a'}, {'n': 'b', 'l': 'b'[151 
chars]'c'}] != [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ([43 chars]'c')]

First differing element 0:
{'n': 'a', 'l': 'a'}
(1, 'a')

- [{'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'}]
+ [(1, 'a'),
+  (1, 'b'),
+  (1, 'c'),
+  (2, 'a'),
+  (2, 'b'),
+  (2, 'c'),
+  (3, 'a'),
+  (3, 'b'),
+  (3, 'c')]
{code}


> Fix `nth_value` functions output
> 
>
> Key: SPARK-41904
> URL: https://issues.apache.org/jira/browse/SPARK-41904
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> from pyspark.sql import Window
> from pyspark.sql.functions import nth_value
> df = self.spark.createDataFrame(
> [
> ("a", 0, None),
> ("a", 1, "x"),
> ("a", 2, "y"),
> ("a", 3, "z"),
> ("a", 4, None),
> ("b", 1, None),
> ("b", 2, None),
> ],
> schema=("key", "order", "value"),
> )
> w = Window.partitionBy("key").orderBy("order")
> rs = df.select(
> df.key,
> df.order,
> nth_value("value", 2).over(w),
> nth_value("value", 2, False).over(w),
> nth_value("value", 2, True).over(w),
> ).collect()
> expected = [
> ("a", 0, None, None, None),
> ("a", 1, "x", "x", None),
> ("a", 2, "x", "x", "y"),
> ("a", 3, "x", "x", "y"),
> ("a", 4, "x", "x", "y"),
> ("b", 1, None, None, None),
> ("b", 2, None, None, None),
> ]
> for r, ex in zip(sorted(rs), sorted(expected)):
> self.assertEqual(tuple(r), ex[: len(r)]){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 755, in test_nth_value
> self.assertEqual(tuple(r), ex[: len(r)])
> AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')
> First differing element 3:
> None
> 'x'
> - ('a', 1, 'x', None)
> ?   
> + ('a', 1, 'x', 'x')
> ?   ^^^
>  {code}






[jira] [Created] (SPARK-41904) Fix `nth_value` functions output

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41904:
-

 Summary: Fix `nth_value` functions output
 Key: SPARK-41904
 URL: https://issues.apache.org/jira/browse/SPARK-41904
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
from pyspark.sql.functions import flatten, struct, transform

df = self.spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as 
letters")

actual = df.select(
flatten(
transform(
"numbers",
lambda number: transform(
"letters", lambda letter: struct(number.alias("n"), 
letter.alias("l"))
),
)
)
).first()[0]

expected = [
(1, "a"),
(1, "b"),
(1, "c"),
(2, "a"),
(2, "b"),
(2, "c"),
(3, "a"),
(3, "b"),
(3, "c"),
]

self.assertEquals(actual, expected){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 809, in test_nested_higher_order_function
self.assertEquals(actual, expected)
AssertionError: Lists differ: [{'n': 'a', 'l': 'a'}, {'n': 'b', 'l': 'b'[151 
chars]'c'}] != [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ([43 chars]'c')]

First differing element 0:
{'n': 'a', 'l': 'a'}
(1, 'a')

- [{'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'}]
+ [(1, 'a'),
+  (1, 'b'),
+  (1, 'c'),
+  (2, 'a'),
+  (2, 'b'),
+  (2, 'c'),
+  (3, 'a'),
+  (3, 'b'),
+  (3, 'c')]
{code}






[jira] [Updated] (SPARK-41902) Parity in String representation of higher_order_function's output

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41902:
--
Summary: Parity in String representation of higher_order_function's output  
(was: Parity in String representation of higher_order_function)

> Parity in String representation of higher_order_function's output
> -
>
> Key: SPARK-41902
> URL: https://issues.apache.org/jira/browse/SPARK-41902
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> from pyspark.sql.functions import flatten, struct, transform
> df = self.spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') 
> as letters")
> actual = df.select(
> flatten(
> transform(
> "numbers",
> lambda number: transform(
> "letters", lambda letter: struct(number.alias("n"), 
> letter.alias("l"))
> ),
> )
> )
> ).first()[0]
> expected = [
> (1, "a"),
> (1, "b"),
> (1, "c"),
> (2, "a"),
> (2, "b"),
> (2, "c"),
> (3, "a"),
> (3, "b"),
> (3, "c"),
> ]
> self.assertEquals(actual, expected){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 809, in test_nested_higher_order_function
> self.assertEquals(actual, expected)
> AssertionError: Lists differ: [{'n': 'a', 'l': 'a'}, {'n': 'b', 'l': 'b'[151 
> chars]'c'}] != [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ([43 chars]'c')]
> First differing element 0:
> {'n': 'a', 'l': 'a'}
> (1, 'a')
> - [{'l': 'a', 'n': 'a'},
> -  {'l': 'b', 'n': 'b'},
> -  {'l': 'c', 'n': 'c'},
> -  {'l': 'a', 'n': 'a'},
> -  {'l': 'b', 'n': 'b'},
> -  {'l': 'c', 'n': 'c'},
> -  {'l': 'a', 'n': 'a'},
> -  {'l': 'b', 'n': 'b'},
> -  {'l': 'c', 'n': 'c'}]
> + [(1, 'a'),
> +  (1, 'b'),
> +  (1, 'c'),
> +  (2, 'a'),
> +  (2, 'b'),
> +  (2, 'c'),
> +  (3, 'a'),
> +  (3, 'b'),
> +  (3, 'c')]
> {code}






[jira] [Updated] (SPARK-41902) Parity in String representation of higher_order_function

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41902:
--
Description: 
{code:java}
from pyspark.sql.functions import flatten, struct, transform

df = self.spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as 
letters")

actual = df.select(
flatten(
transform(
"numbers",
lambda number: transform(
"letters", lambda letter: struct(number.alias("n"), 
letter.alias("l"))
),
)
)
).first()[0]

expected = [
(1, "a"),
(1, "b"),
(1, "c"),
(2, "a"),
(2, "b"),
(2, "c"),
(3, "a"),
(3, "b"),
(3, "c"),
]

self.assertEquals(actual, expected){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 809, in test_nested_higher_order_function
self.assertEquals(actual, expected)
AssertionError: Lists differ: [{'n': 'a', 'l': 'a'}, {'n': 'b', 'l': 'b'[151 
chars]'c'}] != [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ([43 chars]'c')]

First differing element 0:
{'n': 'a', 'l': 'a'}
(1, 'a')

- [{'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'},
-  {'l': 'a', 'n': 'a'},
-  {'l': 'b', 'n': 'b'},
-  {'l': 'c', 'n': 'c'}]
+ [(1, 'a'),
+  (1, 'b'),
+  (1, 'c'),
+  (2, 'a'),
+  (2, 'b'),
+  (2, 'c'),
+  (3, 'a'),
+  (3, 'b'),
+  (3, 'c')]
{code}
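
The parity gap here is in how nested struct results are materialised: classic PySpark returns them as Row objects, and Row subclasses tuple, so the test's comparison against plain tuples passes; the Connect client returned plain dicts instead. A minimal illustration:

{code:python}
from pyspark.sql import Row

# Row is a tuple subclass; in Spark 3.x the keyword order is preserved,
# so it compares equal to the expected tuple.
r = Row(n=1, l="a")
assert isinstance(r, tuple)
assert r == (1, "a")

# A plain dict, as returned by the Connect client here, never compares
# equal to the tuple, which is exactly what the assertion error shows.
assert {"n": 1, "l": "a"} != (1, "a")
{code}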

  was:
{code:java}
expected = {"a": 1, "b": 2}
expected2 = {"c": 3, "d": 4}
df = self.spark.createDataFrame(
[(list(expected.keys()), list(expected.values()))], ["k", "v"]
)
actual = (
df.select(
expr("map('c', 3, 'd', 4) as dict2"),
map_from_arrays(df.k, df.v).alias("dict"),
"*",
)
.select(
map_contains_key("dict", "a").alias("one"),
map_contains_key("dict", "d").alias("not_exists"),
map_keys("dict").alias("keys"),
map_values("dict").alias("values"),
map_entries("dict").alias("items"),
"*",
)
.select(
map_concat("dict", "dict2").alias("merged"),
map_from_entries(arrays_zip("keys", "values")).alias("from_items"),
"*",
)
.first()
)
self.assertEqual(expected, actual["dict"]){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 1142, in test_map_functions
self.assertEqual(expected, actual["dict"])
AssertionError: {'a': 1, 'b': 2} != [('a', 1), ('b', 2)]{code}

Summary: Parity in String representation of higher_order_function  
(was: Fix String representation of maps created by `map_from_arrays`)

> Parity in String representation of higher_order_function
> 
>
> Key: SPARK-41902
> URL: https://issues.apache.org/jira/browse/SPARK-41902
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> from pyspark.sql.functions import flatten, struct, transform
> df = self.spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') 
> as letters")
> actual = df.select(
> flatten(
> transform(
> "numbers",
> lambda number: transform(
> "letters", lambda letter: struct(number.alias("n"), 
> letter.alias("l"))
> ),
> )
> )
> ).first()[0]
> expected = [
> (1, "a"),
> (1, "b"),
> (1, "c"),
> (2, "a"),
> (2, "b"),
> (2, "c"),
> (3, "a"),
> (3, "b"),
> (3, "c"),
> ]
> self.assertEquals(actual, expected){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 809, in test_nested_higher_order_function
> self.assertEquals(actual, expected)
> AssertionError: Lists differ: [{'n': 'a', 'l': 'a'}, {'n': 'b', 'l': 'b'[151 
> chars]'c'}] != [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ([43 chars]'c')]
> First differing element 0:
> {'n': 'a', 'l': 'a'}
> (1, 'a')
> - [{'l': 'a', 'n': 'a'},
> -  {'l': 'b', 'n': 'b'},
> -  {'l': 'c', 'n': 'c'},
> -  {'l': 'a', 'n': 'a'},
> -  {'l': 'b', 'n': 'b'},
> -  {'l': 'c', 'n': 'c'},
> -  {'l': 'a', 'n': 'a'},
> -  {'l': 'b', 'n': 'b'},
> -  {'l': 'c', 'n': 'c'}]
> + [(1, 'a'),
> +  (1, 'b'),
> +  (1, 'c'),
> +  (2, 'a'),
> +  (2, 'b'),
> +  (2, 'c'),
> +  (3, 'a'),
> +  (3, 'b'),
> +  (3, 'c')]
> {code}




[jira] [Updated] (SPARK-41903) Support data type ndarray

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41903:
--
Description: 
{code:java}
import numpy as np

arr_dtype_to_spark_dtypes = [
("int8", [("b", "array")]),
("int16", [("b", "array")]),
("int32", [("b", "array")]),
("int64", [("b", "array")]),
("float32", [("b", "array")]),
("float64", [("b", "array")]),
]
for t, expected_spark_dtypes in arr_dtype_to_spark_dtypes:
arr = np.array([1, 2]).astype(t)
self.assertEqual(
expected_spark_dtypes, 
self.spark.range(1).select(lit(arr).alias("b")).dtypes
)
arr = np.array([1, 2]).astype(np.uint)
with self.assertRaisesRegex(
TypeError, "The type of array scalar '%s' is not supported" % arr.dtype
):
self.spark.range(1).select(lit(arr).alias("b")){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 1100, in test_ndarray_input
expected_spark_dtypes, 
self.spark.range(1).select(lit(arr).alias("b")).dtypes
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
332, in wrapped
return getattr(functions, f.__name__)(*args, **kwargs)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 198, in lit
return Column(LiteralExpression._from_value(col))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", 
line 266, in _from_value
return LiteralExpression(value=value, 
dataType=LiteralExpression._infer_type(value))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", 
line 262, in _infer_type
raise ValueError(f"Unsupported Data Type {type(value).__name__}")
ValueError: Unsupported Data Type ndarray {code}
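
For reference, a sketch of the behaviour the parity test expects from classic PySpark: lit() on a NumPy array should yield an array column whose element type follows the NumPy dtype, and unsigned dtypes should be rejected:

{code:python}
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# e.g. int64 -> array<bigint>, float64 -> array<double>, per the mapping
# listed in the test above.
arr = np.array([1, 2]).astype("int64")
print(spark.range(1).select(lit(arr).alias("b")).dtypes)  # [('b', 'array<bigint>')]

# Unsigned dtypes are expected to raise a TypeError.
unsigned = np.array([1, 2]).astype(np.uint)
try:
    spark.range(1).select(lit(unsigned).alias("b"))
except TypeError as e:
    print(e)
{code}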

  was:
{code:java}
import numpy as np
from pyspark.sql.functions import lit

dtype_to_spark_dtypes = [
(np.int8, [("CAST(1 AS TINYINT)", "tinyint")]),
(np.int16, [("CAST(1 AS SMALLINT)", "smallint")]),
(np.int32, [("CAST(1 AS INT)", "int")]),
(np.int64, [("CAST(1 AS BIGINT)", "bigint")]),
(np.float32, [("CAST(1.0 AS FLOAT)", "float")]),
(np.float64, [("CAST(1.0 AS DOUBLE)", "double")]),
(np.bool_, [("true", "boolean")]),
]
for dtype, spark_dtypes in dtype_to_spark_dtypes:
self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, 
spark_dtypes){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 1064, in test_lit_np_scalar
self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, 
spark_dtypes)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
332, in wrapped
return getattr(functions, f.__name__)(*args, **kwargs)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 198, in lit
return Column(LiteralExpression._from_value(col))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", 
line 266, in _from_value
return LiteralExpression(value=value, 
dataType=LiteralExpression._infer_type(value))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", 
line 262, in _infer_type
raise ValueError(f"Unsupported Data Type {type(value).__name__}")
ValueError: Unsupported Data Type int8
{code}


> Support data type ndarray
> -
>
> Key: SPARK-41903
> URL: https://issues.apache.org/jira/browse/SPARK-41903
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> import numpy as np
> arr_dtype_to_spark_dtypes = [
> ("int8", [("b", "array")]),
> ("int16", [("b", "array")]),
> ("int32", [("b", "array")]),
> ("int64", [("b", "array")]),
> ("float32", [("b", "array")]),
> ("float64", [("b", "array")]),
> ]
> for t, expected_spark_dtypes in arr_dtype_to_spark_dtypes:
> arr = np.array([1, 2]).astype(t)
> self.assertEqual(
> expected_spark_dtypes, 
> self.spark.range(1).select(lit(arr).alias("b")).dtypes
> )
> arr = np.array([1, 2]).astype(np.uint)
> with self.assertRaisesRegex(
> TypeError, "The type of array scalar '%s' is not supported" % arr.dtype
> ):
> self.spark.range(1).select(lit(arr).alias("b")){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 1100, in test_ndarray_input
> expected_spark_dtypes, 
> self.spark.range(1).select(lit(arr).alias("b")).dtypes
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   

[jira] [Updated] (SPARK-41902) Fix String representation of maps created by `map_from_arrays`

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41902:
--
Description: 
{code:java}
expected = {"a": 1, "b": 2}
expected2 = {"c": 3, "d": 4}
df = self.spark.createDataFrame(
[(list(expected.keys()), list(expected.values()))], ["k", "v"]
)
actual = (
df.select(
expr("map('c', 3, 'd', 4) as dict2"),
map_from_arrays(df.k, df.v).alias("dict"),
"*",
)
.select(
map_contains_key("dict", "a").alias("one"),
map_contains_key("dict", "d").alias("not_exists"),
map_keys("dict").alias("keys"),
map_values("dict").alias("values"),
map_entries("dict").alias("items"),
"*",
)
.select(
map_concat("dict", "dict2").alias("merged"),
map_from_entries(arrays_zip("keys", "values")).alias("from_items"),
"*",
)
.first()
)
self.assertEqual(expected, actual["dict"]){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 1142, in test_map_functions
self.assertEqual(expected, actual["dict"])
AssertionError: {'a': 1, 'b': 2} != [('a', 1), ('b', 2)]{code}
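
The failing assertion above suggests the Connect client hands back MapType values as a list of (key, value) pairs. A minimal, purely illustrative sketch of the normalization a test could apply in the meantime:
{code:java}
# The shape observed in the failing assertion: entries as (key, value) pairs.
collected = [("a", 1), ("b", 2)]
# dict() recovers the mapping that classic PySpark returns directly.
assert dict(collected) == {"a": 1, "b": 2}
{code}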

  was:
{code:java}
from pyspark.sql import functions

funs = [
(functions.acosh, "ACOSH"),
(functions.asinh, "ASINH"),
(functions.atanh, "ATANH"),
]

cols = ["a", functions.col("a")]

for f, alias in funs:
for c in cols:
self.assertIn(f"{alias}(a)", repr(f(c))){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 271, in test_inverse_trig_functions
self.assertIn(f"{alias}(a)", repr(f(c)))
AssertionError: 'ACOSH(a)' not found in 
"Column<'acosh(ColumnReference(a))'>"{code}
 

 
{code:java}
from pyspark.sql.functions import col, lit, overlay
from itertools import chain
import re

actual = list(
chain.from_iterable(
[
re.findall("(overlay\\(.*\\))", str(x))
for x in [
overlay(col("foo"), col("bar"), 1),
overlay("x", "y", 3),
overlay(col("x"), col("y"), 1, 3),
overlay("x", "y", 2, 5),
overlay("x", "y", lit(11)),
overlay("x", "y", lit(2), lit(5)),
]
]
)
)

expected = [
"overlay(foo, bar, 1, -1)",
"overlay(x, y, 3, -1)",
"overlay(x, y, 1, 3)",
"overlay(x, y, 2, 5)",
"overlay(x, y, 11, -1)",
"overlay(x, y, 2, 5)",
]

self.assertListEqual(actual, expected)

df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", 
"pos", "len"))

exp = [Row(ol="SPARK_CORESQL")]
self.assertTrue(
all(
[
df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp,
df.select(overlay(df.x, df.y, lit(7), 
lit(0)).alias("ol")).collect() == exp,
df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == 
exp,
]
)
) {code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 675, in test_overlay
self.assertListEqual(actual, expected)
AssertionError: Lists differ: ['overlay(ColumnReference(foo), 
ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', 'overlay(x, 
y, 3, -1)'[90 chars] 5)']

First differing element 0:
'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))'
'overlay(foo, bar, 1, -1)'

- ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), 
Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(11), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))']
+ ['overlay(foo, bar, 1, -1)',
+  'overlay(x, y, 3, -1)',
+  'overlay(x, y, 1, 3)',
+  'overlay(x, y, 2, 5)',
+  'overlay(x, y, 11, -1)',
+  'overlay(x, y, 2, 5)']
 {code}


> Fix String representation of maps created by `map_from_arrays`
> --
>
> Key: SPARK-41902
> URL: https://issues.apache.org/jira/browse/SPARK-41902
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> expected = {"a": 1, "b": 2}
> expected2 = {"c": 3, "d": 4}
> df = self.spark.createDataFrame(
> [(list(expected.keys()), list(expected.values()))], ["k", "v"]
> )
> actual = (
> df.select(
> expr("map('c', 3, 'd', 4) as dict2"),
> map_from_arrays(df.

[jira] [Created] (SPARK-41903) Support data type ndarray

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41903:
-

 Summary: Support data type ndarray
 Key: SPARK-41903
 URL: https://issues.apache.org/jira/browse/SPARK-41903
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
import numpy as np
from pyspark.sql.functions import lit

dtype_to_spark_dtypes = [
(np.int8, [("CAST(1 AS TINYINT)", "tinyint")]),
(np.int16, [("CAST(1 AS SMALLINT)", "smallint")]),
(np.int32, [("CAST(1 AS INT)", "int")]),
(np.int64, [("CAST(1 AS BIGINT)", "bigint")]),
(np.float32, [("CAST(1.0 AS FLOAT)", "float")]),
(np.float64, [("CAST(1.0 AS DOUBLE)", "double")]),
(np.bool_, [("true", "boolean")]),
]
for dtype, spark_dtypes in dtype_to_spark_dtypes:
self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, 
spark_dtypes){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 1064, in test_lit_np_scalar
self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, 
spark_dtypes)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
332, in wrapped
return getattr(functions, f.__name__)(*args, **kwargs)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 198, in lit
return Column(LiteralExpression._from_value(col))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", 
line 266, in _from_value
return LiteralExpression(value=value, 
dataType=LiteralExpression._infer_type(value))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", 
line 262, in _infer_type
raise ValueError(f"Unsupported Data Type {type(value).__name__}")
ValueError: Unsupported Data Type int8
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41902) Fix String representation of maps created by `map_from_arrays`

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41902:
-

 Summary: Fix String representation of maps created by 
`map_from_arrays`
 Key: SPARK-41902
 URL: https://issues.apache.org/jira/browse/SPARK-41902
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
from pyspark.sql import functions

funs = [
(functions.acosh, "ACOSH"),
(functions.asinh, "ASINH"),
(functions.atanh, "ATANH"),
]

cols = ["a", functions.col("a")]

for f, alias in funs:
for c in cols:
self.assertIn(f"{alias}(a)", repr(f(c))){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 271, in test_inverse_trig_functions
self.assertIn(f"{alias}(a)", repr(f(c)))
AssertionError: 'ACOSH(a)' not found in 
"Column<'acosh(ColumnReference(a))'>"{code}
 

 
{code:java}
from pyspark.sql.functions import col, lit, overlay
from itertools import chain
import re

actual = list(
chain.from_iterable(
[
re.findall("(overlay\\(.*\\))", str(x))
for x in [
overlay(col("foo"), col("bar"), 1),
overlay("x", "y", 3),
overlay(col("x"), col("y"), 1, 3),
overlay("x", "y", 2, 5),
overlay("x", "y", lit(11)),
overlay("x", "y", lit(2), lit(5)),
]
]
)
)

expected = [
"overlay(foo, bar, 1, -1)",
"overlay(x, y, 3, -1)",
"overlay(x, y, 1, 3)",
"overlay(x, y, 2, 5)",
"overlay(x, y, 11, -1)",
"overlay(x, y, 2, 5)",
]

self.assertListEqual(actual, expected)

df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", 
"pos", "len"))

exp = [Row(ol="SPARK_CORESQL")]
self.assertTrue(
all(
[
df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp,
df.select(overlay(df.x, df.y, lit(7), 
lit(0)).alias("ol")).collect() == exp,
df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == 
exp,
]
)
) {code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 675, in test_overlay
self.assertListEqual(actual, expected)
AssertionError: Lists differ: ['overlay(ColumnReference(foo), 
ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', 'overlay(x, 
y, 3, -1)'[90 chars] 5)']

First differing element 0:
'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))'
'overlay(foo, bar, 1, -1)'

- ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), 
Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(11), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))']
+ ['overlay(foo, bar, 1, -1)',
+  'overlay(x, y, 3, -1)',
+  'overlay(x, y, 1, 3)',
+  'overlay(x, y, 2, 5)',
+  'overlay(x, y, 11, -1)',
+  'overlay(x, y, 2, 5)']
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41901) Parity in String representation of Column

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41901:
--
Description: 
{code:java}
from pyspark.sql import functions

funs = [
(functions.acosh, "ACOSH"),
(functions.asinh, "ASINH"),
(functions.atanh, "ATANH"),
]

cols = ["a", functions.col("a")]

for f, alias in funs:
for c in cols:
self.assertIn(f"{alias}(a)", repr(f(c))){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 271, in test_inverse_trig_functions
self.assertIn(f"{alias}(a)", repr(f(c)))
AssertionError: 'ACOSH(a)' not found in 
"Column<'acosh(ColumnReference(a))'>"{code}
 

 
{code:java}
from pyspark.sql.functions import col, lit, overlay
from itertools import chain
import re

actual = list(
chain.from_iterable(
[
re.findall("(overlay\\(.*\\))", str(x))
for x in [
overlay(col("foo"), col("bar"), 1),
overlay("x", "y", 3),
overlay(col("x"), col("y"), 1, 3),
overlay("x", "y", 2, 5),
overlay("x", "y", lit(11)),
overlay("x", "y", lit(2), lit(5)),
]
]
)
)

expected = [
"overlay(foo, bar, 1, -1)",
"overlay(x, y, 3, -1)",
"overlay(x, y, 1, 3)",
"overlay(x, y, 2, 5)",
"overlay(x, y, 11, -1)",
"overlay(x, y, 2, 5)",
]

self.assertListEqual(actual, expected)

df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", 
"pos", "len"))

exp = [Row(ol="SPARK_CORESQL")]
self.assertTrue(
all(
[
df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp,
df.select(overlay(df.x, df.y, lit(7), 
lit(0)).alias("ol")).collect() == exp,
df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == 
exp,
]
)
) {code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 675, in test_overlay
self.assertListEqual(actual, expected)
AssertionError: Lists differ: ['overlay(ColumnReference(foo), 
ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', 'overlay(x, 
y, 3, -1)'[90 chars] 5)']

First differing element 0:
'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))'
'overlay(foo, bar, 1, -1)'

- ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), 
Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(11), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))']
+ ['overlay(foo, bar, 1, -1)',
+  'overlay(x, y, 3, -1)',
+  'overlay(x, y, 1, 3)',
+  'overlay(x, y, 2, 5)',
+  'overlay(x, y, 11, -1)',
+  'overlay(x, y, 2, 5)']
 {code}

  was:
{code:java}
from pyspark.sql import functions

funs = [
(functions.acosh, "ACOSH"),
(functions.asinh, "ASINH"),
(functions.atanh, "ATANH"),
]

cols = ["a", functions.col("a")]

for f, alias in funs:
for c in cols:
self.assertIn(f"{alias}(a)", repr(f(c))){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 271, in test_inverse_trig_functions
self.assertIn(f"{alias}(a)", repr(f(c)))
AssertionError: 'ACOSH(a)' not found in 
"Column<'acosh(ColumnReference(a))'>"{code}
 

 
{code:java}
from pyspark.sql.functions import col, lit, overlay
from itertools import chain
import re

actual = list(
chain.from_iterable(
[
re.findall("(overlay\\(.*\\))", str(x))
for x in [
overlay(col("foo"), col("bar"), 1),
overlay("x", "y", 3),
overlay(col("x"), col("y"), 1, 3),
overlay("x", "y", 2, 5),
overlay("x", "y", lit(11)),
overlay("x", "y", lit(2), lit(5)),
]
]
)
)

expected = [
"overlay(foo, bar, 1, -1)",
"overlay(x, y, 3, -1)",
"overlay(x, y, 1, 3)",
"overlay(x, y, 2, 5)",
"overlay(x, y, 11, -1)",
"overlay(x, y, 2, 5)",
]

self.assertListEqual(actual, expected)

df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", 
"pos", "len"))

exp = [Row(ol="SPARK_CORESQL")]
self.assertTrue(
all(
[
df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp,
df.select(overlay(df.x, df.y, lit(7), 
lit(0)).alias("ol")).collect() == exp,
df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == 
exp,
]
)
) {code}
{code:ja

[jira] [Updated] (SPARK-41901) Parity in String representation of Column

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41901:
--
Description: 
{code:java}
from pyspark.sql import functions

funs = [
(functions.acosh, "ACOSH"),
(functions.asinh, "ASINH"),
(functions.atanh, "ATANH"),
]

cols = ["a", functions.col("a")]

for f, alias in funs:
for c in cols:
self.assertIn(f"{alias}(a)", repr(f(c))){code}
{code:java}
 Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 271, in test_inverse_trig_functions
self.assertIn(f"{alias}(a)", repr(f(c)))
AssertionError: 'ACOSH(a)' not found in 
"Column<'acosh(ColumnReference(a))'>"{code}
 

 
{code:java}
from pyspark.sql.functions import col, lit, overlay
from itertools import chain
import re

actual = list(
chain.from_iterable(
[
re.findall("(overlay\\(.*\\))", str(x))
for x in [
overlay(col("foo"), col("bar"), 1),
overlay("x", "y", 3),
overlay(col("x"), col("y"), 1, 3),
overlay("x", "y", 2, 5),
overlay("x", "y", lit(11)),
overlay("x", "y", lit(2), lit(5)),
]
]
)
)

expected = [
"overlay(foo, bar, 1, -1)",
"overlay(x, y, 3, -1)",
"overlay(x, y, 1, 3)",
"overlay(x, y, 2, 5)",
"overlay(x, y, 11, -1)",
"overlay(x, y, 2, 5)",
]

self.assertListEqual(actual, expected)

df = self.spark.createDataFrame([("SPARK_SQL", "CORE", 7, 0)], ("x", "y", 
"pos", "len"))

exp = [Row(ol="SPARK_CORESQL")]
self.assertTrue(
all(
[
df.select(overlay(df.x, df.y, 7, 0).alias("ol")).collect() == exp,
df.select(overlay(df.x, df.y, lit(7), 
lit(0)).alias("ol")).collect() == exp,
df.select(overlay("x", "y", "pos", "len").alias("ol")).collect() == 
exp,
]
)
) {code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 675, in test_overlay
    self.assertListEqual(actual, expected)
AssertionError: Lists differ: ['overlay(ColumnReference(foo), ColumnReference(bar[402 chars]5))'] != ['overlay(foo, bar, 1, -1)', 'overlay(x, y, 3, -1)'[90 chars] 5)']

First differing element 0:
'overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))'
'overlay(foo, bar, 1, -1)'

- ['overlay(ColumnReference(foo), ColumnReference(bar), Literal(1), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(3), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(1), Literal(3))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(11), Literal(-1))',
-  'overlay(ColumnReference(x), ColumnReference(y), Literal(2), Literal(5))']
+ ['overlay(foo, bar, 1, -1)',
+  'overlay(x, y, 3, -1)',
+  'overlay(x, y, 1, 3)',
+  'overlay(x, y, 2, 5)',
+  'overlay(x, y, 11, -1)',
+  'overlay(x, y, 2, 5)']
{code}
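
For side-by-side comparison while the two representations differ, the Connect rendering can be collapsed mechanically into the classic one. The regex below is only an illustrative normalization for test output, not a proposed fix to Column.__repr__:
{code:java}
import re

def normalize_column_repr(s):
    # Collapse "func(ColumnReference(x))" into "FUNC(x)" (illustrative only).
    return re.sub(
        r"(\w+)\(ColumnReference\((\w+)\)\)",
        lambda m: "%s(%s)" % (m.group(1).upper(), m.group(2)),
        s,
    )

assert normalize_column_repr("Column<'acosh(ColumnReference(a))'>") == "Column<'ACOSH(a)'>"
{code}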

  was:
{code:java}
dt = datetime.date(2021, 12, 27)

# Note; number var in Python gets converted to LongType column;
# this is not supported by the function, so cast to Integer explicitly
df = self.spark.createDataFrame([Row(date=dt, add=2)], "date date, add integer")

self.assertTrue(
all(
df.select(
date_add(df.date, df.add) == datetime.date(2021, 12, 29),
date_add(df.date, "add") == datetime.date(2021, 12, 29),
date_add(df.date, 3) == datetime.date(2021, 12, 30),
).first()
)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 391, in test_date_add_function
).first()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 246, in first
return self.head()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 310, in head
rs = self.head(1)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 312, in head
return self.take(n)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 317, in take
return self.limit(num).collect()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1076, in collect
table = self._session.client.to_table(query)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
414, in to_table
table, _ = self._execute_and_fetch(req)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
586, in _execute_and_fetch
self._handle_error(rpc_error)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
625, in _handle_error

[jira] [Created] (SPARK-41901) Parity in String representation of Column

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41901:
-

 Summary: Parity in String representation of Column
 Key: SPARK-41901
 URL: https://issues.apache.org/jira/browse/SPARK-41901
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
dt = datetime.date(2021, 12, 27)

# Note; number var in Python gets converted to LongType column;
# this is not supported by the function, so cast to Integer explicitly
df = self.spark.createDataFrame([Row(date=dt, add=2)], "date date, add integer")

self.assertTrue(
all(
df.select(
date_add(df.date, df.add) == datetime.date(2021, 12, 29),
date_add(df.date, "add") == datetime.date(2021, 12, 29),
date_add(df.date, 3) == datetime.date(2021, 12, 30),
).first()
)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 391, in test_date_add_function
).first()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 246, in first
return self.head()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 310, in head
rs = self.head(1)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 312, in head
return self.take(n)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 317, in take
return self.limit(num).collect()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1076, in collect
table = self._session.client.to_table(query)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
414, in to_table
table, _ = self._execute_and_fetch(req)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
586, in _execute_and_fetch
self._handle_error(rpc_error)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
625, in _handle_error
raise SparkConnectAnalysisException(
pyspark.sql.connect.client.SparkConnectAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "date_add(date, add)" 
due to data type mismatch: Parameter 2 requires the ("INT" or "SMALLINT" or 
"TINYINT") type, however "add" has the type "BIGINT".
Plan: 'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Project [unresolvedalias('`==`(date_add(date#753, add#754L), 
2021-12-29), None), unresolvedalias('`==`(date_add(date#753, add#754L), 
2021-12-29), None), (date_add(date#753, 3) = 2021-12-30) AS (date_add(date, 3) 
= DATE '2021-12-30')#759]
  +- Project [date#753, add#754L]
 +- Project [date#749 AS date#753, add#750L AS add#754L]
+- LocalRelation [date#749, add#750L]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41900) Support data type int8

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41900:
--
Description: 
{code:java}
import numpy as np
from pyspark.sql.functions import lit

dtype_to_spark_dtypes = [
(np.int8, [("CAST(1 AS TINYINT)", "tinyint")]),
(np.int16, [("CAST(1 AS SMALLINT)", "smallint")]),
(np.int32, [("CAST(1 AS INT)", "int")]),
(np.int64, [("CAST(1 AS BIGINT)", "bigint")]),
(np.float32, [("CAST(1.0 AS FLOAT)", "float")]),
(np.float64, [("CAST(1.0 AS DOUBLE)", "double")]),
(np.bool_, [("true", "boolean")]),
]
for dtype, spark_dtypes in dtype_to_spark_dtypes:
self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, 
spark_dtypes){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 1064, in test_lit_np_scalar
self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, 
spark_dtypes)
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
332, in wrapped
return getattr(functions, f.__name__)(*args, **kwargs)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 198, in lit
return Column(LiteralExpression._from_value(col))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", 
line 266, in _from_value
return LiteralExpression(value=value, 
dataType=LiteralExpression._infer_type(value))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py", 
line 262, in _infer_type
raise ValueError(f"Unsupported Data Type {type(value).__name__}")
ValueError: Unsupported Data Type int8
{code}
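
A possible interim approach is to convert the numpy scalar to the equivalent Python value and cast it to the matching Spark type. The sketch below targets the classic API only; lit_np_scalar and NUMPY_SCALAR_TO_SPARK are illustrative names, and the exact mapping is an assumption.
{code:java}
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Assumed mapping from numpy scalar types to Spark SQL type names.
NUMPY_SCALAR_TO_SPARK = {
    np.int8: "tinyint",
    np.int16: "smallint",
    np.int32: "int",
    np.int64: "bigint",
    np.float32: "float",
    np.float64: "double",
    np.bool_: "boolean",
}

def lit_np_scalar(value):
    spark_type = NUMPY_SCALAR_TO_SPARK.get(type(value))
    if spark_type is None:
        raise TypeError("Unsupported Data Type %s" % type(value).__name__)
    # .item() yields the equivalent Python value; the cast keeps the narrower
    # Spark type instead of letting the literal default to bigint/double.
    return lit(value.item()).cast(spark_type)

spark = SparkSession.builder.getOrCreate()
print(spark.range(1).select(lit_np_scalar(np.int8(1))).dtypes)  # [('CAST(1 AS TINYINT)', 'tinyint')]
{code}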

  was:
{code:java}
row = self.spark.createDataFrame([("Alice", None, None, None)], 
schema).fillna(True).first()
self.assertEqual(row.age, None){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 231, in test_fillna
    self.assertEqual(row.age, None)
AssertionError: nan != None{code}
 
{code:java}
row = (
self.spark.createDataFrame([("Alice", 10, None)], schema)
.replace(10, 20, subset=["name", "height"])
.first()
)
self.assertEqual(row.name, "Alice")
self.assertEqual(row.age, 10)
self.assertEqual(row.height, None) {code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace
    self.assertEqual(row.height, None)
AssertionError: nan != None
{code}


> Support data type int8
> --
>
> Key: SPARK-41900
> URL: https://issues.apache.org/jira/browse/SPARK-41900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> import numpy as np
> from pyspark.sql.functions import lit
> dtype_to_spark_dtypes = [
> (np.int8, [("CAST(1 AS TINYINT)", "tinyint")]),
> (np.int16, [("CAST(1 AS SMALLINT)", "smallint")]),
> (np.int32, [("CAST(1 AS INT)", "int")]),
> (np.int64, [("CAST(1 AS BIGINT)", "bigint")]),
> (np.float32, [("CAST(1.0 AS FLOAT)", "float")]),
> (np.float64, [("CAST(1.0 AS DOUBLE)", "double")]),
> (np.bool_, [("true", "boolean")]),
> ]
> for dtype, spark_dtypes in dtype_to_spark_dtypes:
> self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, 
> spark_dtypes){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 1064, in test_lit_np_scalar
> self.assertEqual(self.spark.range(1).select(lit(dtype(1))).dtypes, 
> spark_dtypes)
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/utils.py", line 
> 332, in wrapped
> return getattr(functions, f.__name__)(*args, **kwargs)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 198, in lit
> return Column(LiteralExpression._from_value(col))
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py",
>  line 266, in _from_value
> return LiteralExpression(value=value, 
> dataType=LiteralExpression._infer_type(value))
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/expressions.py",
>  line 262, in _infer_type
> raise ValueError(f"Unsupported Data Type {type(value).__name__}")
> ValueError: Unsupported Data Type int8
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41900) Support data type int8

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41900:
-

 Summary: Support data type int8
 Key: SPARK-41900
 URL: https://issues.apache.org/jira/browse/SPARK-41900
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
row = self.spark.createDataFrame([("Alice", None, None, None)], 
schema).fillna(True).first()
self.assertEqual(row.age, None){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 231, in test_fillna
    self.assertEqual(row.age, None)
AssertionError: nan != None{code}
 
{code:java}
row = (
self.spark.createDataFrame([("Alice", 10, None)], schema)
.replace(10, 20, subset=["name", "height"])
.first()
)
self.assertEqual(row.name, "Alice")
self.assertEqual(row.age, 10)
self.assertEqual(row.height, None) {code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace
    self.assertEqual(row.height, None)
AssertionError: nan != None
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41898:
--
Description: 
{code:java}
df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
["key", "value"])
w = Window.partitionBy("value").orderBy("key")
from pyspark.sql import functions as F

sel = df.select(
df.value,
df.key,
F.max("key").over(w.rowsBetween(0, 1)),
F.min("key").over(w.rowsBetween(0, 1)),
F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
F.row_number().over(w),
F.rank().over(w),
F.dense_rank().over(w),
F.ntile(2).over(w),
)
rs = sorted(sel.collect()){code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 821, in test_window_functions
    F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", line 152, in rowsBetween
    raise TypeError(f"start must be a int, but got {type(start).__name__}")
TypeError: start must be a int, but got float
{code}

  was:
{code:java}
from pyspark.sql.functions import assert_true

df = self.spark.range(3)

self.assertEqual(
df.select(assert_true(df.id < 3)).toDF("val").collect(),
[Row(val=None), Row(val=None), Row(val=None)],
)

with self.assertRaises(Py4JJavaError) as cm:
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
{code:java}
df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
["key", "value"])
w = Window.partitionBy("value").orderBy("key")
from pyspark.sql import functions as F

sel = df.select(
df.value,
df.key,
F.max("key").over(w.rowsBetween(0, 1)),
F.min("key").over(w.rowsBetween(0, 1)),
F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
F.row_number().over(w),
F.rank().over(w),
F.dense_rank().over(w),
F.ntile(2).over(w),
)
rs = sorted(sel.collect()){code}


> Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as 
> argument
> 
>
> Key: SPARK-41898
> URL: https://issues.apache.org/jira/browse/SPARK-41898
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
> ["key", "value"])
> w = Window.partitionBy("value").orderBy("key")
> from pyspark.sql import functions as F
> sel = df.select(
> df.value,
> df.key,
> F.max("key").over(w.rowsBetween(0, 1)),
> F.min("key").over(w.rowsBetween(0, 1)),
> F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
> F.row_number().over(w),
> F.rank().over(w),
> F.dense_rank().over(w),
> F.ntile(2).over(w),
> )
> rs = sorted(sel.collect()){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 821, in test_window_functions
>     F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/window.py", line 152, in rowsBetween
>     raise TypeError(f"start must be a int, but got {type(start).__name__}")
> TypeError: start must be a int, but got float
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41899) DataFrame.createDataFrame converting int to bigint

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41899:
--
Description: 
{code:java}
dt = datetime.date(2021, 12, 27)

# Note; number var in Python gets converted to LongType column;
# this is not supported by the function, so cast to Integer explicitly
df = self.spark.createDataFrame([Row(date=dt, add=2)], "date date, add integer")

self.assertTrue(
all(
df.select(
date_add(df.date, df.add) == datetime.date(2021, 12, 29),
date_add(df.date, "add") == datetime.date(2021, 12, 29),
date_add(df.date, 3) == datetime.date(2021, 12, 30),
).first()
)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 391, in test_date_add_function
).first()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 246, in first
return self.head()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 310, in head
rs = self.head(1)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 312, in head
return self.take(n)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 317, in take
return self.limit(num).collect()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1076, in collect
table = self._session.client.to_table(query)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
414, in to_table
table, _ = self._execute_and_fetch(req)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
586, in _execute_and_fetch
self._handle_error(rpc_error)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
625, in _handle_error
raise SparkConnectAnalysisException(
pyspark.sql.connect.client.SparkConnectAnalysisException: 
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "date_add(date, add)" 
due to data type mismatch: Parameter 2 requires the ("INT" or "SMALLINT" or 
"TINYINT") type, however "add" has the type "BIGINT".
Plan: 'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Project [unresolvedalias('`==`(date_add(date#753, add#754L), 
2021-12-29), None), unresolvedalias('`==`(date_add(date#753, add#754L), 
2021-12-29), None), (date_add(date#753, 3) = 2021-12-30) AS (date_add(date, 3) 
= DATE '2021-12-30')#759]
  +- Project [date#753, add#754L]
 +- Project [date#749 AS date#753, add#750L AS add#754L]
+- LocalRelation [date#749, add#750L]{code}
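
Until createDataFrame on Connect honours the declared INT type, an explicit cast on the column restores an analyzable expression. A minimal sketch of that workaround, assuming a running session named spark:
{code:java}
import datetime
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import date_add

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(date=datetime.date(2021, 12, 27), add=2)],
                           "date date, add integer")
# Force the column back to INT before date_add, so the analyzer no longer
# sees a BIGINT second argument (a no-op where the schema was respected).
df = df.withColumn("add", df.add.cast("int"))
print(df.select(date_add(df.date, df.add).alias("d")).first().d)  # 2021-12-29
{code}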

  was:
{code:java}
from pyspark.sql.functions import assert_true

df = self.spark.range(3)

self.assertEqual(
df.select(assert_true(df.id < 3)).toDF("val").collect(),
[Row(val=None), Row(val=None), Row(val=None)],
)

with self.assertRaises(Py4JJavaError) as cm:
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
{code:java}
df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
["key", "value"])
w = Window.partitionBy("value").orderBy("key")
from pyspark.sql import functions as F

sel = df.select(
df.value,
df.key,
F.max("key").over(w.rowsBetween(0, 1)),
F.min("key").over(w.rowsBetween(0, 1)),
F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
F.row_number().over(w),
F.rank().over(w),
F.dense_rank().over(w),
F.ntile(2).over(w),
)
rs = sorted(sel.collect()){code}


> DataFrame.createDataFrame converting int to bigint
> --
>
> Key: SPARK-41899
> URL: https://issues.apache.org/jira/browse/SPARK-41899
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> dt = datetime.date(2021, 12, 27)
> # Note; number var in Python gets converted to LongType column;
> # this is not supported by the function, so cast to Integer explicitly
> df = self.spark.createDataFrame([Row(date=dt, add=2)], "date date, add 
> integer")
> self.assertTrue(
> all(
> df.select(
> date_add(df.date, df.add) == datetime.date(2021, 12, 29),
> date_add(df.date, "add") == datetime.date(2021, 12, 29),
> date_add(df.date, 3) == datetime.date(2021, 12, 30),
> ).first()
> )
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 391, in test_date_add_function
> ).first()
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 246, in first
> return self.head()
>   File 
> "/Us

[jira] [Created] (SPARK-41899) DataFrame.createDataFrame converting int to bigint

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41899:
-

 Summary: DataFrame.createDataFrame converting int to bigint
 Key: SPARK-41899
 URL: https://issues.apache.org/jira/browse/SPARK-41899
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
from pyspark.sql.functions import assert_true

df = self.spark.range(3)

self.assertEqual(
df.select(assert_true(df.id < 3)).toDF("val").collect(),
[Row(val=None), Row(val=None), Row(val=None)],
)

with self.assertRaises(Py4JJavaError) as cm:
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
{code:java}
df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
["key", "value"])
w = Window.partitionBy("value").orderBy("key")
from pyspark.sql import functions as F

sel = df.select(
df.value,
df.key,
F.max("key").over(w.rowsBetween(0, 1)),
F.min("key").over(w.rowsBetween(0, 1)),
F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
F.row_number().over(w),
F.rank().over(w),
F.dense_rank().over(w),
F.ntile(2).over(w),
)
rs = sorted(sel.collect()){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41898:
--
Description: 
{code:java}
from pyspark.sql.functions import assert_true

df = self.spark.range(3)

self.assertEqual(
df.select(assert_true(df.id < 3)).toDF("val").collect(),
[Row(val=None), Row(val=None), Row(val=None)],
)

with self.assertRaises(Py4JJavaError) as cm:
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
{code:java}
df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
["key", "value"])
w = Window.partitionBy("value").orderBy("key")
from pyspark.sql import functions as F

sel = df.select(
df.value,
df.key,
F.max("key").over(w.rowsBetween(0, 1)),
F.min("key").over(w.rowsBetween(0, 1)),
F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
F.row_number().over(w),
F.rank().over(w),
F.dense_rank().over(w),
F.ntile(2).over(w),
)
rs = sorted(sel.collect()){code}

  was:
PySpark throws Py4JJavaError, whereas Spark Connect throws SparkConnectException.
{code:java}
from pyspark.sql.functions import assert_true

df = self.spark.range(3)

self.assertEqual(
df.select(assert_true(df.id < 3)).toDF("val").collect(),
[Row(val=None), Row(val=None), Row(val=None)],
)

with self.assertRaises(Py4JJavaError) as cm:
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 950, in test_assert_true
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1076, in collect
table = self._session.client.to_table(query)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
414, in to_table
table, _ = self._execute_and_fetch(req)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
586, in _execute_and_fetch
self._handle_error(rpc_error)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
629, in _handle_error
raise SparkConnectException(status.message, info.reason) from None
pyspark.sql.connect.client.SparkConnectException: (java.lang.RuntimeException) 
too big {code}


> Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as 
> argument
> 
>
> Key: SPARK-41898
> URL: https://issues.apache.org/jira/browse/SPARK-41898
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> from pyspark.sql.functions import assert_true
> df = self.spark.range(3)
> self.assertEqual(
> df.select(assert_true(df.id < 3)).toDF("val").collect(),
> [Row(val=None), Row(val=None), Row(val=None)],
> )
> with self.assertRaises(Py4JJavaError) as cm:
> df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
> {code:java}
> df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], 
> ["key", "value"])
> w = Window.partitionBy("value").orderBy("key")
> from pyspark.sql import functions as F
> sel = df.select(
> df.value,
> df.key,
> F.max("key").over(w.rowsBetween(0, 1)),
> F.min("key").over(w.rowsBetween(0, 1)),
> F.count("key").over(w.rowsBetween(float("-inf"), float("inf"))),
> F.row_number().over(w),
> F.rank().over(w),
> F.dense_rank().over(w),
> F.ntile(2).over(w),
> )
> rs = sorted(sel.collect()){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41898) Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41898:
-

 Summary: Window.rowsBetween should handle `float("-inf")` and 
`float("+inf")` as argument
 Key: SPARK-41898
 URL: https://issues.apache.org/jira/browse/SPARK-41898
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


PySpark throws Py4JJavaError, whereas Spark Connect throws SparkConnectException.
{code:java}
from pyspark.sql.functions import assert_true

df = self.spark.range(3)

self.assertEqual(
df.select(assert_true(df.id < 3)).toDF("val").collect(),
[Row(val=None), Row(val=None), Row(val=None)],
)

with self.assertRaises(Py4JJavaError) as cm:
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 950, in test_assert_true
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1076, in collect
table = self._session.client.to_table(query)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
414, in to_table
table, _ = self._execute_and_fetch(req)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
586, in _execute_and_fetch
self._handle_error(rpc_error)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
629, in _handle_error
raise SparkConnectException(status.message, info.reason) from None
pyspark.sql.connect.client.SparkConnectException: (java.lang.RuntimeException) 
too big {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41897) Parity in Error types between pyspark and connect functions

2023-01-05 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41897:
--
Description: 
PySpark throws Py4JJavaError, whereas Spark Connect throws SparkConnectException.
{code:java}
from pyspark.sql.functions import assert_true

df = self.spark.range(3)

self.assertEqual(
df.select(assert_true(df.id < 3)).toDF("val").collect(),
[Row(val=None), Row(val=None), Row(val=None)],
)

with self.assertRaises(Py4JJavaError) as cm:
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", 
line 950, in test_assert_true
df.select(assert_true(df.id < 2, "too big")).toDF("val").collect()
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1076, in collect
table = self._session.client.to_table(query)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
414, in to_table
table, _ = self._execute_and_fetch(req)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
586, in _execute_and_fetch
self._handle_error(rpc_error)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
629, in _handle_error
raise SparkConnectException(status.message, info.reason) from None
pyspark.sql.connect.client.SparkConnectException: (java.lang.RuntimeException) 
too big {code}
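
Until the two clients share an exception hierarchy, a parity test can assert against a tuple of accepted error types. A small sketch of that pattern; whether both classes are importable in a given build is an assumption.
{code:java}
def expected_error_types():
    # Collect whichever of the two error classes the current client exposes.
    types = []
    try:
        from py4j.protocol import Py4JJavaError
        types.append(Py4JJavaError)
    except ImportError:
        pass
    try:
        from pyspark.sql.connect.client import SparkConnectException
        types.append(SparkConnectException)
    except ImportError:
        pass
    return tuple(types)

# Usage inside a test case (assertRaises accepts a tuple of exception types):
# with self.assertRaises(expected_error_types()):
#     df.select(assert_true(df.id < 2, "too big")).toDF("val").collect()
{code}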

  was:
{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}


> Parity in Error types between pyspark and connect functions
> ---
>
> Key: SPARK-41897
> URL: https://issues.apache.org/jira/browse/SPARK-41897
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> PySpark throws Py4JJavaError, whereas Spark Connect throws SparkConnectException.
> {code:java}
> from pyspark.sql.functions import assert_true
> df = self.spark.range(3)
> self.assertEqual(
> df.select(assert_true(df.id < 3)).toDF("val").collect(),
> [Row(val=None), Row(val=None), Row(val=None)],
> )
> with self.assertRaises(Py4JJavaError) as cm:
> df.select(assert_true(df.id < 2, "too big")).toDF("val").collect(){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 950, in test_assert_true
> df.select(assert_true(df.id < 2, "too big")).toDF("val").collect()
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1076, in collect
> table = self._session.client.to_table(query)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 414, in to_table
> table, _ = self._execute_and_fetch(req)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 586, in _execute_and_fetch
> self._handle_error(rpc_error)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 629, in _handle_error
> raise SparkConnectException(status.message, info.reason) from None
> pyspark.sql.connect.client.SparkConnectException: 
> (java.lang.RuntimeException) too big {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41897) Parity in Error types between pyspark and connect functions

2023-01-05 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41897:
-

 Summary: Parity in Error types between pyspark and connect 
functions
 Key: SPARK-41897
 URL: https://issues.apache.org/jira/browse/SPARK-41897
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41891) Enable test_add_months_function, test_array_repeat, test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, test_window_time, test_reciprocal_tri

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41891:
--
Summary: Enable test_add_months_function, test_array_repeat, 
test_dayofweek, test_first_last_ignorenulls, test_function_parity, test_inline, 
test_window_time, test_reciprocal_trig_functions  (was: Enable 8 tests)

> Enable test_add_months_function, test_array_repeat, test_dayofweek, 
> test_first_last_ignorenulls, test_function_parity, test_inline, 
> test_window_time, test_reciprocal_trig_functions
> 
>
> Key: SPARK-41891
> URL: https://issues.apache.org/jira/browse/SPARK-41891
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41892) Add JIRAs or messages for skipped messages

2023-01-04 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41892:
-

 Summary: Add JIRAs or messages for skipped messages
 Key: SPARK-41892
 URL: https://issues.apache.org/jira/browse/SPARK-41892
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Sandeep Singh
 Fix For: 3.4.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41878) Add JIRAs or messages for skipped tests

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41878:
--
Summary: Add JIRAs or messages for skipped tests  (was: Add JIRAs or 
messages for skipped messages)

> Add JIRAs or messages for skipped tests
> ---
>
> Key: SPARK-41878
> URL: https://issues.apache.org/jira/browse/SPARK-41878
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> Add JIRAs or messages for all the skipped tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41891) Enable 8 tests

2023-01-04 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41891:
-

 Summary: Enable 8 tests
 Key: SPARK-41891
 URL: https://issues.apache.org/jira/browse/SPARK-41891
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Sandeep Singh
 Fix For: 3.4.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-04 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41887:
-

 Summary: Support DataFrame hint parameter to be list
 Key: SPARK-41887
 URL: https://issues.apache.org/jira/browse/SPARK-41887
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 556, in test_extended_hint_types
    hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 482, in hint
    raise TypeError(
TypeError: param should be a int or str, but got float 1.2345{code}
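
A possible widening of the client-side check would accept the same primitive types classic PySpark does, including float and lists of primitives. The sketch below is illustrative only and is not the actual connect/dataframe.py validation.
{code:java}
def check_hint_parameter(p):
    # Accept str, int and float directly, plus lists made of those primitives.
    allowed = (str, int, float)
    if isinstance(p, allowed):
        return p
    if isinstance(p, list) and all(isinstance(e, allowed) for e in p):
        return p
    raise TypeError("param should be a str, int, float or list thereof, "
                    "but got %s %r" % (type(p).__name__, p))

for param in (1.2345, "what", ["itworks1", "itworks2", "itworks3"]):
    check_hint_parameter(param)
{code}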



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41887:
--
Description: 
{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}

  was:
{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 556, in test_extended_hint_types
    hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 482, in hint
    raise TypeError(
TypeError: param should be a int or str, but got float 1.2345{code}


> Support DataFrame hint parameter to be list
> ---
>
> Key: SPARK-41887
> URL: https://issues.apache.org/jira/browse/SPARK-41887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}






[jira] [Updated] (SPARK-41871) DataFrame hint parameter can be str, float or int

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41871:
--
Summary: DataFrame hint parameter can be str, float or int  (was: DataFrame 
hint parameter can be str, list, float or int)

> DataFrame hint parameter can be str, float or int
> -
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}






[jira] [Updated] (SPARK-41884) DataFrame `toPandas` parity in return types

2023-01-04 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41884:
--
Description: 
{code:java}
import numpy as np
import pandas as pd

df = self.spark.createDataFrame(
[[[("a", 2, 3.0), ("a", 2, 3.0)]], [[("b", 5, 6.0), ("b", 5, 6.0)]]],
"array_struct_col Array>",
)
for is_arrow_enabled in [True, False]:
with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": 
is_arrow_enabled}):
pdf = df.toPandas()
self.assertEqual(type(pdf), pd.DataFrame)
self.assertEqual(type(pdf["array_struct_col"]), pd.Series)
if is_arrow_enabled:
self.assertEqual(type(pdf["array_struct_col"][0]), np.ndarray)
else:
self.assertEqual(type(pdf["array_struct_col"][0]), list){code}
{code:java}
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/test_dataframe.py", line 1202, in test_to_pandas_for_array_of_struct
    df = self.spark.createDataFrame(
  File "/__w/spark/spark/python/pyspark/sql/connect/session.py", line 264, in createDataFrame
    table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in _data])
  File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist
  File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object{code}
 
{code:java}
import numpy as np

pdf = self._to_pandas()
types = pdf.dtypes
self.assertEqual(types[0], np.int32)
self.assertEqual(types[1], np.object)
self.assertEqual(types[2], np.bool)
self.assertEqual(types[3], np.float32)
self.assertEqual(types[4], np.object)  # datetime.date
self.assertEqual(types[5], "datetime64[ns]")
self.assertEqual(types[6], "datetime64[ns]")
self.assertEqual(types[7], "timedelta64[ns]") {code}
{code:java}
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/test_dataframe.py", line 1039, in test_to_pandas
    self.assertEqual(types[5], "datetime64[ns]")
AssertionError: datetime64[ns, Etc/UTC] != 'datetime64[ns]'
{code}
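The dtype mismatch in the assertion above can be reproduced with pandas alone; a minimal sketch, assuming the Etc/UTC zone taken from the error message:
{code:python}
import pandas as pd

# Pandas-only illustration of the failing assertion; not Spark code.
naive = pd.Series(pd.to_datetime(["2023-01-01 00:00:00"]))
aware = naive.dt.tz_localize("Etc/UTC")
print(naive.dtype)   # datetime64[ns]           -- what the test expects
print(aware.dtype)   # datetime64[ns, Etc/UTC]  -- what toPandas returned
{code}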

  was:
{code:java}
schema = StructType(
[StructField("i", StringType(), True), StructField("j", IntegerType(), 
True)]
)
df = self.spark.createDataFrame([("a", 1)], schema)

schema1 = StructType([StructField("j", StringType()), StructField("i", 
StringType())])
df1 = df.to(schema1)
self.assertEqual(schema1, df1.schema)
self.assertEqual(df.count(), df1.count())

schema2 = StructType([StructField("j", LongType())])
df2 = df.to(schema2)
self.assertEqual(schema2, df2.schema)
self.assertEqual(df.count(), df2.count())

schema3 = StructType([StructField("struct", schema1, False)])
df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
self.assertEqual(schema3, df3.schema)
self.assertEqual(df.count(), df3.count())

# incompatible field nullability
schema4 = StructType([StructField("j", LongType(), False)])
self.assertRaisesRegex(
AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1486, in test_to
    self.assertRaisesRegex(
AssertionError: AnalysisException not raised by  {code}


> DataFrame `toPandas` parity in return types
> ---
>
> Key: SPARK-41884
> URL: https://issues.apache.org/jira/browse/SPARK-41884
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> import numpy as np
> import pandas as pd
> df = self.spark.createDataFrame(
> [[[("a", 2, 3.0), ("a", 2, 3.0)]], [[("b", 5, 6.0), ("b", 5, 6.0)]]],
> "array_struct_col Array>",
> )
> for is_arrow_enabled in [True, False]:
> with self.sql_conf({"spark.sql.execution.arrow.pyspark.enabled": 
> is_arrow_enabled}):
> pdf = df.toPandas()
> self.assertEqual(type(pdf), pd.DataFrame)
> self.assertEqual(type(pdf["array_struct_col"]), pd.Series)
> if is_arrow_enabled:
> self.assertEqual(type(pdf["array_struct_col"][0]), np.ndarray)
> else:
> self.assertEqual(type(pdf["array_struct_col"][0]), list){code}
> {code:java}
> Trac

[jira] [Created] (SPARK-41884) DataFrame `toPandas` parity in return types

2023-01-04 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41884:
-

 Summary: DataFrame `toPandas` parity in return types
 Key: SPARK-41884
 URL: https://issues.apache.org/jira/browse/SPARK-41884
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
schema = StructType(
[StructField("i", StringType(), True), StructField("j", IntegerType(), 
True)]
)
df = self.spark.createDataFrame([("a", 1)], schema)

schema1 = StructType([StructField("j", StringType()), StructField("i", 
StringType())])
df1 = df.to(schema1)
self.assertEqual(schema1, df1.schema)
self.assertEqual(df.count(), df1.count())

schema2 = StructType([StructField("j", LongType())])
df2 = df.to(schema2)
self.assertEqual(schema2, df2.schema)
self.assertEqual(df.count(), df2.count())

schema3 = StructType([StructField("struct", schema1, False)])
df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
self.assertEqual(schema3, df3.schema)
self.assertEqual(df.count(), df3.count())

# incompatible field nullability
schema4 = StructType([StructField("j", LongType(), False)])
self.assertRaisesRegex(
AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1486, in test_to
    self.assertRaisesRegex(
AssertionError: AnalysisException not raised by  {code}






[jira] [Updated] (SPARK-41878) Add JIRAs or messages for skipped messages

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41878:
--
Description: Add JIRAs or Messages for all the skipped messages.  (was: 5 
tests pass now. Should enable them.)

> Add JIRAs or messages for skipped messages
> --
>
> Key: SPARK-41878
> URL: https://issues.apache.org/jira/browse/SPARK-41878
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Add JIRAs or Messages for all the skipped messages.






[jira] [Created] (SPARK-41878) Add JIRAs or messages for skipped messages

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41878:
-

 Summary: Add JIRAs or messages for skipped messages
 Key: SPARK-41878
 URL: https://issues.apache.org/jira/browse/SPARK-41878
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Tests
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Hyukjin Kwon
 Fix For: 3.4.0


5 tests pass now. Should enable them.






[jira] [Updated] (SPARK-41877) SparkSession.createDataFrame error parity

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41877:
--
Description: 
{code:java}
df = self.spark.createDataFrame(
[
(1, 10, 1.0, "one"),
(2, 20, 2.0, "two"),
(3, 30, 3.0, "three"),
],
["id", "int", "double", "str"],
)

with self.subTest(desc="with none identifier"):
with self.assertRaisesRegex(AssertionError, "ids must not be None"):
df.unpivot(None, ["int", "double"], "var", "val"){code}
Error:
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 575, in test_unpivot
    with self.assertRaisesRegex(AssertionError, "ids must not be None"):
AssertionError: AssertionError not raised{code}

  was:
{code:java}
df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code}
Error:
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 65, in test_duplicated_column_names
    df = self.spark.createDataFrame([(1, 2)], ["c", "c"])
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 
277, in createDataFrame
    raise ValueError(
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 
elements{code}


> SparkSession.createDataFrame error parity
> -
>
> Key: SPARK-41877
> URL: https://issues.apache.org/jira/browse/SPARK-41877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame(
> [
> (1, 10, 1.0, "one"),
> (2, 20, 2.0, "two"),
> (3, 30, 3.0, "three"),
> ],
> ["id", "int", "double", "str"],
> )
> with self.subTest(desc="with none identifier"):
> with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> df.unpivot(None, ["int", "double"], "var", "val"){code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 575, in test_unpivot
>     with self.assertRaisesRegex(AssertionError, "ids must not be None"):
> AssertionError: AssertionError not raised{code}
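The check the test expects is only a precondition on ids; a hypothetical sketch (parameter names follow the test call, not necessarily the connect implementation):
{code:python}
from typing import List, Optional

def unpivot(ids: Optional[List[str]], values: Optional[List[str]],
            variableColumnName: str, valueColumnName: str) -> None:
    # Hypothetical sketch of the missing precondition; the real method is on
    # DataFrame in connect/dataframe.py and goes on to build the plan.
    assert ids is not None, "ids must not be None"
{code}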






[jira] [Created] (SPARK-41877) SparkSession.createDataFrame error parity

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41877:
-

 Summary: SparkSession.createDataFrame error parity
 Key: SPARK-41877
 URL: https://issues.apache.org/jira/browse/SPARK-41877
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code}
Error:
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 65, in test_duplicated_column_names
    df = self.spark.createDataFrame([(1, 2)], ["c", "c"])
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 
277, in createDataFrame
    raise ValueError(
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 
elements{code}






[jira] [Created] (SPARK-41876) Implement DataFrame `toLocalIterator`

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41876:
-

 Summary: Implement DataFrame `toLocalIterator`
 Key: SPARK-41876
 URL: https://issues.apache.org/jira/browse/SPARK-41876
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
schema = StructType(
[StructField("i", StringType(), True), StructField("j", IntegerType(), 
True)]
)
df = self.spark.createDataFrame([("a", 1)], schema)

schema1 = StructType([StructField("j", StringType()), StructField("i", 
StringType())])
df1 = df.to(schema1)
self.assertEqual(schema1, df1.schema)
self.assertEqual(df.count(), df1.count())

schema2 = StructType([StructField("j", LongType())])
df2 = df.to(schema2)
self.assertEqual(schema2, df2.schema)
self.assertEqual(df.count(), df2.count())

schema3 = StructType([StructField("struct", schema1, False)])
df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
self.assertEqual(schema3, df3.schema)
self.assertEqual(df.count(), df3.count())

# incompatible field nullability
schema4 = StructType([StructField("j", LongType(), False)])
self.assertRaisesRegex(
AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1486, in test_to
    self.assertRaisesRegex(
AssertionError: AnalysisException not raised by  {code}






[jira] [Updated] (SPARK-41875) Throw proper errors in Dataset.to()

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41875:
--
Description: 
{code:java}
schema = StructType(
[StructField("i", StringType(), True), StructField("j", IntegerType(), 
True)]
)
df = self.spark.createDataFrame([("a", 1)], schema)

schema1 = StructType([StructField("j", StringType()), StructField("i", 
StringType())])
df1 = df.to(schema1)
self.assertEqual(schema1, df1.schema)
self.assertEqual(df.count(), df1.count())

schema2 = StructType([StructField("j", LongType())])
df2 = df.to(schema2)
self.assertEqual(schema2, df2.schema)
self.assertEqual(df.count(), df2.count())

schema3 = StructType([StructField("struct", schema1, False)])
df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
self.assertEqual(schema3, df3.schema)
self.assertEqual(df.count(), df3.count())

# incompatible field nullability
schema4 = StructType([StructField("j", LongType(), False)])
self.assertRaisesRegex(
AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1486, in test_to
    self.assertRaisesRegex(
AssertionError: AnalysisException not raised by  {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 401, in pyspark.sql.connect.dataframe.DataFrame.sample
Failed example:
    df.sample(0.5, 3).count()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df.sample(0.5, 3).count()
    TypeError: DataFrame.sample() takes 2 positional arguments but 3 were given
**
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 411, in pyspark.sql.connect.dataframe.DataFrame.sample
Failed example:
    df.sample(False, fraction=1.0).count()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df.sample(False, fraction=1.0).count()
    TypeError: DataFrame.sample() got multiple values for argument 
'fraction'{code}


> Throw proper errors in Dataset.to()
> ---
>
> Key: SPARK-41875
> URL: https://issues.apache.org/jira/browse/SPARK-41875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by  {code}






[jira] [Created] (SPARK-41875) Throw proper errors in Dataset.to()

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41875:
-

 Summary: Throw proper errors in Dataset.to()
 Key: SPARK-41875
 URL: https://issues.apache.org/jira/browse/SPARK-41875
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 401, in pyspark.sql.connect.dataframe.DataFrame.sample
Failed example:
    df.sample(0.5, 3).count()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df.sample(0.5, 3).count()
    TypeError: DataFrame.sample() takes 2 positional arguments but 3 were given
**
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 411, in pyspark.sql.connect.dataframe.DataFrame.sample
Failed example:
    df.sample(False, fraction=1.0).count()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df.sample(False, fraction=1.0).count()
    TypeError: DataFrame.sample() got multiple values for argument 
'fraction'{code}






[jira] [Created] (SPARK-41874) Implement DataFrame `sameSemantics`

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41874:
-

 Summary: Implement DataFrame `sameSemantics`
 Key: SPARK-41874
 URL: https://issues.apache.org/jira/browse/SPARK-41874
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh









[jira] [Updated] (SPARK-41872) Fix DataFrame createDataFrame handling of None

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41872:
--
Summary: Fix DataFrame createDataFrame handling of None  (was: Fix 
DataFrame fillna with bool)

> Fix DataFrame createDataFrame handling of None
> --
>
> Key: SPARK-41872
> URL: https://issues.apache.org/jira/browse/SPARK-41872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> row = self.spark.createDataFrame([("Alice", None, None, None)], 
> schema).fillna(True).first()
> self.assertEqual(row.age, None){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 231, in test_fillna
>     self.assertEqual(row.age, None)
> AssertionError: nan != None{code}






[jira] [Updated] (SPARK-41872) Fix DataFrame createDataFrame handling of None

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41872:
--
Description: 
{code:java}
row = self.spark.createDataFrame([("Alice", None, None, None)], 
schema).fillna(True).first()
self.assertEqual(row.age, None){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 231, in test_fillna
    self.assertEqual(row.age, None)
AssertionError: nan != None{code}
 
{code:java}
row = (
self.spark.createDataFrame([("Alice", 10, None)], schema)
.replace(10, 20, subset=["name", "height"])
.first()
)
self.assertEqual(row.name, "Alice")
self.assertEqual(row.age, 10)
self.assertEqual(row.height, None) {code}
{code:java}
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace
    self.assertEqual(row.height, None)
AssertionError: nan != None
{code}

  was:
{code:java}
row = self.spark.createDataFrame([("Alice", None, None, None)], 
schema).fillna(True).first()
self.assertEqual(row.age, None){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 231, in test_fillna
    self.assertEqual(row.age, None)
AssertionError: nan != None{code}


> Fix DataFrame createDataFrame handling of None
> --
>
> Key: SPARK-41872
> URL: https://issues.apache.org/jira/browse/SPARK-41872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> row = self.spark.createDataFrame([("Alice", None, None, None)], 
> schema).fillna(True).first()
> self.assertEqual(row.age, None){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 231, in test_fillna
>     self.assertEqual(row.age, None)
> AssertionError: nan != None{code}
>  
> {code:java}
> row = (
> self.spark.createDataFrame([("Alice", 10, None)], schema)
> .replace(10, 20, subset=["name", "height"])
> .first()
> )
> self.assertEqual(row.name, "Alice")
> self.assertEqual(row.age, 10)
> self.assertEqual(row.height, None) {code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace
>     self.assertEqual(row.height, None)
> AssertionError: nan != None
> {code}






[jira] [Updated] (SPARK-41873) Implement DataFrame `pandas_api`

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41873:
--
Summary: Implement DataFrame `pandas_api`  (was: Implement DataFrameReader 
`pandas_api`)

> Implement DataFrame `pandas_api`
> 
>
> Key: SPARK-41873
> URL: https://issues.apache.org/jira/browse/SPARK-41873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}






[jira] [Created] (SPARK-41873) Implement DataFrameReader `pandas_api`

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41873:
-

 Summary: Implement DataFrameReader `pandas_api`
 Key: SPARK-41873
 URL: https://issues.apache.org/jira/browse/SPARK-41873
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}






[jira] [Updated] (SPARK-41873) Implement DataFrame `pandas_api`

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41873:
--
Description: (was: {code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code})

> Implement DataFrame `pandas_api`
> 
>
> Key: SPARK-41873
> URL: https://issues.apache.org/jira/browse/SPARK-41873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>







[jira] [Updated] (SPARK-41872) Fix DataFrame fillna with bool

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41872:
--
Description: 
{code:java}
row = self.spark.createDataFrame([("Alice", None, None, None)], 
schema).fillna(True).first()
self.assertEqual(row.age, None){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 231, in test_fillna
    self.assertEqual(row.age, None)
AssertionError: nan != None{code}

  was:
{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 556, in test_extended_hint_types
    hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 482, in hint
    raise TypeError(
TypeError: param should be a int or str, but got float 1.2345{code}


> Fix DataFrame fillna with bool
> --
>
> Key: SPARK-41872
> URL: https://issues.apache.org/jira/browse/SPARK-41872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> row = self.spark.createDataFrame([("Alice", None, None, None)], 
> schema).fillna(True).first()
> self.assertEqual(row.age, None){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 231, in test_fillna
>     self.assertEqual(row.age, None)
> AssertionError: nan != None{code}






[jira] [Created] (SPARK-41872) Fix DataFrame fillna with bool

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41872:
-

 Summary: Fix DataFrame fillna with bool
 Key: SPARK-41872
 URL: https://issues.apache.org/jira/browse/SPARK-41872
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 556, in test_extended_hint_types
    hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 482, in hint
    raise TypeError(
TypeError: param should be a int or str, but got float 1.2345{code}






[jira] [Created] (SPARK-41871) DataFrame hint parameter can be str, list, float or int

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41871:
-

 Summary: DataFrame hint parameter can be str, list, float or int
 Key: SPARK-41871
 URL: https://issues.apache.org/jira/browse/SPARK-41871
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", "age"])

# shouldn't drop a non-null row
self.assertEqual(df.dropDuplicates().count(), 2)

self.assertEqual(df.dropDuplicates(["name"]).count(), 1)

self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)

type_error_msg = "Parameter 'subset' must be a list of columns"
with self.assertRaisesRegex(TypeError, type_error_msg):
df.dropDuplicates("name"){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 128, in test_drop_duplicates
    with self.assertRaisesRegex(TypeError, type_error_msg):
AssertionError: TypeError not raised{code}






[jira] [Updated] (SPARK-41871) DataFrame hint parameter can be str, list, float or int

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41871:
--
Description: 
{code:java}
df = self.spark.range(10e10).toDF("id")
such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 556, in test_extended_hint_types
    hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 482, in hint
    raise TypeError(
TypeError: param should be a int or str, but got float 1.2345{code}

  was:
{code:java}
df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", "age"])

# shouldn't drop a non-null row
self.assertEqual(df.dropDuplicates().count(), 2)

self.assertEqual(df.dropDuplicates(["name"]).count(), 1)

self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)

type_error_msg = "Parameter 'subset' must be a list of columns"
with self.assertRaisesRegex(TypeError, type_error_msg):
df.dropDuplicates("name"){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 128, in test_drop_duplicates
    with self.assertRaisesRegex(TypeError, type_error_msg):
AssertionError: TypeError not raised{code}


> DataFrame hint parameter can be str, list, float or int
> ---
>
> Key: SPARK-41871
> URL: https://issues.apache.org/jira/browse/SPARK-41871
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 556, in test_extended_hint_types
>     hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 482, in hint
>     raise TypeError(
> TypeError: param should be a int or str, but got float 1.2345{code}






[jira] [Created] (SPARK-41870) Handle duplicate columns in `createDataFrame`

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41870:
-

 Summary: Handle duplicate columns in `createDataFrame`
 Key: SPARK-41870
 URL: https://issues.apache.org/jira/browse/SPARK-41870
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
import array

data = [Row(longarray=array.array("l", [-9223372036854775808, 0, 
9223372036854775807]))]
df = self.spark.createDataFrame(data) {code}
Error:
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1220, in test_create_dataframe_from_array_of_long
    df = self.spark.createDataFrame(data)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 
260, in createDataFrame
    table = pa.Table.from_pylist([row.asDict(recursive=True) for row in _data])
  File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist
  File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert array('l', [-9223372036854775808, 
0, 9223372036854775807]) with type array.array: did not recognize Python value 
type when inferring an Arrow data type{code}






[jira] [Updated] (SPARK-41870) Handle duplicate columns in `createDataFrame`

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41870:
--
Description: 
{code:java}
df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code}
Error:
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 65, in test_duplicated_column_names
    df = self.spark.createDataFrame([(1, 2)], ["c", "c"])
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 
277, in createDataFrame
    raise ValueError(
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 
elements{code}

  was:
{code:java}
import array

data = [Row(longarray=array.array("l", [-9223372036854775808, 0, 
9223372036854775807]))]
df = self.spark.createDataFrame(data) {code}
Error:
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1220, in test_create_dataframe_from_array_of_long
    df = self.spark.createDataFrame(data)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 
260, in createDataFrame
    table = pa.Table.from_pylist([row.asDict(recursive=True) for row in _data])
  File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist
  File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert array('l', [-9223372036854775808, 
0, 9223372036854775807]) with type array.array: did not recognize Python value 
type when inferring an Arrow data type{code}


> Handle duplicate columns in `createDataFrame`
> -
>
> Key: SPARK-41870
> URL: https://issues.apache.org/jira/browse/SPARK-41870
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([(1, 2)], ["c", "c"]){code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 65, in test_duplicated_column_names
>     df = self.spark.createDataFrame([(1, 2)], ["c", "c"])
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", 
> line 277, in createDataFrame
>     raise ValueError(
> ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 
> elements{code}
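One plausible cause, given the dict(zip(_cols, ...)) row conversion seen in the related tracebacks, is that duplicate names collapse when a row becomes a dict; a tiny illustration (an assumption, not a confirmed root cause):
{code:python}
# Illustration only: duplicate column names collapse in a dict, so only one
# column reaches Arrow where two were expected.
cols = ["c", "c"]
row = (1, 2)
print(dict(zip(cols, row)))  # {'c': 2} -- one key survives for two columns
{code}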






[jira] [Updated] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41869:
--
Description: 
{code:java}
df = self.spark.createDataFrame([("Alice", 50), ("Alice", 60)], ["name", "age"])

# shouldn't drop a non-null row
self.assertEqual(df.dropDuplicates().count(), 2)

self.assertEqual(df.dropDuplicates(["name"]).count(), 1)

self.assertEqual(df.dropDuplicates(["name", "age"]).count(), 2)

type_error_msg = "Parameter 'subset' must be a list of columns"
with self.assertRaisesRegex(TypeError, type_error_msg):
df.dropDuplicates("name"){code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 128, in test_drop_duplicates
    with self.assertRaisesRegex(TypeError, type_error_msg):
AssertionError: TypeError not raised{code}
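A minimal sketch of the type check the test expects (the error text is taken verbatim from the test; the helper name is illustrative):
{code:python}
from typing import List, Optional

def _check_subset(subset: Optional[List[str]]) -> None:
    # Hypothetical sketch; the actual check would live in connect/dataframe.py.
    if subset is not None and not isinstance(subset, list):
        raise TypeError("Parameter 'subset' must be a list of columns")
{code}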

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1411, in pyspark.sql.connect.functions.map_filter
Failed example:
    df.select(map_filter(
        "data", lambda _, v: v > 30.0).alias("data_filtered")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(map_filter(
      File 
"/Users/s.singh/personal/spark-

[jira] [Created] (SPARK-41869) DataFrame dropDuplicates should throw error on non list argument

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41869:
-

 Summary: DataFrame dropDuplicates should throw error on non list 
argument
 Key: SPARK-41869
 URL: https://issues.apache.org/jira/browse/SPARK-41869
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1411, in pyspark.sql.connect.functions.map_filter
Failed example:
    df.select(map_filter(
        "data", lambda _, v: v > 30.0).alias("data_filtered")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(map_filter(
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pand

[jira] [Commented] (SPARK-41855) `createDataFrame` doesn't handle None/NaN properly

2023-01-03 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654255#comment-17654255
 ] 

Sandeep Singh commented on SPARK-41855:
---

[~podongfeng] There is another failure that might be similar:
{code:java}
self.assertEqual(
    self.spark.createDataFrame(data=[Decimal("NaN")], schema="decimal").collect(),
    [Row(value=None)],
) {code}
cc: [~gurwls223] 

> `createDataFrame` doesn't handle None/NaN properly
> --
>
> Key: SPARK-41855
> URL: https://issues.apache.org/jira/browse/SPARK-41855
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:python}
> data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), 
> Row(id=3, value=None)]
> # +---+-+
> # | id|value|
> # +---+-+
> # |  1|  NaN|
> # |  2| 42.0|
> # |  3| null|
> # +---+-+
> cdf = self.connect.createDataFrame(data)
> sdf = self.spark.createDataFrame(data)
> print()
> print()
> print(cdf._show_string(100, 100, False))
> print()
> print(cdf.schema)
> print()
> print(sdf._jdf.showString(100, 100, False))
> print()
> print(sdf.schema)
> self.compare_by_show(cdf, sdf)
> {code}
> {code:java}
> +---+-+
> | id|value|
> +---+-+
> |  1| null|
> |  2| 42.0|
> |  3| null|
> +---+-+
> StructType([StructField('id', LongType(), True), StructField('value', 
> DoubleType(), True)])
> +---+-+
> | id|value|
> +---+-+
> |  1|  NaN|
> |  2| 42.0|
> |  3| null|
> +---+-+
> StructType([StructField('id', LongType(), True), StructField('value', 
> DoubleType(), True)])
> {code}
> This issue is because `createDataFrame` doesn't handle None/NaN properly:
> 1. In the conversion from local data to pd.DataFrame, both None and NaN are automatically converted to NaN.
> 2. Then, in the conversion from pd.DataFrame to pa.Table, NaN is always converted to null.
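Both steps can be seen outside Spark with a short pandas/pyarrow snippet; this is an illustration of the behaviour described above, not the createDataFrame code path itself:
{code:python}
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"value": [float("nan"), 42.0, None]})
print(pdf["value"].tolist())   # [nan, 42.0, nan] -- None already became NaN

table = pa.Table.from_pandas(pdf)
print(table.column("value"))   # NaN is mapped to null by default
{code}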






[jira] [Updated] (SPARK-41856) Enable test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41856:
--
Summary: Enable test_freqItems, test_input_files, 
test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found  (was: 
Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, 
test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found)

> Enable test_freqItems, test_input_files, test_toDF_with_schema_string, 
> test_to_pandas_required_pandas_not_found
> ---
>
> Key: SPARK-41856
> URL: https://issues.apache.org/jira/browse/SPARK-41856
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> 5 tests pass now. Should enable them.






[jira] [Updated] (SPARK-41868) Support data type Duration(NANOSECOND)

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41868:
--
Description: 
{code:java}
import pandas as pd
from datetime import timedelta

df = self.spark.createDataFrame(pd.DataFrame({"a": 
[timedelta(microseconds=123)]})) {code}
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1291, in test_create_dataframe_from_pandas_with_day_time_interval
    self.assertEqual(df.toPandas().a.iloc[0], timedelta(microseconds=123))
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
    return self._session.client.to_pandas(query)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
    return self._execute_and_fetch(req)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
    self._handle_error(rpc_error)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
623, in _handle_error
    raise SparkConnectException(status.message, info.reason) from None
pyspark.sql.connect.client.SparkConnectException: 
(org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
Duration(NANOSECOND){code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1966, in pyspark.sql.connect.functions.hour
Failed example:
    df.select(hour('ts').alias('hour')).collect()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(hour('ts').alias('hour')).collect()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1017, in collect
        pdf = self.toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
623, in _handle_error
        raise SparkConnectException(status.message, info.reason) from None
    pyspark.sql.connect.client.SparkConnectException: 
(org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
Timestamp(NANOSECOND, null){code}


> Support data type Duration(NANOSECOND)
> --
>
> Key: SPARK-41868
> URL: https://issues.apache.org/jira/browse/SPARK-41868
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> import pandas as pd
> from datetime import timedelta
> df = self.spark.createDataFrame(pd.DataFrame({"a": 
> [timedelta(microseconds=123)]})) {code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1291, in test_create_dataframe_from_pandas_with_day_time_interval
>     self.assertEqual(df.toPandas().a.iloc[0], timedelta(microseconds=123))
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>     return self._session.client.to_pandas(query)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>     return self._execute_and_fetch(req)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>     self._handle_error(rpc_error)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 623, in _handle_error
>     raise SparkConnectException(status.message, info.reason) from None
> pyspark.sql.connect.client.SparkConnectException: 
> (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
> Duration(NANOSECOND){code}
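
For context, a small sketch (plain pandas/pyarrow, outside Spark; an illustration of the likely type mapping, not the Connect code path) of why the server ends up seeing Duration(NANOSECOND):
{code:python}
import pandas as pd
import pyarrow as pa
from datetime import timedelta

pdf = pd.DataFrame({"a": [timedelta(microseconds=123)]})
table = pa.Table.from_pandas(pdf)

# pandas stores timedeltas as timedelta64[ns], which Arrow maps to duration[ns] --
# the Duration(NANOSECOND) type rejected in the traceback above. Spark's
# DayTimeIntervalType is microsecond-based, so either the server would need to
# accept duration[ns] or the client would need to cast to duration[us] first.
print(pdf.dtypes["a"])                # timedelta64[ns]
print(table.schema.field("a").type)   # duration[ns]
{code}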



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41868) Support data type Duration(NANOSECOND)

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41868:
-

 Summary: Support data type Duration(NANOSECOND)
 Key: SPARK-41868
 URL: https://issues.apache.org/jira/browse/SPARK-41868
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1966, in pyspark.sql.connect.functions.hour
Failed example:
    df.select(hour('ts').alias('hour')).collect()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(hour('ts').alias('hour')).collect()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1017, in collect
        pdf = self.toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
623, in _handle_error
        raise SparkConnectException(status.message, info.reason) from None
    pyspark.sql.connect.client.SparkConnectException: 
(org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
Timestamp(NANOSECOND, null){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41866) Make `createDataFrame` support array

2023-01-03 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41866:
--
Description: 
{code:java}
import array

data = [Row(longarray=array.array("l", [-9223372036854775808, 0, 9223372036854775807]))]
df = self.spark.createDataFrame(data) {code}
Error:
{code:java}
Traceback (most recent call last):
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", 
line 1220, in test_create_dataframe_from_array_of_long
    df = self.spark.createDataFrame(data)
  File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", line 
260, in createDataFrame
    table = pa.Table.from_pylist([row.asDict(recursive=True) for row in _data])
  File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist
  File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays
  File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays
  File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert array('l', [-9223372036854775808, 
0, 9223372036854775807]) with type array.array: did not recognize Python value 
type when inferring an Arrow data type{code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 2331, in pyspark.sql.connect.functions.call_udf
Failed example:
    _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
    AttributeError: 'SparkSession' object has no attribute 'udf'{code}


> Make `createDataFrame` support array
> 
>
> Key: SPARK-41866
> URL: https://issues.apache.org/jira/browse/SPARK-41866
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> import array
> data = [Row(longarray=array.array("l", [-9223372036854775808, 0, 
> 9223372036854775807]))]
> df = self.spark.createDataFrame(data) {code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1220, in test_create_dataframe_from_array_of_long
>     df = self.spark.createDataFrame(data)
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/session.py", 
> line 260, in createDataFrame
>     table = pa.Table.from_pylist([row.asDict(recursive=True) for row in 
> _data])
>   File "pyarrow/table.pxi", line 3700, in pyarrow.lib.Table.from_pylist
>   File "pyarrow/table.pxi", line 5221, in pyarrow.lib._from_pylist
>   File "pyarrow/table.pxi", line 3575, in pyarrow.lib.Table.from_arrays
>   File "pyarrow/table.pxi", line 1383, in pyarrow.lib._sanitize_arrays
>   File "pyarrow/table.pxi", line 1364, in pyarrow.lib._schema_from_arrays
>   File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Could not convert array('l', [-9223372036854775808, 
> 0, 9223372036854775807]) with type array.array: did not recognize Python 
> value type when inferring an Arrow data type{code}
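
A minimal sketch of the inference failure and one possible direction (an illustration only, not the actual fix): pyarrow cannot infer an Arrow type from array.array, but it can once the value is a plain list (or an explicit schema is supplied).
{code:python}
import array
import pyarrow as pa

long_array = array.array("l", [-9223372036854775808, 0, 9223372036854775807])

# Fails with "did not recognize Python value type when inferring an Arrow data type":
# pa.Table.from_pylist([{"longarray": long_array}])

# Works once the value is a plain list, inferring list<int64>:
table = pa.Table.from_pylist([{"longarray": list(long_array)}])
print(table.schema)   # longarray: list<item: int64>
{code}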



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41866) Make `createDataFrame` support array

2023-01-03 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41866:
-

 Summary: Make `createDataFrame` support array
 Key: SPARK-41866
 URL: https://issues.apache.org/jira/browse/SPARK-41866
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 2331, in pyspark.sql.connect.functions.call_udf
Failed example:
    _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
    AttributeError: 'SparkSession' object has no attribute 'udf'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found

2023-01-03 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654239#comment-17654239
 ] 

Sandeep Singh commented on SPARK-41856:
---

[~gurwls223] for some reason it's still assigned to you.

> Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, 
> test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
> --
>
> Key: SPARK-41856
> URL: https://issues.apache.org/jira/browse/SPARK-41856
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> 5 tests pass now. Should enable them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_app

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41857:
--
Summary: Enable test_between_function, test_datetime_functions, test_expr, 
test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, 
test_crosstab, test_approxQuantile  (was: Enable test_between_function, 
test_datetime_functions, test_expr, test_function_parity, test_math_functions, 
test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, 
test_approxQuantile)

> Enable test_between_function, test_datetime_functions, test_expr, 
> test_math_functions, test_window_functions_cumulative_sum, test_corr, 
> test_cov, test_crosstab, test_approxQuantile
> 
>
> Key: SPARK-41857
> URL: https://issues.apache.org/jira/browse/SPARK-41857
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41857) Enable test_between_function, test_datetime_functions, test_expr, test_function_parity, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, t

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41857:
--
Summary: Enable test_between_function, test_datetime_functions, test_expr, 
test_function_parity, test_math_functions, 
test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, 
test_approxQuantile  (was: Enable 10 tests that pass)

> Enable test_between_function, test_datetime_functions, test_expr, 
> test_function_parity, test_math_functions, 
> test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, 
> test_approxQuantile
> --
>
> Key: SPARK-41857
> URL: https://issues.apache.org/jira/browse/SPARK-41857
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41857) Enable 10 tests that pass

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41857:
-

 Summary: Enable 10 tests that pass
 Key: SPARK-41857
 URL: https://issues.apache.org/jira/browse/SPARK-41857
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Hyukjin Kwon
 Fix For: 3.4.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41856:
--
Description: 5 tests pass now. Should enable them.  (was: These tests pass 
now. Should enable them.)

> Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, 
> test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found
> --
>
> Key: SPARK-41856
> URL: https://issues.apache.org/jira/browse/SPARK-41856
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> 5 tests pass now. Should enable them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41856) Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_toDF_with_schema_string, test_to_pandas_required_pandas_not_found

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41856:
-

 Summary: Enable test_create_nan_decimal_dataframe, test_freqItems, 
test_input_files, test_toDF_with_schema_string, 
test_to_pandas_required_pandas_not_found
 Key: SPARK-41856
 URL: https://issues.apache.org/jira/browse/SPARK-41856
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Tests
Affects Versions: 3.4.0
Reporter: Sandeep Singh
Assignee: Hyukjin Kwon
 Fix For: 3.4.0


These tests pass now. Should enable them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41852) Fix `pmod` function

2023-01-02 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653750#comment-17653750
 ] 

Sandeep Singh commented on SPARK-41852:
---

[~podongfeng] these are from the doctests
{code:java}
>>> from pyspark.sql.functions import pmod
>>> df = spark.createDataFrame([
... (1.0, float('nan')), (float('nan'), 2.0), (10.0, 3.0),
... (float('nan'), float('nan')), (-3.0, 4.0), (-10.0, 3.0),
... (-5.0, -6.0), (7.0, -8.0), (1.0, 2.0)],
... ("a", "b"))
>>> df.select(pmod("a", "b")).show() {code}
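
One thing worth checking (an assumption to verify, not a conclusion): the nulls in the "Got" output may come from NaN being turned into null at createDataFrame time (as in SPARK-41855) rather than from pmod itself. A quick check would be to show the inputs before applying pmod:
{code:python}
# If the NaN inputs already display as null here, the conversion in
# createDataFrame is the culprit and pmod is only reflecting it.
df.select("a", "b").show()
{code}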

> Fix `pmod` function
> ---
>
> Key: SPARK-41852
> URL: https://issues.apache.org/jira/browse/SPARK-41852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 622, in pyspark.sql.connect.functions.pmod
> Failed example:
>     df.select(pmod("a", "b")).show()
> Expected:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |       NaN|
>     |       NaN|
>     |       1.0|
>     |       NaN|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
> Got:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |      null|
>     |      null|
>     |       1.0|
>     |      null|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41851) Fix `nanvl` function

2023-01-02 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653751#comment-17653751
 ] 

Sandeep Singh commented on SPARK-41851:
---

[~podongfeng] 
{code:java}
>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))
>>> df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, df.b).alias("r2")).collect() {code}
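
For reference, the semantics the doctest relies on, sketched in plain Python (not Spark's implementation): nanvl(col1, col2) returns col1 unless it is NaN, in which case it returns col2.
{code:python}
import math

def nanvl_py(a: float, b: float) -> float:
    # Return a unless it is NaN, otherwise fall back to b.
    return b if math.isnan(a) else a

assert nanvl_py(1.0, float("nan")) == 1.0   # first row: r1 = r2 = 1.0
assert nanvl_py(float("nan"), 2.0) == 2.0   # second row: r1 = r2 = 2.0, not nan
{code}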

> Fix `nanvl` function
> 
>
> Key: SPARK-41851
> URL: https://issues.apache.org/jira/browse/SPARK-41851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 313, in pyspark.sql.connect.functions.nanvl
> Failed example:
>     df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
> df.b).alias("r2")).collect()
> Expected:
>     [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
> Got:
>     [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
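
A likely contributor (an assumption based on the error above, not a confirmed root cause): when the doctest DataFrame is rebuilt through pandas/pyarrow on the Connect path, a Python dict is inferred as a struct rather than a map, which matches the STRUCT-vs-MAP mismatch reported here.
{code:python}
import pyarrow as pa

# Arrow type inference turns a dict value into a struct, not a map.
table = pa.Table.from_pylist([{"mapfield": {"a": "b"}}])
print(table.schema)   # mapfield: struct<a: string>
{code}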
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1411, in pyspark.sql.connect.functions.map_filter
Failed example:
    df.select(map_filter(
        "data", lambda _, v: v > 30.0).alias("data_filtered")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(map_filter(
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
 

[jira] [Created] (SPARK-41852) Fix `pmod` function

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41852:
-

 Summary: Fix `pmod` function
 Key: SPARK-41852
 URL: https://issues.apache.org/jira/browse/SPARK-41852
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Sandeep Singh
 Fix For: 3.4.0


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41852) Fix `pmod` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41852:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 622, in pyspark.sql.connect.functions.pmod
Failed example:
    df.select(pmod("a", "b")).show()
Expected:
    +----------+
    |pmod(a, b)|
    +----------+
    |       NaN|
    |       NaN|
    |       1.0|
    |       NaN|
    |       1.0|
    |       2.0|
    |      -5.0|
    |       7.0|
    |       1.0|
    +----------+
Got:
    +----------+
    |pmod(a, b)|
    +----------+
    |      null|
    |      null|
    |       1.0|
    |      null|
    |       1.0|
    |       2.0|
    |      -5.0|
    |       7.0|
    |       1.0|
    +----------+
    {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}


> Fix `pmod` function
> ---
>
> Key: SPARK-41852
> URL: https://issues.apache.org/jira/browse/SPARK-41852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 622, in pyspark.sql.connect.functions.pmod
> Failed example:
>     df.select(pmod("a", "b")).show()
> Expected:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |       NaN|
>     |       NaN|
>     |       1.0|
>     |       NaN|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
> Got:
>     +----------+
>     |pmod(a, b)|
>     +----------+
>     |      null|
>     |      null|
>     |       1.0|
>     |      null|
>     |       1.0|
>     |       2.0|
>     |      -5.0|
>     |       7.0|
>     |       1.0|
>     +----------+
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41851) Fix `nanvl` function

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41851:
-

 Summary: Fix `nanvl` function
 Key: SPARK-41851
 URL: https://issues.apache.org/jira/browse/SPARK-41851
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Sandeep Singh
 Fix For: 3.4.0


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 801, in pyspark.sql.connect.functions.count
Failed example:
    df.select(count(expr("*")), count(df.alphabets)).show()
Expected:
    +--------+----------------+
    |count(1)|count(alphabets)|
    +--------+----------------+
    |       4|               3|
    +--------+----------------+
Got:
    +----------------+----------------+
    |count(alphabets)|count(alphabets)|
    +----------------+----------------+
    |               3|               3|
    +----------------+----------------+
     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41851) Fix `nanvl` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41851:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 313, in pyspark.sql.connect.functions.nanvl
Failed example:
    df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
df.b).alias("r2")).collect()
Expected:
    [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
Got:
    [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 801, in pyspark.sql.connect.functions.count
Failed example:
    df.select(count(expr("*")), count(df.alphabets)).show()
Expected:
    +--------+----------------+
    |count(1)|count(alphabets)|
    +--------+----------------+
    |       4|               3|
    +--------+----------------+
Got:
    +----------------+----------------+
    |count(alphabets)|count(alphabets)|
    +----------------+----------------+
    |               3|               3|
    +----------------+----------------+
     {code}


> Fix `nanvl` function
> 
>
> Key: SPARK-41851
> URL: https://issues.apache.org/jira/browse/SPARK-41851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 313, in pyspark.sql.connect.functions.nanvl
> Failed example:
>     df.select(nanvl("a", "b").alias("r1"), nanvl(df.a, 
> df.b).alias("r2")).collect()
> Expected:
>     [Row(r1=1.0, r2=1.0), Row(r1=2.0, r2=2.0)]
> Got:
>     [Row(r1=1.0, r2=1.0), Row(r1=nan, r2=nan)]{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1411, in pyspark.sql.connect.functions.map_filter
Failed example:
    df.select(map_filter(
        "data", lambda _, v: v > 30.0).alias("data_filtered")
    ).show(truncate=False)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(map_filter(
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
 

[jira] [Commented] (SPARK-41850) Fix `isnan` function

2023-01-02 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653738#comment-17653738
 ] 

Sandeep Singh commented on SPARK-41850:
---

This should be moved under SPARK-41283

> Fix `isnan` function
> 
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 288, in pyspark.sql.connect.functions.isnan
> Failed example:
>     df.select("a", "b", isnan("a").alias("r1"), 
> isnan(df.b).alias("r2")).show()
> Expected:
>     +---+---+-----+-----+
>     |  a|  b|   r1|   r2|
>     +---+---+-----+-----+
>     |1.0|NaN|false| true|
>     |NaN|2.0| true|false|
>     +---+---+-----+-----+
> Got:
>     +----+----+-----+-----+
>     |   a|   b|   r1|   r2|
>     +----+----+-----+-----+
>     | 1.0|null|false|false|
>     |null| 2.0|false|false|
>     +----+----+-----+-----+
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41850) Fix `isnan` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41850:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 288, in pyspark.sql.connect.functions.isnan
Failed example:
    df.select("a", "b", isnan("a").alias("r1"), isnan(df.b).alias("r2")).show()
Expected:
    +---+---+-----+-----+
    |  a|  b|   r1|   r2|
    +---+---+-----+-----+
    |1.0|NaN|false| true|
    |NaN|2.0| true|false|
    +---+---+-----+-----+
Got:
    +----+----+-----+-----+
    |   a|   b|   r1|   r2|
    +----+----+-----+-----+
    | 1.0|null|false|false|
    |null| 2.0|false|false|
    +----+----+-----+-----+
    {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}


> Fix `isnan` function
> 
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 288, in pyspark.sql.connect.functions.isnan
> Failed example:
>     df.select("a", "b", isnan("a").alias("r1"), 
> isnan(df.b).alias("r2")).show()
> Expected:
>     +---+---+-----+-----+
>     |  a|  b|   r1|   r2|
>     +---+---+-----+-----+
>     |1.0|NaN|false| true|
>     |NaN|2.0| true|false|
>     +---+---+-----+-----+
> Got:
>     +----+----+-----+-----+
>     |   a|   b|   r1|   r2|
>     +----+----+-----+-----+
>     | 1.0|null|false|false|
>     |null| 2.0|false|false|
>     +----+----+-----+-----+
>     {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41850) Fix `isnan` function

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41850:
--
Summary: Fix `isnan` function  (was: Fix DataFrameReader.isnan)

> Fix `isnan` function
> 
>
> Key: SPARK-41850
> URL: https://issues.apache.org/jira/browse/SPARK-41850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 
> 1, in 
>         df = spark.read.text(path)
>     AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41850) Fix DataFrameReader.isnan

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41850:
-

 Summary: Fix DataFrameReader.isnan
 Key: SPARK-41850
 URL: https://issues.apache.org/jira/browse/SPARK-41850
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41849) Implement DataFrameReader.text

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41849:
-

 Summary: Implement DataFrameReader.text
 Key: SPARK-41849
 URL: https://issues.apache.org/jira/browse/SPARK-41849
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41849) Implement DataFrameReader.text

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41849:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 276, in pyspark.sql.connect.functions.input_file_name
Failed example:
    df = spark.read.text(path)
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 
1, in 
        df = spark.read.text(path)
    AttributeError: 'DataFrameReader' object has no attribute 'text'{code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}


> Implement DataFrameReader.text
> --
>
> Key: SPARK-41849
> URL: https://issues.apache.org/jira/browse/SPARK-41849
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 276, in pyspark.sql.connect.functions.input_file_name
> Failed example:
>     df = spark.read.text(path)
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  li

[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1364, in pyspark.sql.connect.functions.inline
Failed example:
    df.select(inline(df.structlist)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.select(inline(df.structlist)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `structlist`.`element` is 
of type "ARRAY" while it's required to be "STRUCT".
    Plan:  {code}
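Similarly, a sketch of the failing inline() case under the same assumptions (Spark Connect session named {{spark}}, illustrative data):
{code:python}
from pyspark.sql import Row
from pyspark.sql.functions import inline

# Column holding an array of structs. The error above suggests that on
# Spark Connect the array element is inferred as ARRAY instead of STRUCT,
# so inline() is rejected with INVALID_COLUMN_OR_FIELD_DATA_TYPE.
df = spark.createDataFrame([Row(structlist=[Row(a=1, b=2), Row(a=3, b=4)])])
df.select(inline(df.structlist)).show()
{code}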

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
   

[jira] [Updated] (SPARK-41847) DataFrame mapfield,structlist invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Summary: DataFrame mapfield,structlist invalid type  (was: DataFrame 
mapfield invalid type)

> DataFrame mapfield,structlist invalid type
> --
>
> Key: SPARK-41847
> URL: https://issues.apache.org/jira/browse/SPARK-41847
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1270, in pyspark.sql.connect.functions.explode
> Failed example:
>     eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
> "STRUCT" while it's required to be "MAP".
>     Plan:  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41847) DataFrame mapfield invalid type

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41847:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1270, in pyspark.sql.connect.functions.explode
Failed example:
    eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[INVALID_COLUMN_OR_FIELD_DATA_TYPE] Column or field `mapfield` is of type 
"STRUCT" while it's required to be "MAP".
    Plan:  {code}

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
drank#4003]
    +- Project [0#3998L AS _1#4000L]
       +- LocalRelation [0#3998L] {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1032, in pyspark.sql.connect.functions.cume_dist
Failed example:
    df.withColumn("cd", cume_dist().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("cd", cume_dist().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 

[jira] [Created] (SPARK-41847) DataFrame mapfield invalid type

2023-01-02 Thread Sandeep Singh (Jira)
Sandeep Singh created SPARK-41847:
-

 Summary: DataFrame mapfield invalid type
 Key: SPARK-41847
 URL: https://issues.apache.org/jira/browse/SPARK-41847
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Sandeep Singh


{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
drank#4003]
    +- Project [0#3998L AS _1#4000L]
       +- LocalRelation [0#3998L] {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1032, in pyspark.sql.connect.functions.cume_dist
Failed example:
    df.withColumn("cd", cume_dist().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("cd", cume_dist().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) 
AS cd#2205]
    +- Project [0#2200L AS _1#2202L]
       +- LocalRelation [0#2200L] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41846) DataFrame windowspec functions : unresolved columns

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41846:
--
Summary: DataFrame windowspec functions : unresolved columns  (was: 
DataFrame aggregation functions : unresolved columns)

> DataFrame windowspec functions : unresolved columns
> ---
>
> Key: SPARK-41846
> URL: https://issues.apache.org/jira/browse/SPARK-41846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1098, in pyspark.sql.connect.functions.rank
> Failed example:
>     df.withColumn("drank", rank().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("drank", rank().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS drank#4003]
>     +- Project [0#3998L AS _1#4000L]
>        +- LocalRelation [0#3998L] {code}
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1032, in pyspark.sql.connect.functions.cume_dist
> Failed example:
>     df.withColumn("cd", cume_dist().over(w)).show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.withColumn("cd", cume_dist().over(w)).show()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 423, in _show_string
>         ).toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `value` cannot be resolved. Did you mean one of the following? [`_1`]
>     Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS cd#2205]
>     +- Project [0#2200L AS _1#2202L]
>        +- LocalRelation [0#2200L] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...

[jira] [Updated] (SPARK-41846) DataFrame aggregation functions : unresolved columns

2023-01-02 Thread Sandeep Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh updated SPARK-41846:
--
Description: 
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#4000L, rank() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
drank#4003]
    +- Project [0#3998L AS _1#4000L]
       +- LocalRelation [0#3998L] {code}
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1032, in pyspark.sql.connect.functions.cume_dist
Failed example:
    df.withColumn("cd", cume_dist().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("cd", cume_dist().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 1031, in toPandas
        return self._session.client.to_pandas(query)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
413, in to_pandas
        return self._execute_and_fetch(req)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
573, in _execute_and_fetch
        self._handle_error(rpc_error)
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 
619, in _handle_error
        raise SparkConnectAnalysisException(
    pyspark.sql.connect.client.SparkConnectAnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`value` cannot be resolved. Did you mean one of the following? [`_1`]
    Plan: 'Project [_1#2202L, cume_dist() windowspecdefinition('value ASC NULLS 
FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) 
AS cd#2205]
    +- Project [0#2200L AS _1#2202L]
       +- LocalRelation [0#2200L] {code}
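For reference, a minimal sketch of the failing window-function doctests, assuming a Spark Connect session named {{spark}}; the data below is illustrative:
{code:python}
from pyspark.sql import Window
from pyspark.sql.functions import rank, cume_dist
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([1, 1, 2, 3, 3, 4], IntegerType())
w = Window.orderBy("value")

# The plans above show the Connect-created column named `_1` rather than
# `value` (Project [0#3998L AS _1#4000L]), so both calls fail with
# UNRESOLVED_COLUMN.WITH_SUGGESTION.
df.withColumn("drank", rank().over(w)).show()
df.withColumn("cd", cume_dist().over(w)).show()
{code}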

  was:
{code:java}
File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
line 1098, in pyspark.sql.connect.functions.rank
Failed example:
    df.withColumn("drank", rank().over(w)).show()
Exception raised:
    Traceback (most recent call last):
      File 
"/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
 line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "", line 1, in 

        df.withColumn("drank", rank().over(w)).show()
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 534, in show
        print(self._show_string(n, truncate, vertical))
      File 
"/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
line 423, in _show_string
        ).toPandas()
      File 
"/Users/s.sin
