[jira] [Created] (SPARK-41961) Support table-valued functions with LATERAL

2023-01-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-41961:


 Summary: Support table-valued functions with LATERAL
 Key: SPARK-41961
 URL: https://issues.apache.org/jira/browse/SPARK-41961
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Allison Wang


Support table-valued functions with LATERAL subqueries. For example:

{{select * from t, lateral explode(array(t.c1, t.c2))}}

Currently, this query throws a parse exception.
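
For context, a minimal sketch (illustrative table and data, not from the ticket) contrasting the LATERAL VIEW syntax that works today with the lateral table-valued function call this ticket proposes:

{code:scala}
// Illustrative only: a small table `t` with two integer columns.
import spark.implicits._
val t = Seq((1, 2), (3, 4)).toDF("c1", "c2")
t.createOrReplaceTempView("t")

// Works today: the equivalent LATERAL VIEW generator syntax.
spark.sql("SELECT * FROM t LATERAL VIEW explode(array(c1, c2)) tmp AS col").show()

// Proposed by this ticket; currently fails with a parse exception.
// spark.sql("SELECT * FROM t, LATERAL explode(array(t.c1, t.c2))").show()
{code}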



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission

2023-01-09 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656480#comment-17656480
 ] 

Mridul Muralidharan edited comment on SPARK-41953 at 1/10/23 7:48 AM:
--

A few things:

* Looking at SPARK-27637, we should revisit it - the {{IsExecutorAlive}} 
request does not make sense in the case of dynamic resource allocation (DRA) when 
an external shuffle service (ESS) is enabled: we should not be making that 
call. Thoughts [~Ngone51]? +CC [~csingh]
This also means that relying on ExecutorDeadException when DRA is enabled with an 
ESS configured won't be useful.

For the rest of the proposal ...

For (2), I am not sure about 'Make MapOutputTracker support fetching the latest 
output without an epoch provided.' - this could have nontrivial interactions with 
other things, and I will need to think through it. I am not sure whether we can model 
node decommission - where a block moves from host A to host B without any 
other change - as not requiring an epoch update (or rather, flagging the epochs as 
'compatible' if there are no interleaving updates); this requires analysis.

Assuming we sort out how to get updated state, (3) looks like a reasonable 
approach.





was (Author: mridulm80):

A few things:

* Looking at SPARK-27637, we should revisit it - the {{IsExecutorAlive}} 
request does not make sense in the case of dynamic resource allocation (DRA) when 
an external shuffle service (ESS) is enabled: we should not be making that 
call. Thoughts [~Ngone51]? +CC [~csingh]
This also means that relying on ExecutorDeadException when DRA is enabled with an 
ESS configured won't be useful.

For the rest of the proposal ...

For (2), I am not sure about 'Make MapOutputTracker support fetching the latest 
output without an epoch provided.' - this could have nontrivial interactions with 
other things, and I will need to think through it. I am not sure whether we can model 
node decommission - where a block moves from host A to host B without any 
other change - as not requiring an epoch update; this requires analysis.

Assuming we sort out how to get updated state, (3) looks like a reasonable 
approach.




> Shuffle output location refetch during shuffle migration in decommission
> 
>
> Key: SPARK-41953
> URL: https://issues.apache.org/jira/browse/SPARK-41953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When shuffle migration is enabled during Spark decommission, shuffle data will 
> be migrated to live executors, and the latest location is then updated in 
> MapOutputTracker. This has some issues:
>  # Executors only fetch map output locations at the beginning of the reduce 
> stage, so any shuffle output location change in the middle of the reduce will 
> cause FetchFailed as reducers fetch from the old location. Even though stage 
> retries can solve this, they still cause a lot of resource waste, as all shuffle 
> reads and compute that happened before the FetchFailed partition will be wasted.
>  # During stage retries, fewer running tasks cause more executors to be 
> decommissioned, and shuffle data locations keep changing. In the worst case, the 
> stage could need many retries, further breaking the SLA.
> So I propose to support refetching the map output location during the reduce 
> phase if shuffle migration is enabled and FetchFailed is caused by a 
> decommissioned dead executor. The detailed steps are as below:
>  # When `BlockTransferService` fails to fetch blocks from a decommissioned dead 
> executor, ExecutorDeadException (with isDecommission set to true) will be thrown.
>  # Make MapOutputTracker support fetching the latest output without an epoch provided.
>  # `ShuffleBlockFetcherIterator` will refetch the latest output from 
> MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned 
> executor, there should be a new location on another executor. If not, throw an 
> exception as today. If yes, create new local and remote requests to fetch these 
> migrated shuffle blocks. The flow will be similar to the fallback fetch when a 
> push-merged fetch fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission

2023-01-09 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656480#comment-17656480
 ] 

Mridul Muralidharan commented on SPARK-41953:
-


A few things:

* Looking at SPARK-27637, we should revisit it - the {{IsExecutorAlive}} 
request does not make sense in the case of dynamic resource allocation (DRA) when 
an external shuffle service (ESS) is enabled: we should not be making that 
call. Thoughts [~Ngone51]? +CC [~csingh]
This also means that relying on ExecutorDeadException when DRA is enabled with an 
ESS configured won't be useful.

For the rest of the proposal ...

For (2), I am not sure about 'Make MapOutputTracker support fetching the latest 
output without an epoch provided.' - this could have nontrivial interactions with 
other things, and I will need to think through it. I am not sure whether we can model 
node decommission - where a block moves from host A to host B without any 
other change - as not requiring an epoch update; this requires analysis.

Assuming we sort out how to get updated state, (3) looks like a reasonable 
approach.
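
(For readers following the thread: a rough sketch of the retry flow being discussed, where every name below is a hypothetical placeholder rather than an actual Spark API.)

{code:scala}
// Hypothetical illustration of steps (1)-(3) of the proposal; none of these
// names or signatures are real Spark APIs.
final case class BlockLocation(executorId: String, host: String)

// Step (2), hypothetically: ask the tracker for the latest locations without
// supplying an epoch.
def refetchLatestLocations(shuffleId: Int): Map[Long, BlockLocation] =
  Map.empty // placeholder

// Step (3), hypothetically: retry only when the failure came from a
// decommissioned executor and a migrated copy exists; otherwise fail as today.
def retryFetch(shuffleId: Int, mapId: Long, fromDecommissionedExecutor: Boolean): Unit = {
  if (!fromDecommissionedExecutor) throw new RuntimeException("FetchFailed")
  refetchLatestLocations(shuffleId).get(mapId) match {
    case Some(loc) => println(s"re-fetching map output $mapId from ${loc.host}")
    case None => throw new RuntimeException("FetchFailed") // no migrated copy
  }
}
{code}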




> Shuffle output location refetch during shuffle migration in decommission
> 
>
> Key: SPARK-41953
> URL: https://issues.apache.org/jira/browse/SPARK-41953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When shuffle migration is enabled during Spark decommission, shuffle data will 
> be migrated to live executors, and the latest location is then updated in 
> MapOutputTracker. This has some issues:
>  # Executors only fetch map output locations at the beginning of the reduce 
> stage, so any shuffle output location change in the middle of the reduce will 
> cause FetchFailed as reducers fetch from the old location. Even though stage 
> retries can solve this, they still cause a lot of resource waste, as all shuffle 
> reads and compute that happened before the FetchFailed partition will be wasted.
>  # During stage retries, fewer running tasks cause more executors to be 
> decommissioned, and shuffle data locations keep changing. In the worst case, the 
> stage could need many retries, further breaking the SLA.
> So I propose to support refetching the map output location during the reduce 
> phase if shuffle migration is enabled and FetchFailed is caused by a 
> decommissioned dead executor. The detailed steps are as below:
>  # When `BlockTransferService` fails to fetch blocks from a decommissioned dead 
> executor, ExecutorDeadException (with isDecommission set to true) will be thrown.
>  # Make MapOutputTracker support fetching the latest output without an epoch provided.
>  # `ShuffleBlockFetcherIterator` will refetch the latest output from 
> MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned 
> executor, there should be a new location on another executor. If not, throw an 
> exception as today. If yes, create new local and remote requests to fetch these 
> migrated shuffle blocks. The flow will be similar to the fallback fetch when a 
> push-merged fetch fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41934) Add the unsupported function list for `session`

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656465#comment-17656465
 ] 

Apache Spark commented on SPARK-41934:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39478

> Add the unsupported function list for `session`
> ---
>
> Key: SPARK-41934
> URL: https://issues.apache.org/jira/browse/SPARK-41934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time

2023-01-09 Thread Enrico Minack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656459#comment-17656459
 ] 

Enrico Minack edited comment on SPARK-40885 at 1/10/23 7:02 AM:


This should be fixed by SPARK-41959 / 
[#39475|https://github.com/apache/spark/pull/39475].


was (Author: enricomi):
This should be fixed by SPARK-41959.

> Spark will filter out data field sorting when dynamic partitions and data 
> fields are sorted at the same time
> 
>
> Key: SPARK-40885
> URL: https://issues.apache.org/jira/browse/SPARK-40885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.3.0, 3.2.2
>Reporter: ming95
>Priority: Major
> Attachments: 1666494504884.jpg
>
>
> When using dynamic partitions to write data and sorting by both partition and 
> data fields, Spark will filter out the sorting of the data fields.
>  
> Reproduce SQL:
> {code:java}
> CREATE TABLE `sort_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'sort_table';
> 
> CREATE TABLE `test_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'test_table';
> 
> -- gen test data
> insert into test_table partition(dt=20221011) select 10,"15" union all select 
> 1,"10" union all select 5,"50" union all select 20,"2" union all select 
> 30,"14";
> 
> set spark.hadoop.hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> 
> -- this sql sorts by the partition field (`dt`) and the data field (`name`),
> -- but the sort by `name` does not take effect
> insert overwrite table sort_table partition(dt) select id,name,dt from 
> test_table order by name,dt;
> {code}
>  
> The Sort operator in the DAG has only one sort field, but there are actually two 
> in the SQL. (See the attached drawing.)
>  
> It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time

2023-01-09 Thread Enrico Minack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656459#comment-17656459
 ] 

Enrico Minack edited comment on SPARK-40885 at 1/10/23 7:00 AM:


This should be fixed by SPARK-41959.


was (Author: enricomi):
This should be fixed by SPARK-40885.

> Spark will filter out data field sorting when dynamic partitions and data 
> fields are sorted at the same time
> 
>
> Key: SPARK-40885
> URL: https://issues.apache.org/jira/browse/SPARK-40885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.3.0, 3.2.2
>Reporter: ming95
>Priority: Major
> Attachments: 1666494504884.jpg
>
>
> When using dynamic partitions to write data and sorting by both partition and 
> data fields, Spark will filter out the sorting of the data fields.
>  
> Reproduce SQL:
> {code:java}
> CREATE TABLE `sort_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'sort_table';
> 
> CREATE TABLE `test_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'test_table';
> 
> -- gen test data
> insert into test_table partition(dt=20221011) select 10,"15" union all select 
> 1,"10" union all select 5,"50" union all select 20,"2" union all select 
> 30,"14";
> 
> set spark.hadoop.hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> 
> -- this sql sorts by the partition field (`dt`) and the data field (`name`),
> -- but the sort by `name` does not take effect
> insert overwrite table sort_table partition(dt) select id,name,dt from 
> test_table order by name,dt;
> {code}
>  
> The Sort operator in the DAG has only one sort field, but there are actually two 
> in the SQL. (See the attached drawing.)
>  
> It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time

2023-01-09 Thread Enrico Minack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656459#comment-17656459
 ] 

Enrico Minack commented on SPARK-40885:
---

This should be fixed by SPARK-40885.

> Spark will filter out data field sorting when dynamic partitions and data 
> fields are sorted at the same time
> 
>
> Key: SPARK-40885
> URL: https://issues.apache.org/jira/browse/SPARK-40885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.3.0, 3.2.2
>Reporter: ming95
>Priority: Major
> Attachments: 1666494504884.jpg
>
>
> When using dynamic partitions to write data and sorting by both partition and 
> data fields, Spark will filter out the sorting of the data fields.
>  
> Reproduce SQL:
> {code:java}
> CREATE TABLE `sort_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'sort_table';
> 
> CREATE TABLE `test_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'test_table';
> 
> -- gen test data
> insert into test_table partition(dt=20221011) select 10,"15" union all select 
> 1,"10" union all select 5,"50" union all select 20,"2" union all select 
> 30,"14";
> 
> set spark.hadoop.hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> 
> -- this sql sorts by the partition field (`dt`) and the data field (`name`),
> -- but the sort by `name` does not take effect
> insert overwrite table sort_table partition(dt) select id,name,dt from 
> test_table order by name,dt;
> {code}
>  
> The Sort operator in the DAG has only one sort field, but there are actually two 
> in the SQL. (See the attached drawing.)
>  
> It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41960) Assign name to _LEGACY_ERROR_TEMP_1056

2023-01-09 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-41960:
---

 Summary: Assign name to _LEGACY_ERROR_TEMP_1056
 Key: SPARK-41960
 URL: https://issues.apache.org/jira/browse/SPARK-41960
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Haejoon Lee


Assign name to _LEGACY_ERROR_TEMP_1056



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41595) Support generator function explode/explode_outer in the FROM clause

2023-01-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41595.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39133
[https://github.com/apache/spark/pull/39133]

> Support generator function explode/explode_outer in the FROM clause
> ---
>
> Key: SPARK-41595
> URL: https://issues.apache.org/jira/browse/SPARK-41595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, the table-valued generator function explode/explode_outer can only 
> be used in the SELECT clause of a query:
> SELECT explode(array(1, 2))
> This task is to allow table-valued functions to be used in the FROM clause of 
> a query:
> SELECT * FROM explode(array(1, 2))



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41595) Support generator function explode/explode_outer in the FROM clause

2023-01-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41595:
---

Assignee: Allison Wang

> Support generator function explode/explode_outer in the FROM clause
> ---
>
> Key: SPARK-41595
> URL: https://issues.apache.org/jira/browse/SPARK-41595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Currently, the table-valued generator function explode/explode_outer can only 
> be used in the SELECT clause of a query:
> SELECT explode(array(1, 2))
> This task is to allow table-valued functions to be used in the FROM clause of 
> a query:
> SELECT * FROM explode(array(1, 2))



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41872) Fix DataFrame createDataframe handling of None

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656437#comment-17656437
 ] 

Apache Spark commented on SPARK-41872:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39477

> Fix DataFrame createDataframe handling of None
> --
>
> Key: SPARK-41872
> URL: https://issues.apache.org/jira/browse/SPARK-41872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> row = self.spark.createDataFrame([("Alice", None, None, None)], 
> schema).fillna(True).first()
> self.assertEqual(row.age, None){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 231, in test_fillna
>     self.assertEqual(row.age, None)
> AssertionError: nan != None{code}
>  
> {code:java}
> row = (
> self.spark.createDataFrame([("Alice", 10, None)], schema)
> .replace(10, 20, subset=["name", "height"])
> .first()
> )
> self.assertEqual(row.name, "Alice")
> self.assertEqual(row.age, 10)
> self.assertEqual(row.height, None) {code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace
>     self.assertEqual(row.height, None)
> AssertionError: nan != None
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41872) Fix DataFrame createDataframe handling of None

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41872:


Assignee: Apache Spark

> Fix DataFrame createDataframe handling of None
> --
>
> Key: SPARK-41872
> URL: https://issues.apache.org/jira/browse/SPARK-41872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> row = self.spark.createDataFrame([("Alice", None, None, None)], 
> schema).fillna(True).first()
> self.assertEqual(row.age, None){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 231, in test_fillna
>     self.assertEqual(row.age, None)
> AssertionError: nan != None{code}
>  
> {code:java}
> row = (
> self.spark.createDataFrame([("Alice", 10, None)], schema)
> .replace(10, 20, subset=["name", "height"])
> .first()
> )
> self.assertEqual(row.name, "Alice")
> self.assertEqual(row.age, 10)
> self.assertEqual(row.height, None) {code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace
>     self.assertEqual(row.height, None)
> AssertionError: nan != None
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41872) Fix DataFrame createDataframe handling of None

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41872:


Assignee: (was: Apache Spark)

> Fix DataFrame createDataframe handling of None
> --
>
> Key: SPARK-41872
> URL: https://issues.apache.org/jira/browse/SPARK-41872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> row = self.spark.createDataFrame([("Alice", None, None, None)], 
> schema).fillna(True).first()
> self.assertEqual(row.age, None){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 231, in test_fillna
>     self.assertEqual(row.age, None)
> AssertionError: nan != None{code}
>  
> {code:java}
> row = (
> self.spark.createDataFrame([("Alice", 10, None)], schema)
> .replace(10, 20, subset=["name", "height"])
> .first()
> )
> self.assertEqual(row.name, "Alice")
> self.assertEqual(row.age, 10)
> self.assertEqual(row.height, None) {code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 372, in test_replace
>     self.assertEqual(row.height, None)
> AssertionError: nan != None
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41957) Enable the doctest for `DataFrame.hint`

2023-01-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41957:
-

Assignee: Ruifeng Zheng

> Enable the doctest for `DataFrame.hint`
> ---
>
> Key: SPARK-41957
> URL: https://issues.apache.org/jira/browse/SPARK-41957
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41957) Enable the doctest for `DataFrame.hint`

2023-01-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41957.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39472
[https://github.com/apache/spark/pull/39472]

> Enable the doctest for `DataFrame.hint`
> ---
>
> Key: SPARK-41957
> URL: https://issues.apache.org/jira/browse/SPARK-41957
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41907) Function `sampleby` return parity

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656431#comment-17656431
 ] 

Apache Spark commented on SPARK-41907:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39476

> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41907) Function `sampleby` return parity

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41907:


Assignee: (was: Apache Spark)

> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41907) Function `sampleby` return parity

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656430#comment-17656430
 ] 

Apache Spark commented on SPARK-41907:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39476

> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41907) Function `sampleby` return parity

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41907:


Assignee: Apache Spark

> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-41907) Function `sampleby` return parity

2023-01-09 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41907 ]


jiaan.geng deleted comment on SPARK-41907:


was (Author: beliefer):
I want to take a look.

> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41907) Function `sampleby` return parity

2023-01-09 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656429#comment-17656429
 ] 

jiaan.geng commented on SPARK-41907:


The root cause is that the plan is different from the PySpark one, so the result 
is not deterministic.

The plan from PySpark:
{code:java}
== Physical Plan ==
* Filter (2)
+- * Scan ExistingRDD (1)


(1) Scan ExistingRDD [codegen id : 1]
Output [2]: [a#4L, b#5L]
Arguments: [a#4L, b#5L], MapPartitionsRDD[9] at applySchemaToPythonRDD at 
NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(2) Filter [codegen id : 1]
Input [2]: [a#4L, b#5L]
Condition : UDF(b#5L, rand(0))
{code}

The plan from Connect:

{code:java}
== Physical Plan ==
LocalTableScan (1)


(1) LocalTableScan
Output [2]: [a#5L, b#6L]
Arguments: [a#5L, b#6L]
{code}
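
For context, {{sampleBy}} is essentially a per-stratum filter over {{rand(seed)}} (the {{Filter ... Condition : UDF(b#5L, rand(0))}} above), so the exact row count depends on which rows the plan feeds to that random draw. A rough, non-authoritative sketch of the idea:

{code:scala}
// Rough illustration only of a per-stratum sample over rand(seed); not the
// actual sampleBy implementation.
import org.apache.spark.sql.functions.{rand, udf}
import spark.implicits._

val df = spark.range(100).selectExpr("id AS a", "id % 3 AS b")
val fractions = Map(0L -> 0.5, 1L -> 0.5)          // strata not listed are dropped
val keepFraction = udf((b: Long) => fractions.getOrElse(b, 0.0))

val sampled = df.where(rand(0) < keepFraction($"b"))
sampled.count()  // depends on the physical plan / row order, hence the parity gap
{code}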


> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41914) Sorting issue with partitioned-writing and planned write optimization disabled

2023-01-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41914.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39431
[https://github.com/apache/spark/pull/39431]

> Sorting issue with partitioned-writing and planned write optimization disabled
> --
>
> Key: SPARK-41914
> URL: https://issues.apache.org/jira/browse/SPARK-41914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark 3.4.0 introduced option 
> {{{}spark.sql.optimizer.plannedWrite.enabled{}}}, which is enabled by 
> default. When disabled, partitioned writing loses in-partition order when 
> spilling occurs.
> This is related to SPARK-40885 where setting option 
> {{spark.sql.optimizer.plannedWrite.enabled}} to {{true}} will remove the 
> existing sort (for {{day}} and {{{}id{}}}) entirely.
> Run this with 512m memory and one executor, e.g.:
> {code}
> spark-shell --driver-memory 512m --master "local[1]"
> {code}
> {code:scala}
> import org.apache.spark.sql.SaveMode
> spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", false)
> val ids = 200
> val days = 2
> val parts = 2
> val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", 
> "day").join(spark.range(0, ids, 1, parts))
> ds.repartition($"day")
>   .sortWithinPartitions($"day", $"id")
>   .write
>   .partitionBy("day")
>   .mode(SaveMode.Overwrite)
>   .csv("interleaved.csv")
> {code}
> Check that the written files are sorted (prints OK when a file is sorted):
> {code:bash}
> for file in interleaved.csv/day\=*/part-*
> do
>   echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
> done | md5sum -c
> {code}
> Files should look like this
> {code}
> 0
> 1
> 2
> ...
> 1048576
> 1048577
> 1048578
> ...
> {code}
> But they look like
> {code}
> 0
> 1048576
> 1
> 1048577
> 2
> 1048578
> ...
> {code}
> The root cause is the same as in SPARK-40588. A sort (for {{{}day{}}}) is 
> added on top of the existing sort (for {{day}} and {{{}id{}}}). Spilling 
> interleaves the sorted spill files.
> {code}
> Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0
> +- AdaptiveSparkPlan isFinalPlan=false
>+- Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(day#2L, 200), REPARTITION_BY_COL, 
> [plan_id=30]
>  +- BroadcastNestedLoopJoin BuildLeft, Inner
> :- BroadcastExchange IdentityBroadcastMode, [plan_id=28]
> :  +- Project [id#0L AS day#2L]
> : +- Range (0, 2, step=1, splits=2)
> +- Range (0, 200, step=1, splits=2)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41914) Sorting issue with partitioned-writing and planned write optimization disabled

2023-01-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41914:
---

Assignee: Enrico Minack

> Sorting issue with partitioned-writing and planned write optimization disabled
> --
>
> Key: SPARK-41914
> URL: https://issues.apache.org/jira/browse/SPARK-41914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
>
> Spark 3.4.0 introduced option 
> {{{}spark.sql.optimizer.plannedWrite.enabled{}}}, which is enabled by 
> default. When disabled, partitioned writing loses in-partition order when 
> spilling occurs.
> This is related to SPARK-40885 where setting option 
> {{spark.sql.optimizer.plannedWrite.enabled}} to {{true}} will remove the 
> existing sort (for {{day}} and {{{}id{}}}) entirely.
> Run this with 512m memory and one executor, e.g.:
> {code}
> spark-shell --driver-memory 512m --master "local[1]"
> {code}
> {code:scala}
> import org.apache.spark.sql.SaveMode
> spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", false)
> val ids = 200
> val days = 2
> val parts = 2
> val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", 
> "day").join(spark.range(0, ids, 1, parts))
> ds.repartition($"day")
>   .sortWithinPartitions($"day", $"id")
>   .write
>   .partitionBy("day")
>   .mode(SaveMode.Overwrite)
>   .csv("interleaved.csv")
> {code}
> Check that the written files are sorted (prints OK when a file is sorted):
> {code:bash}
> for file in interleaved.csv/day\=*/part-*
> do
>   echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
> done | md5sum -c
> {code}
> Files should look like this
> {code}
> 0
> 1
> 2
> ...
> 1048576
> 1048577
> 1048578
> ...
> {code}
> But they look like
> {code}
> 0
> 1048576
> 1
> 1048577
> 2
> 1048578
> ...
> {code}
> The root cause is the same as in SPARK-40588. A sort (for {{{}day{}}}) is 
> added on top of the existing sort (for {{day}} and {{{}id{}}}). Spilling 
> interleaves the sorted spill files.
> {code}
> Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0
> +- AdaptiveSparkPlan isFinalPlan=false
>+- Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(day#2L, 200), REPARTITION_BY_COL, 
> [plan_id=30]
>  +- BroadcastNestedLoopJoin BuildLeft, Inner
> :- BroadcastExchange IdentityBroadcastMode, [plan_id=28]
> :  +- Project [id#0L AS day#2L]
> : +- Range (0, 2, step=1, splits=2)
> +- Range (0, 200, step=1, splits=2)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41959) Improve v1 writes with empty2null

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656416#comment-17656416
 ] 

Apache Spark commented on SPARK-41959:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/39475

> Improve v1 writes with empty2null
> -
>
> Key: SPARK-41959
> URL: https://issues.apache.org/jira/browse/SPARK-41959
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41959) Improve v1 writes with empty2null

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41959:


Assignee: (was: Apache Spark)

> Improve v1 writes with empty2null
> -
>
> Key: SPARK-41959
> URL: https://issues.apache.org/jira/browse/SPARK-41959
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41959) Improve v1 writes with empty2null

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41959:


Assignee: Apache Spark

> Improve v1 writes with empty2null
> -
>
> Key: SPARK-41959
> URL: https://issues.apache.org/jira/browse/SPARK-41959
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41959) Improve v1 writes with empty2null

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656415#comment-17656415
 ] 

Apache Spark commented on SPARK-41959:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/39475

> Improve v1 writes with empty2null
> -
>
> Key: SPARK-41959
> URL: https://issues.apache.org/jira/browse/SPARK-41959
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41959) Improve v1 writes with empty2null

2023-01-09 Thread XiDuo You (Jira)
XiDuo You created SPARK-41959:
-

 Summary: Improve v1 writes with empty2null
 Key: SPARK-41959
 URL: https://issues.apache.org/jira/browse/SPARK-41959
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41887) Support DataFrame hint parameter to be list

2023-01-09 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656407#comment-17656407
 ] 

Ruifeng Zheng commented on SPARK-41887:
---

I am going to work on this one.

> Support DataFrame hint parameter to be list
> ---
>
> Key: SPARK-41887
> URL: https://issues.apache.org/jira/browse/SPARK-41887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.range(10e10).toDF("id")
> such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
> hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656397#comment-17656397
 ] 

Apache Spark commented on SPARK-41958:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/39474

> Disallow arbitrary custom classpath with proxy user in cluster mode
> ---
>
> Key: SPARK-41958
> URL: https://issues.apache.org/jira/browse/SPARK-41958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3
>Reporter: wuyi
>Priority: Major
>
> To avoid arbitrary classpaths in the Spark cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656396#comment-17656396
 ] 

Apache Spark commented on SPARK-41958:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/39474

> Disallow arbitrary custom classpath with proxy user in cluster mode
> ---
>
> Key: SPARK-41958
> URL: https://issues.apache.org/jira/browse/SPARK-41958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3
>Reporter: wuyi
>Priority: Major
>
> To avoid arbitrary classpaths in the Spark cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41958:


Assignee: Apache Spark

> Disallow arbitrary custom classpath with proxy user in cluster mode
> ---
>
> Key: SPARK-41958
> URL: https://issues.apache.org/jira/browse/SPARK-41958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> To avoid arbitrary classpaths in the Spark cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41958:


Assignee: (was: Apache Spark)

> Disallow arbitrary custom classpath with proxy user in cluster mode
> ---
>
> Key: SPARK-41958
> URL: https://issues.apache.org/jira/browse/SPARK-41958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3
>Reporter: wuyi
>Priority: Major
>
> To avoid arbitrary classpaths in the Spark cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode

2023-01-09 Thread wuyi (Jira)
wuyi created SPARK-41958:


 Summary: Disallow arbitrary custom classpath with proxy user in 
cluster mode
 Key: SPARK-41958
 URL: https://issues.apache.org/jira/browse/SPARK-41958
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.3, 3.3.1, 3.1.3, 3.0.3, 2.4.8
Reporter: wuyi


To avoid arbitrary classpaths in the Spark cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41919) Unify the schema or datatype in protos

2023-01-09 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656387#comment-17656387
 ] 

Ruifeng Zheng commented on SPARK-41919:
---

Actually, JSON is the standard format for a DataType.

Both the client and the server already support the conversion between JSON and DataType.
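
For reference, a minimal round-trip sketch on the JVM side (illustrative usage of the existing public DataType APIs, not a proposal of new ones):

{code:scala}
import org.apache.spark.sql.types.{DataType, StructType}

// DDL string -> StructType -> JSON string -> DataType, and back to the same type.
val schema: StructType = StructType.fromDDL("a INT, b STRING")
val json: String = schema.json                   // JSON form of the data type
val parsed: DataType = DataType.fromJson(json)   // parse it back on the other side
assert(parsed == schema)
{code}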

> Unify the schema or datatype in protos
> --
>
> Key: SPARK-41919
> URL: https://issues.apache.org/jira/browse/SPARK-41919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> This ticket only focuses on the protos sent from the client to the server.
> We normally use 
> {code:java}
>   oneof schema {
> DataType datatype = 2;
> // Server will use Catalyst parser to parse this string to DataType.
> string datatype_str = 3;
>   }
> {code}
> to represent a schema or datatype.
> Actually, we can simplify it to just a string. On the server, we can easily 
> parse a DDL-formatted schema or a JSON-formatted one.
> {code:java}
>   // (Optional) The schema of local data.
>   // It should be either a DDL-formatted type string or a JSON string.
>   //
>   // The server side will update the column names and data types according to 
> this schema.
>   // If the 'data' is not provided, then this schema will be required.
>   optional string schema = 2;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656380#comment-17656380
 ] 

Apache Spark commented on SPARK-38173:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/39473

> Quoted column cannot be recognized correctly when quotedRegexColumnNames is 
> true
> 
>
> Key: SPARK-38173
> URL: https://issues.apache.org/jira/browse/SPARK-38173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Tongwei
>Assignee: Tongwei
>Priority: Major
> Fix For: 3.3.0
>
>
> When spark.sql.parser.quotedRegexColumnNames=true
> {code:java}
>  SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code}
> The above query will throw an exception
> {code:java}
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
> 'multiply'
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266)
>         at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in 
> expression 'multiply'
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656)
>  

[jira] [Resolved] (SPARK-41943) Use java api to create files and grant permissions in DiskBlockManager

2023-01-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41943.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39448
[https://github.com/apache/spark/pull/39448]

> Use java api to create files and grant permissions in DiskBlockManager
> --
>
> Key: SPARK-41943
> URL: https://issues.apache.org/jira/browse/SPARK-41943
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: jingxiong zhong
>Priority: Minor
> Fix For: 3.4.0
>
>
> For the method {{{}createDirWithPermission770{}}}, use the Java API to create files 
> and grant permissions instead of calling shell commands.
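
For reference, a minimal sketch of creating a directory with 770 permissions through the Java NIO API (illustrative path and standalone code, not the actual DiskBlockManager change):

{code:scala}
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFilePermissions

// Create the directory and set rwxrwx--- (770) without shelling out to chmod.
val dir = Paths.get("/tmp/spark-example-dir")
Files.createDirectories(dir)
Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwxrwx---"))
{code}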



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656381#comment-17656381
 ] 

Apache Spark commented on SPARK-38173:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/39473

> Quoted column cannot be recognized correctly when quotedRegexColumnNames is 
> true
> 
>
> Key: SPARK-38173
> URL: https://issues.apache.org/jira/browse/SPARK-38173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Tongwei
>Assignee: Tongwei
>Priority: Major
> Fix For: 3.3.0
>
>
> When spark.sql.parser.quotedRegexColumnNames=true
> {code:java}
>  SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code}
> The above query will throw an exception
> {code:java}
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
> 'multiply'
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266)
>         at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in 
> expression 'multiply'
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656)
>  
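
For context, a minimal Scala sketch that reproduces the reported configuration and query (it assumes a local SparkSession; before the fix, the last statement is expected to fail with the AnalysisException shown above):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("quoted-regex-cols").getOrCreate()
// spark.sql.parser.quotedRegexColumnNames is an existing SQL config; when true,
// quoted identifiers in a SELECT list are interpreted as regular expressions.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
// `(C3)?+.+` expands to every column except C3; the bug is that the quoted `C1`
// inside the multiplication is also treated as a regex expansion, which triggers
// "Invalid usage of '*' in expression 'multiply'".
spark.sql("SELECT `(C3)?+.+`, `C1` * C2 FROM (SELECT 3 AS C1, 2 AS C2, 1 AS C3) T").show()
spark.stop()
{code}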

[jira] [Commented] (SPARK-41957) Enable the doctest for `DataFrame.hint`

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656378#comment-17656378
 ] 

Apache Spark commented on SPARK-41957:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39472

> Enable the doctest for `DataFrame.hint`
> ---
>
> Key: SPARK-41957
> URL: https://issues.apache.org/jira/browse/SPARK-41957
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41957) Enable the doctest for `DataFrame.hint`

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41957:


Assignee: (was: Apache Spark)

> Enable the doctest for `DataFrame.hint`
> ---
>
> Key: SPARK-41957
> URL: https://issues.apache.org/jira/browse/SPARK-41957
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41957) Enable the doctest for `DataFrame.hint`

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41957:


Assignee: Apache Spark

> Enable the doctest for `DataFrame.hint`
> ---
>
> Key: SPARK-41957
> URL: https://issues.apache.org/jira/browse/SPARK-41957
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41907) Function `sampleby` return parity

2023-01-09 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656376#comment-17656376
 ] 

jiaan.geng commented on SPARK-41907:


I want to take a look.

> Function `sampleby` return parity
> -
>
> Key: SPARK-41907
> URL: https://issues.apache.org/jira/browse/SPARK-41907
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
> sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
> self.assertTrue(sampled.count() == 35){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 202, in test_sampleby
> self.assertTrue(sampled.count() == 35)
> AssertionError: False is not true {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41957) Enable the doctest for `DataFrame.hint`

2023-01-09 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41957:
-

 Summary: Enable the doctest for `DataFrame.hint`
 Key: SPARK-41957
 URL: https://issues.apache.org/jira/browse/SPARK-41957
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41904) Fix Function `nth_value` functions output

2023-01-09 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-41904.

Resolution: Won't Fix

> Fix Function `nth_value` functions output
> -
>
> Key: SPARK-41904
> URL: https://issues.apache.org/jira/browse/SPARK-41904
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> from pyspark.sql import Window
> from pyspark.sql.functions import nth_value
> df = self.spark.createDataFrame(
> [
> ("a", 0, None),
> ("a", 1, "x"),
> ("a", 2, "y"),
> ("a", 3, "z"),
> ("a", 4, None),
> ("b", 1, None),
> ("b", 2, None),
> ],
> schema=("key", "order", "value"),
> )
> w = Window.partitionBy("key").orderBy("order")
> rs = df.select(
> df.key,
> df.order,
> nth_value("value", 2).over(w),
> nth_value("value", 2, False).over(w),
> nth_value("value", 2, True).over(w),
> ).collect()
> expected = [
> ("a", 0, None, None, None),
> ("a", 1, "x", "x", None),
> ("a", 2, "x", "x", "y"),
> ("a", 3, "x", "x", "y"),
> ("a", 4, "x", "x", "y"),
> ("b", 1, None, None, None),
> ("b", 2, None, None, None),
> ]
> for r, ex in zip(sorted(rs), sorted(expected)):
> self.assertEqual(tuple(r), ex[: len(r)]){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 755, in test_nth_value
> self.assertEqual(tuple(r), ex[: len(r)])
> AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')
> First differing element 3:
> None
> 'x'
> - ('a', 1, 'x', None)
> ?   
> + ('a', 1, 'x', 'x')
> ?   ^^^
>  {code}
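
For reference, a minimal Scala sketch of the same window query (the expected values come from the PySpark test above; this assumes a local SparkSession and is only an illustration, not part of the fix):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, nth_value}

val spark = SparkSession.builder().master("local[1]").appName("nth-value-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", 0, None: Option[String]),
  ("a", 1, Some("x")),
  ("a", 2, Some("y")),
  ("a", 3, Some("z"))
).toDF("key", "order", "value")

val w = Window.partitionBy("key").orderBy("order")
df.select(
  col("key"), col("order"),
  nth_value(col("value"), 2).over(w).as("second_value"),                      // counts NULL rows
  nth_value(col("value"), 2, ignoreNulls = true).over(w).as("second_non_null") // skips NULL rows
).show()
spark.stop()
{code}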



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41904) Fix Function `nth_value` functions output

2023-01-09 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656374#comment-17656374
 ] 

jiaan.geng commented on SPARK-41904:


The root cause is https://issues.apache.org/jira/browse/SPARK-41945, which has 
been fixed, so this issue is fixed as well.

> Fix Function `nth_value` functions output
> -
>
> Key: SPARK-41904
> URL: https://issues.apache.org/jira/browse/SPARK-41904
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
>
> {code:java}
> from pyspark.sql import Window
> from pyspark.sql.functions import nth_value
> df = self.spark.createDataFrame(
> [
> ("a", 0, None),
> ("a", 1, "x"),
> ("a", 2, "y"),
> ("a", 3, "z"),
> ("a", 4, None),
> ("b", 1, None),
> ("b", 2, None),
> ],
> schema=("key", "order", "value"),
> )
> w = Window.partitionBy("key").orderBy("order")
> rs = df.select(
> df.key,
> df.order,
> nth_value("value", 2).over(w),
> nth_value("value", 2, False).over(w),
> nth_value("value", 2, True).over(w),
> ).collect()
> expected = [
> ("a", 0, None, None, None),
> ("a", 1, "x", "x", None),
> ("a", 2, "x", "x", "y"),
> ("a", 3, "x", "x", "y"),
> ("a", 4, "x", "x", "y"),
> ("b", 1, None, None, None),
> ("b", 2, None, None, None),
> ]
> for r, ex in zip(sorted(rs), sorted(expected)):
> self.assertEqual(tuple(r), ex[: len(r)]){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py",
>  line 755, in test_nth_value
> self.assertEqual(tuple(r), ex[: len(r)])
> AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x')
> First differing element 3:
> None
> 'x'
> - ('a', 1, 'x', None)
> ?   
> + ('a', 1, 'x', 'x')
> ?   ^^^
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41956) Shuffle output location refetch in ShuffleBlockFetcherIterator

2023-01-09 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41956:


 Summary: Shuffle output location refetch in 
ShuffleBlockFetcherIterator
 Key: SPARK-41956
 URL: https://issues.apache.org/jira/browse/SPARK-41956
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41955) Support fetch latest map output from worker

2023-01-09 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41955:


 Summary:  Support fetch latest map output from worker
 Key: SPARK-41955
 URL: https://issues.apache.org/jira/browse/SPARK-41955
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41954) Add isDecommissioned in ExecutorDeadException

2023-01-09 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41954:


 Summary: Add isDecommissioned in ExecutorDeadException
 Key: SPARK-41954
 URL: https://issues.apache.org/jira/browse/SPARK-41954
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission

2023-01-09 Thread Zhongwei Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656365#comment-17656365
 ] 

Zhongwei Zhu commented on SPARK-41953:
--

[~dongjoon] [~mridulm80] [~Ngone51] Any comments on this?

> Shuffle output location refetch during shuffle migration in decommission
> 
>
> Key: SPARK-41953
> URL: https://issues.apache.org/jira/browse/SPARK-41953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When shuffle migration is enabled during Spark decommission, shuffle data is 
> migrated to live executors and the latest locations are updated in 
> MapOutputTracker. This has some issues:
>  # Executors only fetch map output locations at the beginning of the reduce 
> stage, so any shuffle output location change in the middle of the reduce 
> stage causes FetchFailed, because the reducer fetches from the old location. 
> Even though stage retries can recover from this, they still waste a lot of 
> resources, since all shuffle reads and computation done before the 
> FetchFailed partition are discarded.
>  # During stage retries, fewer running tasks cause more executors to be 
> decommissioned, so shuffle data locations keep changing. In the worst case, 
> a stage can need many retries, further breaking the SLA.
> So I propose supporting a refetch of map output locations during the reduce 
> phase when shuffle migration is enabled and the FetchFailed is caused by a 
> decommissioned dead executor. The detailed steps are below:
>  # When `BlockTransferService` fails to fetch blocks from a decommissioned 
> dead executor, an ExecutorDeadException (with isDecommission set to true) is 
> thrown.
>  # Make MapOutputTracker support fetching the latest output without an epoch 
> provided.
>  # `ShuffleBlockFetcherIterator` refetches the latest output from 
> MapOutputTrackerMaster. For all the shuffle blocks on the decommissioned 
> executor, there should be a new location on another executor. If not, throw 
> an exception as is done today. If yes, create new local and remote requests 
> to fetch these migrated shuffle blocks. The flow is similar to the fallback 
> fetch used when a push-merged fetch fails.
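
To make step 3 concrete, below is a self-contained Scala sketch of the proposed control flow. The names here (BlockLocation, ExecutorDecommissionedException, the tracker map) are simplified stand-ins for illustration only, not Spark's internal classes:
{code:scala}
// Simplified stand-ins for the real Spark types (illustration only).
case class BlockLocation(execId: String, host: String)
class ExecutorDecommissionedException(execId: String) extends Exception(execId)

// Toy "map output tracker": block id -> latest known location after migration.
def latestLocation(tracker: Map[String, BlockLocation], block: String): Option[BlockLocation] =
  tracker.get(block)

// On a decommission-related fetch failure, look up the migrated location once
// and retry from there instead of failing the whole stage.
def fetchWithRefetch(
    blocks: Seq[String],
    initialLocs: Map[String, BlockLocation],
    tracker: Map[String, BlockLocation],
    fetch: (String, BlockLocation) => Array[Byte]): Map[String, Array[Byte]] =
  blocks.map { block =>
    val data =
      try fetch(block, initialLocs(block))
      catch {
        case _: ExecutorDecommissionedException =>
          val migrated = latestLocation(tracker, block).getOrElse(
            throw new RuntimeException(s"no migrated location for block $block"))
          fetch(block, migrated)
      }
    block -> data
  }.toMap
{code}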



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission

2023-01-09 Thread Zhongwei Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongwei Zhu updated SPARK-41953:
-
Description: 
When shuffle migration is enabled during Spark decommission, shuffle data is 
migrated to live executors and the latest locations are updated in 
MapOutputTracker. This has some issues:
 # Executors only fetch map output locations at the beginning of the reduce 
stage, so any shuffle output location change in the middle of the reduce stage 
causes FetchFailed, because the reducer fetches from the old location. Even 
though stage retries can recover from this, they still waste a lot of 
resources, since all shuffle reads and computation done before the FetchFailed 
partition are discarded.
 # During stage retries, fewer running tasks cause more executors to be 
decommissioned, so shuffle data locations keep changing. In the worst case, a 
stage can need many retries, further breaking the SLA.

So I propose supporting a refetch of map output locations during the reduce 
phase when shuffle migration is enabled and the FetchFailed is caused by a 
decommissioned dead executor. The detailed steps are below:
 # When `BlockTransferService` fails to fetch blocks from a decommissioned dead 
executor, an ExecutorDeadException (with isDecommission set to true) is thrown.
 # Make MapOutputTracker support fetching the latest output without an epoch 
provided.
 # `ShuffleBlockFetcherIterator` refetches the latest output from 
MapOutputTrackerMaster. For all the shuffle blocks on the decommissioned 
executor, there should be a new location on another executor. If not, throw an 
exception as is done today. If yes, create new local and remote requests to 
fetch these migrated shuffle blocks. The flow is similar to the fallback fetch 
used when a push-merged fetch fails.

  was:
When shuffle migration is enabled during Spark decommission, shuffle data is 
migrated to live executors and the latest locations are updated in 
MapOutputTracker. This has some issues:
 # Executors only fetch map output locations at the beginning of the reduce 
stage, so any shuffle output location change in the middle of the reduce stage 
causes FetchFailed, because the reducer fetches from the old location. Even 
though stage retries can recover from this, they still waste a lot of 
resources, since all shuffle reads and computation done before the FetchFailed 
partition are discarded.
 # During stage retries, fewer running tasks cause more executors to be 
decommissioned, so shuffle data locations keep changing. In the worst case, a 
stage can need many retries, further breaking the SLA.

So I propose supporting a refetch of map output locations during the reduce 
phase when shuffle migration is enabled and the FetchFailed is caused by a 
decommissioned dead executor. The detailed steps as


> Shuffle output location refetch during shuffle migration in decommission
> 
>
> Key: SPARK-41953
> URL: https://issues.apache.org/jira/browse/SPARK-41953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Zhongwei Zhu
>Priority: Major
>
> When shuffle migration is enabled during Spark decommission, shuffle data is 
> migrated to live executors and the latest locations are updated in 
> MapOutputTracker. This has some issues:
>  # Executors only fetch map output locations at the beginning of the reduce 
> stage, so any shuffle output location change in the middle of the reduce 
> stage causes FetchFailed, because the reducer fetches from the old location. 
> Even though stage retries can recover from this, they still waste a lot of 
> resources, since all shuffle reads and computation done before the 
> FetchFailed partition are discarded.
>  # During stage retries, fewer running tasks cause more executors to be 
> decommissioned, so shuffle data locations keep changing. In the worst case, 
> a stage can need many retries, further breaking the SLA.
> So I propose supporting a refetch of map output locations during the reduce 
> phase when shuffle migration is enabled and the FetchFailed is caused by a 
> decommissioned dead executor. The detailed steps are below:
>  # When `BlockTransferService` fails to fetch blocks from a decommissioned 
> dead executor, an ExecutorDeadException (with isDecommission set to true) is 
> thrown.
>  # Make MapOutputTracker support fetching the latest output without an epoch 
> provided.
>  # `ShuffleBlockFetcherIterator` refetches the latest output from 
> MapOutputTrackerMaster. For all the shuffle blocks on the decommissioned 
> executor, there should be a new location on another executor. If not, throw 
> an exception as is done today. If yes, create new local and remote requests 
> to fetch these migrated shuffle blocks. The flow is similar to the fallback 
> fetch used when a push-merged fetch fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: 

[jira] [Assigned] (SPARK-41232) High-order function: array_append

2023-01-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41232:


Assignee: Ankit Prakash Gupta  (was: Senthil Kumar)

> High-order function: array_append
> -
>
> Key: SPARK-41232
> URL: https://issues.apache.org/jira/browse/SPARK-41232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ankit Prakash Gupta
>Priority: Major
> Fix For: 3.4.0
>
>
> Refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html
> 1. About the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> In Spark, however, we want to apply the same data type validation as 
> array_remove.
> 2. About the NULL handling:
> Currently, Spark SQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> The existing functions array_contains, array_position and array_remove in 
> Spark SQL handle NULL like this: if the input array or/and element is NULL, 
> they return NULL. However, array_append should not follow this behavior.
> We should implement the NULL handling in array_append as follows:
> 2.1. If the array is NULL, return NULL.
> 2.2. If the array is not NULL and the element is NULL, append the NULL value 
> to the array.
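
A short Scala sketch of the intended semantics (this assumes the Spark 3.4 array_append implementation and a local SparkSession; the expected results follow points 2.1 and 2.2 above):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("array-append-nulls").getOrCreate()
// 2.1: a NULL array yields a NULL result.
spark.sql("SELECT array_append(CAST(NULL AS ARRAY<INT>), 1)").show()
// 2.2: a non-NULL array with a NULL element appends the NULL: [1, 2, NULL].
spark.sql("SELECT array_append(array(1, 2), CAST(NULL AS INT))").show()
spark.stop()
{code}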



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission

2023-01-09 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-41953:


 Summary: Shuffle output location refetch during shuffle migration 
in decommission
 Key: SPARK-41953
 URL: https://issues.apache.org/jira/browse/SPARK-41953
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu


When shuffle migration is enabled during Spark decommission, shuffle data is 
migrated to live executors and the latest locations are updated in 
MapOutputTracker. This has some issues:
 # Executors only fetch map output locations at the beginning of the reduce 
stage, so any shuffle output location change in the middle of the reduce stage 
causes FetchFailed, because the reducer fetches from the old location. Even 
though stage retries can recover from this, they still waste a lot of 
resources, since all shuffle reads and computation done before the FetchFailed 
partition are discarded.
 # During stage retries, fewer running tasks cause more executors to be 
decommissioned, so shuffle data locations keep changing. In the worst case, a 
stage can need many retries, further breaking the SLA.

So I propose supporting a refetch of map output locations during the reduce 
phase when shuffle migration is enabled and the FetchFailed is caused by a 
decommissioned dead executor. The detailed steps as



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41232) High-order function: array_append

2023-01-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41232.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38865
[https://github.com/apache/spark/pull/38865]

> High-order function: array_append
> -
>
> Key: SPARK-41232
> URL: https://issues.apache.org/jira/browse/SPARK-41232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Senthil Kumar
>Priority: Major
> Fix For: 3.4.0
>
>
> Refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html
> 1. About the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> In Spark, however, we want to apply the same data type validation as 
> array_remove.
> 2. About the NULL handling:
> Currently, Spark SQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> The existing functions array_contains, array_position and array_remove in 
> Spark SQL handle NULL like this: if the input array or/and element is NULL, 
> they return NULL. However, array_append should not follow this behavior.
> We should implement the NULL handling in array_append as follows:
> 2.1. If the array is NULL, return NULL.
> 2.2. If the array is not NULL and the element is NULL, append the NULL value 
> to the array.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41232) High-order function: array_append

2023-01-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41232:


Assignee: Senthil Kumar

> High-order function: array_append
> -
>
> Key: SPARK-41232
> URL: https://issues.apache.org/jira/browse/SPARK-41232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Senthil Kumar
>Priority: Major
>
> Refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html
> 1. About the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> In Spark, however, we want to apply the same data type validation as 
> array_remove.
> 2. About the NULL handling:
> Currently, Spark SQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> The existing functions array_contains, array_position and array_remove in 
> Spark SQL handle NULL like this: if the input array or/and element is NULL, 
> they return NULL. However, array_append should not follow this behavior.
> We should implement the NULL handling in array_append as follows:
> 2.1. If the array is NULL, return NULL.
> 2.2. If the array is not NULL and the element is NULL, append the NULL value 
> to the array.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41432) Protobuf serializer for SparkPlanGraphWrapper

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656340#comment-17656340
 ] 

Apache Spark commented on SPARK-41432:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39471

> Protobuf serializer for SparkPlanGraphWrapper
> -
>
> Key: SPARK-41432
> URL: https://issues.apache.org/jira/browse/SPARK-41432
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41432) Protobuf serializer for SparkPlanGraphWrapper

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656339#comment-17656339
 ] 

Apache Spark commented on SPARK-41432:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39471

> Protobuf serializer for SparkPlanGraphWrapper
> -
>
> Key: SPARK-41432
> URL: https://issues.apache.org/jira/browse/SPARK-41432
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-01-09 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated SPARK-41952:

Shepherd: Dongjoon Hyun

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet when it uses the 
> Zstd decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is very problematic, to the point where we can't use Parquet with Zstd 
> due to pervasive OOMs taking down our executors and disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need to:
>  # Release an updated version of Parquet in a timely manner
>  # Upgrade Spark to this new version in the upcoming release
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-01-09 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created SPARK-41952:
---

 Summary: Upgrade Parquet to fix off-heap memory leaks in Zstd codec
 Key: SPARK-41952
 URL: https://issues.apache.org/jira/browse/SPARK-41952
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.2.3, 3.3.1, 3.1.3
Reporter: Alexey Kudinkin


Recently, a native memory leak was discovered in Parquet when it uses the Zstd 
decompressor from the luben/zstd-jni library (PARQUET-2160).

This is very problematic, to the point where we can't use Parquet with Zstd due 
to pervasive OOMs taking down our executors and disrupting our jobs.

Luckily, a fix addressing this has already landed in Parquet:
[https://github.com/apache/parquet-mr/pull/982]

Now, we just need to:
 # Release an updated version of Parquet in a timely manner
 # Upgrade Spark to this new version in the upcoming release
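
For context, a minimal Scala sketch of the affected path (the output path and row count are arbitrary example values; this simply exercises Parquet's zstd-jni decompressor mentioned above):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("parquet-zstd-path").getOrCreate()
// spark.sql.parquet.compression.codec is an existing Spark SQL config.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.range(1000000).write.mode("overwrite").parquet("/tmp/example-zstd-parquet")
// Reading the zstd-compressed files goes through the luben/zstd-jni decompressor,
// which is where the off-heap leak (PARQUET-2160) shows up under load.
spark.read.parquet("/tmp/example-zstd-parquet").count()
spark.stop()
{code}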

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41951) Update SQL migration guide and documentations

2023-01-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41951.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39470
[https://github.com/apache/spark/pull/39470]

> Update SQL migration guide and documentations
> -
>
> Key: SPARK-41951
> URL: https://issues.apache.org/jira/browse/SPARK-41951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41951) Update SQL migration guide and documentations

2023-01-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41951:
-

Assignee: Dongjoon Hyun

> Update SQL migration guide and documentations
> -
>
> Key: SPARK-41951
> URL: https://issues.apache.org/jira/browse/SPARK-41951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41951) Update SQL migration guide and documentations

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41951:


Assignee: Apache Spark

> Update SQL migration guide and documentations
> -
>
> Key: SPARK-41951
> URL: https://issues.apache.org/jira/browse/SPARK-41951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41951) Update SQL migration guide and documentations

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656321#comment-17656321
 ] 

Apache Spark commented on SPARK-41951:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39470

> Update SQL migration guide and documentations
> -
>
> Key: SPARK-41951
> URL: https://issues.apache.org/jira/browse/SPARK-41951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41951) Update SQL migration guide and documentations

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41951:


Assignee: (was: Apache Spark)

> Update SQL migration guide and documentations
> -
>
> Key: SPARK-41951
> URL: https://issues.apache.org/jira/browse/SPARK-41951
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41951) Update SQL migration guide and documentations

2023-01-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-41951:
-

 Summary: Update SQL migration guide and documentations
 Key: SPARK-41951
 URL: https://issues.apache.org/jira/browse/SPARK-41951
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41947) Update the contents of error class guidelines

2023-01-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41947:


Assignee: Haejoon Lee

> Update the contents of error class guidelines
> -
>
> Key: SPARK-41947
> URL: https://issues.apache.org/jira/browse/SPARK-41947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> The error class guidelines in `core/src/main/resources/error/README.md` are 
> out of date; we should update them to match the current behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41947) Update the contents of error class guidelines

2023-01-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41947.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39464
[https://github.com/apache/spark/pull/39464]

> Update the contents of error class guidelines
> -
>
> Key: SPARK-41947
> URL: https://issues.apache.org/jira/browse/SPARK-41947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> The error class guidelines in `core/src/main/resources/error/README.md` are 
> out of date; we should update them to match the current behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile

2023-01-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41848.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39410
[https://github.com/apache/spark/pull/39410]

> Tasks are over-scheduled with TaskResourceProfile
> -
>
> Key: SPARK-41848
> URL: https://issues.apache.org/jira/browse/SPARK-41848
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: wuyi
>Assignee: huangtengfei
>Priority: Blocker
> Fix For: 3.4.0
>
>
> {code:java}
> test("SPARK-XXX") {
>   val conf = new 
> SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]")
>   sc = new SparkContext(conf)
>   val req = new TaskResourceRequests().cpus(3)
>   val rp = new ResourceProfileBuilder().require(req).build()
>   val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x =>
> Thread.sleep(5000)
> x * 2
>   }.collect()
>   assert(res === Array(0, 2))
> } {code}
> In this test, tasks are supposed to be scheduled one at a time, since each 
> task requires 3 cores but the executor only has 4 cores. However, from the 
> logs we noticed that 2 tasks are launched concurrently.
> It turns out that we used the TaskResourceProfile (taskCpus=3) of the task 
> set for task scheduling:
> {code:java}
> val rpId = taskSet.taskSet.resourceProfileId
> val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId)
> val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, 
> conf) {code}
> but the ResourceProfile (taskCpus=1) of the executor for updating the free 
> cores in ExecutorData:
> {code:java}
> val rpId = executorData.resourceProfileId
> val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId)
> val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf)
> executorData.freeCores -= taskCpus {code}
> which results in the inconsistency of the available cores.
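
A tiny self-contained sketch of the bookkeeping mismatch (plain numbers only, not Spark's scheduler code):
{code:scala}
// Scheduling checks free cores against the task set's profile (3 CPUs per task),
// but the free-core accounting uses the executor's profile (1 CPU per task).
val executorCores      = 4
val schedulingTaskCpus = 3 // from the task set's TaskResourceProfile
val accountingTaskCpus = 1 // from the executor's ResourceProfile

var freeCores = executorCores
var launched  = 0
while (freeCores >= schedulingTaskCpus) {
  launched += 1
  freeCores -= accountingTaskCpus // decremented with a different taskCpus than the check above
}
println(s"tasks launched concurrently: $launched") // prints 2, matching the observed behavior
{code}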



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile

2023-01-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41848:
-

Assignee: huangtengfei

> Tasks are over-scheduled with TaskResourceProfile
> -
>
> Key: SPARK-41848
> URL: https://issues.apache.org/jira/browse/SPARK-41848
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: wuyi
>Assignee: huangtengfei
>Priority: Blocker
>
> {code:java}
> test("SPARK-XXX") {
>   val conf = new 
> SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]")
>   sc = new SparkContext(conf)
>   val req = new TaskResourceRequests().cpus(3)
>   val rp = new ResourceProfileBuilder().require(req).build()
>   val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x =>
> Thread.sleep(5000)
> x * 2
>   }.collect()
>   assert(res === Array(0, 2))
> } {code}
> In this test, tasks are supposed to be scheduled one at a time, since each 
> task requires 3 cores but the executor only has 4 cores. However, from the 
> logs we noticed that 2 tasks are launched concurrently.
> It turns out that we used the TaskResourceProfile (taskCpus=3) of the task 
> set for task scheduling:
> {code:java}
> val rpId = taskSet.taskSet.resourceProfileId
> val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId)
> val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, 
> conf) {code}
> but the ResourceProfile (taskCpus=1) of the executor for updating the free 
> cores in ExecutorData:
> {code:java}
> val rpId = executorData.resourceProfileId
> val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId)
> val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf)
> executorData.freeCores -= taskCpus {code}
> which results in the inconsistency of the available cores.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41860) Make AvroScanBuilder and JsonScanBuilder case classes

2023-01-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41860.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39366
[https://github.com/apache/spark/pull/39366]

> Make AvroScanBuilder and JsonScanBuilder case classes
> -
>
> Key: SPARK-41860
> URL: https://issues.apache.org/jira/browse/SPARK-41860
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Assignee: Lorenzo Martini
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Most of the existing `ScanBuilder`s inheriting from `FileScanBuilder` are 
> case classes, which is very nice and useful. However, the Json and Avro ones 
> aren't, which makes it quite inconvenient to work with them. I don't see any 
> particular reason why these two should not be case classes as well, given 
> that they are very similar to the other ones and are very simple and 
> stateless.
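
As a small illustration of why case classes are convenient here (ExampleScanBuilder is a made-up stand-in, not the actual JsonScanBuilder or AvroScanBuilder signature):
{code:scala}
// Case classes get structural equality and copy() for free, which is what makes
// the other FileScanBuilder implementations pleasant to work with.
case class ExampleScanBuilder(paths: Seq[String], options: Map[String, String])

val base     = ExampleScanBuilder(Seq("/data/events"), Map.empty)
val modified = base.copy(options = Map("multiLine" -> "true"))

assert(base != modified)                                           // copy produces a distinct value
assert(base == ExampleScanBuilder(Seq("/data/events"), Map.empty)) // structural equality
{code}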



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41860) Make AvroScanBuilder and JsonScanBuilder case classes

2023-01-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41860:
-

Assignee: Lorenzo Martini

> Make AvroScanBuilder and JsonScanBuilder case classes
> -
>
> Key: SPARK-41860
> URL: https://issues.apache.org/jira/browse/SPARK-41860
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.1
>Reporter: Lorenzo Martini
>Assignee: Lorenzo Martini
>Priority: Trivial
>
> Most of the existing `ScanBuilder`s inheriting from `FileScanBuilder` are 
> case classes, which is very nice and useful. However, the Json and Avro ones 
> aren't, which makes it quite inconvenient to work with them. I don't see any 
> particular reason why these two should not be case classes as well, given 
> that they are very similar to the other ones and are very simple and 
> stateless.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41147) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1042

2023-01-09 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee resolved SPARK-41147.
-
Resolution: Duplicate

> Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1042
> ---
>
> Key: SPARK-41147
> URL: https://issues.apache.org/jira/browse/SPARK-41147
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should make all LEGACY_ERROR_TEMP to the proper name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41708) Pull v1write information to WriteFiles

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656105#comment-17656105
 ] 

Apache Spark commented on SPARK-41708:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/39468

> Pull v1write information to WriteFiles
> --
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> Make WriteFiles hold v1 write information



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41708) Pull v1write information to WriteFiles

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656104#comment-17656104
 ] 

Apache Spark commented on SPARK-41708:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/39468

> Pull v1write information to WriteFiles
> --
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> Make WriteFiles hold v1 write information



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41950) mlflow doctest fails for pandas API on Spark

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41950:


Assignee: Apache Spark

> mlflow doctest fails for pandas API on Spark
> 
>
> Key: SPARK-41950
> URL: https://issues.apache.org/jira/browse/SPARK-41950
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in 
> pyspark.pandas.mlflow.load_model
> Failed example:
> prediction_df
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.9/doctest.py", line 1336, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 1, in 
> 
> prediction_df
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in 
> __repr__
> pdf = cast("DataFrame", 
> self._get_or_create_repr_pandas_cache(max_display_count))
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in 
> _get_or_create_repr_pandas_cache
> self, "_repr_pandas_cache", {n: self.head(n + 
> 1)._to_internal_pandas()}
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in 
> _to_internal_pandas
> return self._internal.to_pandas_frame
>   File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in 
> wrapped_lazy_property
> setattr(self, attr_name, fn(self))
>   File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, 
> in to_pandas_frame
> pdf = sdf.toPandas()
>   File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 
> 208, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", 
> line 1322, in __call__
> return_value = get_return_value(
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack 
> trace below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 829, in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 821, in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 345, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 86, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 338, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 519, in func
> for result_batch, result_type in result_iter:
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1253, in udf
> yield _predict_row_batch(batch_predict_fn, row_batch_args)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1057, in _predict_row_batch
> result = predict_fn(pdf)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1237, in batch_predict_fn
> return loaded_model.predict(pdf)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, 
> in predict
> return self._predict_fn(data)
>   File 
> "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
> 355, in predict
> return self._decision_function(X)
>   File 
> "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
> 338, in _decision_function
> X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], 
> reset=False)
>   File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 
> 518, in _validate_data
> self._check_feature_names(X, reset=reset)
>   File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 
> 451, in _check_feature_names
> raise ValueError(message)
> ValueError: The feature names should match those that were passed during 
> fit.
>  

[jira] [Commented] (SPARK-41950) mlflow doctest fails for pandas API on Spark

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656097#comment-17656097
 ] 

Apache Spark commented on SPARK-41950:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39467

> mlflow doctest fails for pandas API on Spark
> 
>
> Key: SPARK-41950
> URL: https://issues.apache.org/jira/browse/SPARK-41950
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in 
> pyspark.pandas.mlflow.load_model
> Failed example:
> prediction_df
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.9/doctest.py", line 1336, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 1, in 
> 
> prediction_df
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in 
> __repr__
> pdf = cast("DataFrame", 
> self._get_or_create_repr_pandas_cache(max_display_count))
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in 
> _get_or_create_repr_pandas_cache
> self, "_repr_pandas_cache", {n: self.head(n + 
> 1)._to_internal_pandas()}
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in 
> _to_internal_pandas
> return self._internal.to_pandas_frame
>   File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in 
> wrapped_lazy_property
> setattr(self, attr_name, fn(self))
>   File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, 
> in to_pandas_frame
> pdf = sdf.toPandas()
>   File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 
> 208, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", 
> line 1322, in __call__
> return_value = get_return_value(
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack 
> trace below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 829, in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 821, in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 345, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 86, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 338, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 519, in func
> for result_batch, result_type in result_iter:
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1253, in udf
> yield _predict_row_batch(batch_predict_fn, row_batch_args)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1057, in _predict_row_batch
> result = predict_fn(pdf)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1237, in batch_predict_fn
> return loaded_model.predict(pdf)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, 
> in predict
> return self._predict_fn(data)
>   File 
> "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
> 355, in predict
> return self._decision_function(X)
>   File 
> "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
> 338, in _decision_function
> X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], 
> reset=False)
>   File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 
> 518, in _validate_data
> self._check_feature_names(X, reset=reset)
>   File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 
> 451, in _check_feature_names
> raise ValueError(message)
> 

[jira] [Assigned] (SPARK-41950) mlflow doctest fails for pandas API on Spark

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41950:


Assignee: (was: Apache Spark)

> mlflow doctest fails for pandas API on Spark
> 
>
> Key: SPARK-41950
> URL: https://issues.apache.org/jira/browse/SPARK-41950
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in 
> pyspark.pandas.mlflow.load_model
> Failed example:
> prediction_df
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.9/doctest.py", line 1336, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 1, in 
> 
> prediction_df
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in 
> __repr__
> pdf = cast("DataFrame", 
> self._get_or_create_repr_pandas_cache(max_display_count))
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in 
> _get_or_create_repr_pandas_cache
> self, "_repr_pandas_cache", {n: self.head(n + 
> 1)._to_internal_pandas()}
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in 
> _to_internal_pandas
> return self._internal.to_pandas_frame
>   File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in 
> wrapped_lazy_property
> setattr(self, attr_name, fn(self))
>   File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, 
> in to_pandas_frame
> pdf = sdf.toPandas()
>   File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 
> 208, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", 
> line 1322, in __call__
> return_value = get_return_value(
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack 
> trace below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 829, in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 821, in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 345, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 86, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 338, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 519, in func
> for result_batch, result_type in result_iter:
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1253, in udf
> yield _predict_row_batch(batch_predict_fn, row_batch_args)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1057, in _predict_row_batch
> result = predict_fn(pdf)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1237, in batch_predict_fn
> return loaded_model.predict(pdf)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, 
> in predict
> return self._predict_fn(data)
>   File 
> "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
> 355, in predict
> return self._decision_function(X)
>   File 
> "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
> 338, in _decision_function
> X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], 
> reset=False)
>   File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 
> 518, in _validate_data
> self._check_feature_names(X, reset=reset)
>   File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 
> 451, in _check_feature_names
> raise ValueError(message)
> ValueError: The feature names should match those that were passed during 
> fit.
> Feature names unseen 

[jira] [Updated] (SPARK-41950) mlflow doctest fails for pandas API on Spark

2023-01-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41950:
-
Issue Type: Test  (was: Bug)

> mlflow doctest fails for pandas API on Spark
> 
>
> Key: SPARK-41950
> URL: https://issues.apache.org/jira/browse/SPARK-41950
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in 
> pyspark.pandas.mlflow.load_model
> Failed example:
> prediction_df
> Exception raised:
> Traceback (most recent call last):
>   File "/usr/lib/python3.9/doctest.py", line 1336, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 1, in 
> 
> prediction_df
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in 
> __repr__
> pdf = cast("DataFrame", 
> self._get_or_create_repr_pandas_cache(max_display_count))
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in 
> _get_or_create_repr_pandas_cache
> self, "_repr_pandas_cache", {n: self.head(n + 
> 1)._to_internal_pandas()}
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in 
> _to_internal_pandas
> return self._internal.to_pandas_frame
>   File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in 
> wrapped_lazy_property
> setattr(self, attr_name, fn(self))
>   File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, 
> in to_pandas_frame
> pdf = sdf.toPandas()
>   File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 
> 208, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", 
> line 1322, in __call__
> return_value = get_return_value(
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack 
> trace below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 829, in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 821, in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 345, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 86, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 338, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 519, in func
> for result_batch, result_type in result_iter:
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1253, in udf
> yield _predict_row_batch(batch_predict_fn, row_batch_args)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1057, in _predict_row_batch
> result = predict_fn(pdf)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 
> 1237, in batch_predict_fn
> return loaded_model.predict(pdf)
>   File 
> "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, 
> in predict
> return self._predict_fn(data)
>   File 
> "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
> 355, in predict
> return self._decision_function(X)
>   File 
> "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
> 338, in _decision_function
> X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], 
> reset=False)
>   File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 
> 518, in _validate_data
> self._check_feature_names(X, reset=reset)
>   File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 
> 451, in _check_feature_names
> raise ValueError(message)
> ValueError: The feature names should match those that were passed during 
> fit.
> Feature names unseen at fit time:

[jira] [Created] (SPARK-41950) mlflow doctest fails for pandas API on Spark

2023-01-09 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41950:


 Summary: mlflow doctest fails for pandas API on Spark
 Key: SPARK-41950
 URL: https://issues.apache.org/jira/browse/SPARK-41950
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


{code}
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in 
pyspark.pandas.mlflow.load_model
Failed example:
prediction_df
Exception raised:
Traceback (most recent call last):
  File "/usr/lib/python3.9/doctest.py", line 1336, in __run
exec(compile(example.source, filename, "single",
  File "", line 1, in 
prediction_df
  File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in 
__repr__
pdf = cast("DataFrame", 
self._get_or_create_repr_pandas_cache(max_display_count))
  File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in 
_get_or_create_repr_pandas_cache
self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()}
  File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in 
_to_internal_pandas
return self._internal.to_pandas_frame
  File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in 
wrapped_lazy_property
setattr(self, attr_name, fn(self))
  File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, in 
to_pandas_frame
pdf = sdf.toPandas()
  File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 
208, in toPandas
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
  File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in 
collect
sock_info = self._jdf.collectToPython()
  File 
"/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 
1322, in __call__
return_value = get_return_value(
  File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco
raise converted from None
pyspark.sql.utils.PythonException: 
  An exception was thrown from the Python worker. Please see the stack 
trace below.
Traceback (most recent call last):
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
829, in main
process()
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
821, in process
serializer.dump_stream(out_iter, outfile)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 345, in dump_stream
return ArrowStreamSerializer.dump_stream(self, 
init_stream_yield_batches(), stream)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 86, in dump_stream
for batch in iterator:
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 338, in init_stream_yield_batches
for series in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
519, in func
for result_batch, result_type in result_iter:
  File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", 
line 1253, in udf
yield _predict_row_batch(batch_predict_fn, row_batch_args)
  File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", 
line 1057, in _predict_row_batch
result = predict_fn(pdf)
  File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", 
line 1237, in batch_predict_fn
return loaded_model.predict(pdf)
  File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", 
line 413, in predict
return self._predict_fn(data)
  File 
"/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
355, in predict
return self._decision_function(X)
  File 
"/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 
338, in _decision_function
X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], 
reset=False)
  File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 518, 
in _validate_data
self._check_feature_names(X, reset=reset)
  File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 451, 
in _check_feature_names
raise ValueError(message)
ValueError: The feature names should match those that were passed during 
fit.
Feature names unseen at fit time:
- 0
- 1
Feature names seen at fit time, yet now missing:
- x1
- x2
{code}

https://github.com/apache/spark/actions/runs/3871715040/jobs/6600578830
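
For context, the check that fails here is scikit-learn's feature-name validation rather than anything in Spark itself: the UDF hands the model a frame whose columns are named 0 and 1 while the model was fit on x1 and x2. A minimal standalone sketch (plain pandas plus scikit-learn 1.x, not the actual doctest) reproduces it and shows one way to realign the names before predict:

{code:python}
# Minimal sketch: predicting on a frame whose column names differ from the
# names used at fit time triggers the ValueError quoted above.
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [2.0, 1.0, 0.0]})
model = LinearRegression().fit(train, [1.0, 2.0, 3.0])

bad = pd.DataFrame([[1.0, 2.0]], columns=["0", "1"])  # names unseen at fit time
good = bad.rename(columns={"0": "x1", "1": "x2"})     # realign names before predict
print(model.predict(good))
{code}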



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41944) Pass configurations when local remote mode is on

2023-01-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41944:


Assignee: Hyukjin Kwon

> Pass configurations when local remote mode is on
> 
>
> Key: SPARK-41944
> URL: https://issues.apache.org/jira/browse/SPARK-41944
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Local remote mode currently does not pass the configurations properly to the
> server when they are specified. We should pass them through properly.
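
For illustration only, the expectation is roughly the sketch below: options set on the builder should reach the locally launched server when a local remote session is used. The particular option here is just an example, not the fix itself:

{code:python}
# Hypothetical illustration of the expected behavior in local remote mode.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("local")                              # local remote (Spark Connect) mode
    .config("spark.sql.shuffle.partitions", "4")  # should be honored server-side
    .getOrCreate()
)
print(spark.conf.get("spark.sql.shuffle.partitions"))  # expected: 4
{code}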



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41944) Pass configurations when local remote mode is on

2023-01-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41944.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39463
[https://github.com/apache/spark/pull/39463]

> Pass configurations when local remote mode is on
> 
>
> Key: SPARK-41944
> URL: https://issues.apache.org/jira/browse/SPARK-41944
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> local remote mode does not currently pass the configurations properly to the 
> server is that's specified. We should pass them properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41945) Python: connect client lost column data with pyarrow.Table.to_pylist

2023-01-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41945:
-

Assignee: jiaan.geng

> Python: connect client lost column data with pyarrow.Table.to_pylist
> 
>
> Key: SPARK-41945
> URL: https://issues.apache.org/jira/browse/SPARK-41945
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Python: connect client should not use pyarrow.Table.to_pylist to transform 
> fetched data.
> For example:
> the data in pyarrow.Table is shown below.
> {code:java}
> pyarrow.Table
> key: string
> order: int64
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
> NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> 
> key: [["a","a","a","a","a","b","b"]]
> order: [[0,1,2,3,4,1,2]]
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,"x","x","x","x",null,null]]
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,"x","x","x","x",null,null]]
> nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
> NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,null,"y","y","y",null,null]]
> {code}
> The table has five columns, as shown above.
> But the data after calling pyarrow.Table.to_pylist() is shown below.
> {code:java}
> [{
>   'key': 'a',
>   'order': 0,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'a',
>   'order': 1,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'a',
>   'order': 2,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'a',
>   'order': 3,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'a',
>   'order': 4,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'b',
>   'order': 1,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'b',
>   'order': 2,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }]
> {code}
> There are only four columns left.
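
For context, the loss follows from how to_pylist() materializes rows: each row becomes a Python dict, so the two identically named nth_value(...) columns collapse into a single key. A minimal standalone pyarrow sketch (not the connect client code) shows the same behavior:

{code:python}
# Duplicate column names collide when to_pylist() builds one dict per row.
import pyarrow as pa

table = pa.table([pa.array([1, 2]), pa.array([3, 4])], names=["c", "c"])
print(table.num_columns)  # 2 -- both columns exist in the Arrow table
print(table.to_pylist())  # [{'c': 3}, {'c': 4}] -- one 'c' column is silently dropped
{code}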



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41945) Python: connect client lost column data with pyarrow.Table.to_pylist

2023-01-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41945.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39461
[https://github.com/apache/spark/pull/39461]

> Python: connect client lost column data with pyarrow.Table.to_pylist
> 
>
> Key: SPARK-41945
> URL: https://issues.apache.org/jira/browse/SPARK-41945
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> Python: connect client should not use pyarrow.Table.to_pylist to transform 
> fetched data.
> For example:
> the data in pyarrow.Table is shown below.
> {code:java}
> pyarrow.Table
> key: string
> order: int64
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
> NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> 
> key: [["a","a","a","a","a","b","b"]]
> order: [[0,1,2,3,4,1,2]]
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,"x","x","x","x",null,null]]
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,"x","x","x","x",null,null]]
> nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
> NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,null,"y","y","y",null,null]]
> {code}
> The table has five columns, as shown above.
> But the data after calling pyarrow.Table.to_pylist() is shown below.
> {code:java}
> [{
>   'key': 'a',
>   'order': 0,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'a',
>   'order': 1,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'a',
>   'order': 2,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'a',
>   'order': 3,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'a',
>   'order': 4,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'b',
>   'order': 1,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'b',
>   'order': 2,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }]
> {code}
> There are only four columns left.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)

2023-01-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41842:
-
Fix Version/s: (was: 3.4.0)

> Support data type Timestamp(NANOSECOND, null)
> -
>
> Key: SPARK-41842
> URL: https://issues.apache.org/jira/browse/SPARK-41842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1966, in pyspark.sql.connect.functions.hour
> Failed example:
>     df.select(hour('ts').alias('hour')).collect()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.select(hour('ts').alias('hour')).collect()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1017, in collect
>         pdf = self.toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: 
> (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
> Timestamp(NANOSECOND, null){code}
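
For context, pandas and Arrow default to nanosecond precision, which is where the Timestamp(NANOSECOND, null) type comes from. The sketch below (plain pyarrow, an illustrative client-side workaround rather than the fix that landed for this ticket) shows how such a column arises and how it can be cast down to microseconds:

{code:python}
# Where timestamp[ns] comes from, and one way to cast it down before sending.
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"ts": pd.to_datetime(["2015-04-08 13:08:15"])})
tbl = pa.Table.from_pandas(pdf)
print(tbl.schema.field("ts").type)     # timestamp[ns] -- the rejected type

tbl_us = tbl.cast(pa.schema([pa.field("ts", pa.timestamp("us"))]))
print(tbl_us.schema.field("ts").type)  # timestamp[us]
{code}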



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)

2023-01-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-41842:
--

Let me reopen this since 
https://github.com/apache/spark/blob/56c7cf33929d7d42b7d299c0bb7e895963241214/python/pyspark/sql/tests/connect/test_parity_dataframe.py#L40-L48
 doesn't pass yet.

> Support data type Timestamp(NANOSECOND, null)
> -
>
> Key: SPARK-41842
> URL: https://issues.apache.org/jira/browse/SPARK-41842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 1966, in pyspark.sql.connect.functions.hour
> Failed example:
>     df.select(hour('ts').alias('hour')).collect()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "", line 1, in 
> 
>         df.select(hour('ts').alias('hour')).collect()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1017, in collect
>         pdf = self.toPandas()
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", 
> line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: 
> (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: 
> Timestamp(NANOSECOND, null){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8

2023-01-09 Thread Jiale He (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656023#comment-17656023
 ] 

Jiale He commented on SPARK-41741:
--

[~yumwang] 

Do you mean this?

!image-2023-01-09-18-27-53-479.png|width=967,height=519!

> [SQL] ParquetFilters StringStartsWith push down matching string do not use 
> UTF-8
> 
>
> Key: SPARK-41741
> URL: https://issues.apache.org/jira/browse/SPARK-41741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiale He
>Priority: Major
> Attachments: image-2022-12-28-18-00-00-861.png, 
> image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, 
> image-2023-01-09-18-27-53-479.png, 
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
>
> Hello ~
>  
> I found a problem, but there are two ways to solve it.
>  
> Parquet filter pushdown is enabled. When querying with a LIKE '***%'
> predicate, if the system default encoding is not UTF-8, it may cause an error.
>  
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>  
> The following is the information to reproduce this problem.
> The parquet sample file is in the attachment.
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>  
>   !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>  
> I think the correct code should be:
> {code:java}
> private val strToBinary = 
> Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}
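
For illustration, the byte-level mismatch the description refers to can be seen outside the JVM as well. The snippet below is a Python stand-in for String.getBytes() with a non-UTF-8 default charset (GBK is only an example):

{code:python}
# Parquet stores strings as UTF-8, so a prefix built from platform-default
# bytes (e.g. GBK) is compared against different byte sequences.
s = "啦啦乐乐"
print(s.encode("utf-8"))                      # what Parquet actually stores
print(s.encode("gbk"))                        # what getBytes() yields under a GBK default
print(s.encode("utf-8") == s.encode("gbk"))   # False -> the startsWith pushdown misbehaves
{code}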



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8

2023-01-09 Thread Jiale He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiale He updated SPARK-41741:
-
Attachment: image-2023-01-09-18-27-53-479.png

> [SQL] ParquetFilters StringStartsWith push down matching string do not use 
> UTF-8
> 
>
> Key: SPARK-41741
> URL: https://issues.apache.org/jira/browse/SPARK-41741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiale He
>Priority: Major
> Attachments: image-2022-12-28-18-00-00-861.png, 
> image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, 
> image-2023-01-09-18-27-53-479.png, 
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
>
> Hello ~
>  
> I found a problem, but there are two ways to solve it.
>  
> Parquet filter pushdown is enabled. When querying with a LIKE '***%'
> predicate, if the system default encoding is not UTF-8, it may cause an error.
>  
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>  
> The following is the information to reproduce this problem.
> The parquet sample file is in the attachment.
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>  
>   !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>  
> I think the correct code should be:
> {code:java}
> private val strToBinary = 
> Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41949) Make stage scheduling support local-cluster mode

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41949:


Assignee: (was: Apache Spark)

> Make stage scheduling support local-cluster mode
> 
>
> Key: SPARK-41949
> URL: https://issues.apache.org/jira/browse/SPARK-41949
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Make stage scheduling support local-cluster mode.
> This is useful in testing, especially for test code of third-party Python
> libraries that depend on PySpark: many tests are written with pytest, but
> pytest is hard to integrate with a standalone Spark cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41949) Make stage scheduling support local-cluster mode

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655989#comment-17655989
 ] 

Apache Spark commented on SPARK-41949:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/39424

> Make stage scheduling support local-cluster mode
> 
>
> Key: SPARK-41949
> URL: https://issues.apache.org/jira/browse/SPARK-41949
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Make stage scheduling support local-cluster mode.
> This is useful in testing, especially for test code of third-party Python
> libraries that depend on PySpark: many tests are written with pytest, but
> pytest is hard to integrate with a standalone Spark cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41949) Make stage scheduling support local-cluster mode

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41949:


Assignee: Apache Spark

> Make stage scheduling support local-cluster mode
> 
>
> Key: SPARK-41949
> URL: https://issues.apache.org/jira/browse/SPARK-41949
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Major
>
> Make stage scheduling support local-cluster mode.
> This is useful in testing, especially for test code of third-party Python
> libraries that depend on PySpark: many tests are written with pytest, but
> pytest is hard to integrate with a standalone Spark cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41949) Make stage scheduling support local-cluster mode

2023-01-09 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-41949:
--

 Summary: Make stage scheduling support local-cluster mode
 Key: SPARK-41949
 URL: https://issues.apache.org/jira/browse/SPARK-41949
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 3.4.0
Reporter: Weichen Xu


Make stage scheduling support local-cluster mode.


This is useful in testing, especially for test code of third-party Python
libraries that depend on PySpark: many tests are written with pytest, but
pytest is hard to integrate with a standalone Spark cluster.
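
For illustration, a pytest fixture of the kind this change would enable might look like the sketch below; the local-cluster[...] master URL is Spark's built-in multi-worker test mode, and the worker count, cores, and memory here are arbitrary:

{code:python}
# Sketch of a session-scoped pytest fixture backed by local-cluster mode.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local-cluster[2,1,1024]")   # 2 workers, 1 core, 1024 MB each
        .appName("stage-scheduling-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
{code}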



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655982#comment-17655982
 ] 

Apache Spark commented on SPARK-41948:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39466

> Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
> --
>
> Key: SPARK-41948
> URL: https://issues.apache.org/jira/browse/SPARK-41948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD

2023-01-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655981#comment-17655981
 ] 

Apache Spark commented on SPARK-41948:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39466

> Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
> --
>
> Key: SPARK-41948
> URL: https://issues.apache.org/jira/browse/SPARK-41948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41948:


Assignee: (was: Apache Spark)

> Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
> --
>
> Key: SPARK-41948
> URL: https://issues.apache.org/jira/browse/SPARK-41948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD

2023-01-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41948:


Assignee: Apache Spark

> Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
> --
>
> Key: SPARK-41948
> URL: https://issues.apache.org/jira/browse/SPARK-41948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41780) `regexp_replace('', '[a\\\\d]{0, 2}', 'x')` causes an internal error

2023-01-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41780:


Assignee: BingKun Pan

> `regexp_replace('', '[a\\d]{0, 2}', 'x')` causes an internal error
> 
>
> Key: SPARK-41780
> URL: https://issues.apache.org/jira/browse/SPARK-41780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: Spark3.3.0 local mode
>Reporter: Remzi Yang
>Assignee: BingKun Pan
>Priority: Major
> Attachments: image-2023-01-04-14-12-26-126.png
>
>
> {code:scala}
> scala> spark.sql("select regexp_replace('', '[a\\d]{0,2}', 'x')").show
> +-----------------------------------+
> |regexp_replace(, [a\d]\{0,2}, x, 1)|
> +-----------------------------------+
> |                                  x|
> +-----------------------------------+
> 
> 
> scala> spark.sql("select regexp_replace('', '[a\\d]{0, 2}', 'x')").show
> org.apache.spark.SparkException: The Spark SQL phase optimization failed with 
> an internal error. Please, fill a bug report in, and provide the full stack 
> trace.
> {code}
>  
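
For context, the space is what breaks things: '{0,2}' is a repetition quantifier, while '{0, 2}' is not, so the pattern gets interpreted literally (and, in Spark's case, trips an internal error during optimization). A quick analogy in Python's re module, not Spark's JVM regex engine:

{code:python}
# A space inside the braces stops '{0, 2}' from being a quantifier, so the
# pattern is read literally and no longer matches the empty string.
import re

print(re.sub(r"[a\d]{0,2}", "x", ""))   # 'x' -- empty match replaced
print(re.sub(r"[a\d]{0, 2}", "x", ""))  # ''  -- literal '{0, 2}' never matches ''
{code}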



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41780) `regexp_replace('', '[a\\\\d]{0, 2}', 'x')` causes an internal error

2023-01-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41780.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39383
[https://github.com/apache/spark/pull/39383]

> `regexp_replace('', '[a\\d]{0, 2}', 'x')` causes an internal error
> 
>
> Key: SPARK-41780
> URL: https://issues.apache.org/jira/browse/SPARK-41780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: Spark3.3.0 local mode
>Reporter: Remzi Yang
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: image-2023-01-04-14-12-26-126.png
>
>
> {code:scala}
> scala> spark.sql("select regexp_replace('', '[a\\d]{0,2}', 'x')").show
> +-----------------------------------+
> |regexp_replace(, [a\d]\{0,2}, x, 1)|
> +-----------------------------------+
> |                                  x|
> +-----------------------------------+
> 
> 
> scala> spark.sql("select regexp_replace('', '[a\\d]{0, 2}', 'x')").show
> org.apache.spark.SparkException: The Spark SQL phase optimization failed with 
> an internal error. Please, fill a bug report in, and provide the full stack 
> trace.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD

2023-01-09 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-41948:
---

 Summary: Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
 Key: SPARK-41948
 URL: https://issues.apache.org/jira/browse/SPARK-41948
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-41945) Python: connect client lost column data with pyarrow.Table.to_pylist

2023-01-09 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41945 ]


jiaan.geng deleted comment on SPARK-41945:


was (Author: beliefer):
I'm working on.

> Python: connect client lost column data with pyarrow.Table.to_pylist
> 
>
> Key: SPARK-41945
> URL: https://issues.apache.org/jira/browse/SPARK-41945
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Python: connect client should not use pyarrow.Table.to_pylist to transform 
> fetched data.
> For example:
> the data in pyarrow.Table is shown below.
> {code:java}
> pyarrow.Table
> key: string
> order: int64
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
> NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
> 
> key: [["a","a","a","a","a","b","b"]]
> order: [[0,1,2,3,4,1,2]]
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,"x","x","x","x",null,null]]
> nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,"x","x","x","x",null,null]]
> nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
> NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
> [[null,null,"y","y","y",null,null]]
> {code}
> The table has five columns, as shown above.
> But the data after calling pyarrow.Table.to_pylist() is shown below.
> {code:java}
> [{
>   'key': 'a',
>   'order': 0,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'a',
>   'order': 1,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'a',
>   'order': 2,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'a',
>   'order': 3,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'a',
>   'order': 4,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
> }, {
>   'key': 'b',
>   'order': 1,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }, {
>   'key': 'b',
>   'order': 2,
>   'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
> FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
>   'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
> ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
> }]
> {code}
> There are only four columns left.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41945) Python: connect client lost column data with pyarrow.Table.to_pylist

2023-01-09 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-41945:
---
Description: 
Python: connect client should not use pyarrow.Table.to_pylist to transform 
fetched data.
For example:
the data in pyarrow.Table is shown below.

{code:java}
pyarrow.Table
key: string
order: int64
nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string

key: [["a","a","a","a","a","b","b"]]
order: [[0,1,2,3,4,1,2]]
nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,"x","x","x","x",null,null]]
nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,"x","x","x","x",null,null]]
nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): 
[[null,null,"y","y","y",null,null]]
{code}

The table has five columns, as shown above.
But the data after calling pyarrow.Table.to_pylist() is shown below.

{code:java}
[{
'key': 'a',
'order': 0,
'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
}, {
'key': 'a',
'order': 1,
'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
}, {
'key': 'a',
'order': 2,
'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
}, {
'key': 'a',
'order': 3,
'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
}, {
'key': 'a',
'order': 4,
'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x',
'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y'
}, {
'key': 'b',
'order': 1,
'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
}, {
'key': 'b',
'order': 2,
'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS 
FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None,
'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order 
ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None
}]
{code}

There are only four columns left.


  was:
Python: connect client should not use pyarrow.Table.to_pylist to transform 
fetched data.
For example:
the data in pyarrow.Table is shown below.

{code:java}
pyarrow.Table
key: string
order: int64
nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string
nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC 
NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string

key: [["a","a","a","a","a","b","b"]]
order: [[0,1,2,3,4,1,2]]
nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,"x","x","x","x",null,null]]
nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,"x","x","x","x",null,null]]
nth_value(value, 2) ignore nulls OVER