[jira] [Updated] (SPARK-37528) Schedule Tasks By Input Size

2022-04-01 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-37528:
--
Affects Version/s: 3.4.0
   (was: 3.3.0)

> Schedule Tasks By Input Size
> 
>
> Key: SPARK-37528
> URL: https://issues.apache.org/jira/browse/SPARK-37528
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> In general, a larger input data size means a longer running time, so ideally 
> the DAGScheduler should submit the tasks with the larger inputs first. This 
> can reduce the running time of the whole stage. For example, suppose one 
> stage has 4 tasks, the defaultParallelism is 2, and the 4 tasks have running 
> times of [1s, 3s, 2s, 4s]:
> - in submission order, the running time of the stage is: 7s
> - with the biggest tasks first, the running time of the stage is: 5s
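
A minimal sketch of the arithmetic behind this example (illustrative only; 
this is not Spark's scheduler, and the greedy slot assignment is an 
assumption):

{code:python}
# Simulate running tasks on a fixed number of slots: each task goes to the
# slot that frees up first; the stage finishes when the last slot does.
def stage_time(task_durations, parallelism=2):
    slots = [0.0] * parallelism
    for d in task_durations:
        i = slots.index(min(slots))
        slots[i] += d
    return max(slots)

tasks = [1, 3, 2, 4]
print(stage_time(tasks))                        # 7 -- submission order
print(stage_time(sorted(tasks, reverse=True)))  # 5 -- biggest input first
{code}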






[jira] [Created] (SPARK-38757) Update the Oracle docker image

2022-04-01 Thread Luca Canali (Jira)
Luca Canali created SPARK-38757:
---

 Summary: Update the Oracle docker image
 Key: SPARK-38757
 URL: https://issues.apache.org/jira/browse/SPARK-38757
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.1
Reporter: Luca Canali


This proposes to update the Docker image used for integration tests and builds 
from Oracle XE version 18.4.0 to Oracle XE version 21.3.0.

Currently Oracle XE version 18.4.0 is being used. Oracle 18c support ended in 
2021, and Oracle 21c is the latest release of the Oracle RDBMS.






[jira] [Updated] (SPARK-38757) Update the Oracle docker image version used for test and integration

2022-04-01 Thread Luca Canali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-38757:

Summary: Update the Oracle docker image version used for test and 
integration  (was: Update the Oracle docker image)

> Update the Oracle docker image version used for test and integration
> 
>
> Key: SPARK-38757
> URL: https://issues.apache.org/jira/browse/SPARK-38757
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to update the Docker image used for integration tests and 
> builds from Oracle XE version 18.4.0 to Oracle XE version 21.3.0.
> Currently Oracle XE version 18.4.0 is being used. Oracle 18c support ended 
> in 2021, and Oracle 21c is the latest release of the Oracle RDBMS.






[jira] [Commented] (SPARK-38757) Update the Oracle docker image version used for test and integration

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515807#comment-17515807
 ] 

Apache Spark commented on SPARK-38757:
--

User 'LucaCanali' has created a pull request for this issue:
https://github.com/apache/spark/pull/36036

> Update the Oracle docker image version used for test and integration
> 
>
> Key: SPARK-38757
> URL: https://issues.apache.org/jira/browse/SPARK-38757
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to update the Docker image used for integration tests and 
> builds from Oracle XE version 18.4.0 to Oracle XE version 21.3.0.
> Currently Oracle XE version 18.4.0 is being used. Oracle 18c support ended 
> in 2021, and Oracle 21c is the latest release of the Oracle RDBMS.






[jira] [Assigned] (SPARK-38757) Update the Oracle docker image version used for test and integration

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38757:


Assignee: (was: Apache Spark)

> Update the Oracle docker image version used for test and integration
> 
>
> Key: SPARK-38757
> URL: https://issues.apache.org/jira/browse/SPARK-38757
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to update the Docker image used for integration tests and 
> builds from Oracle XE version 18.4.0 to Oracle XE version 21.3.0.
> Currently Oracle XE version 18.4.0 is being used. Oracle 18c support ended 
> in 2021, and Oracle 21c is the latest release of the Oracle RDBMS.






[jira] [Assigned] (SPARK-38757) Update the Oracle docker image version used for test and integration

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38757:


Assignee: Apache Spark

> Update the Oracle docker image version used for test and integration
> 
>
> Key: SPARK-38757
> URL: https://issues.apache.org/jira/browse/SPARK-38757
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
>
> This proposes to update the Docker image used for integration tests and 
> builds from Oracle XE version 18.4.0 to Oracle XE version 21.3.0.
> Currently Oracle XE version 18.4.0 is being used. Oracle 18c support ended 
> in 2021, and Oracle 21c is the latest release of the Oracle RDBMS.






[jira] [Created] (SPARK-38758) Web UI add heap dump

2022-04-01 Thread Jinpeng Chi (Jira)
Jinpeng Chi created SPARK-38758:
---

 Summary: Web UI add heap dump 
 Key: SPARK-38758
 URL: https://issues.apache.org/jira/browse/SPARK-38758
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.2.1, 3.1.2
Reporter: Jinpeng Chi


The current Web UI can take thread dumps, so I want to add a heap (memory) dump as well.






[jira] [Commented] (SPARK-38358) Add migration guide for spark.sql.hive.convertMetastoreInsertDir and spark.sql.hive.convertMetastoreCtas

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515836#comment-17515836
 ] 

Apache Spark commented on SPARK-38358:
--

User 'cutiechi' has created a pull request for this issue:
https://github.com/apache/spark/pull/36037

> Add migration guide for spark.sql.hive.convertMetastoreInsertDir and 
> spark.sql.hive.convertMetastoreCtas
> 
>
> Key: SPARK-38358
> URL: https://issues.apache.org/jira/browse/SPARK-38358
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> After we migrated to Spark 3, many jobs throw exceptions, since the data 
> source API does not support overwriting a partitioned table while reading 
> from the same table.
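
A hedged sketch of the fallback these configs control (the keys come from the 
issue title; defaults and exact behavior should be verified against the 
migration guide):

{code:python}
# Fall back to the Hive SerDe path for CTAS and INSERT OVERWRITE DIRECTORY,
# avoiding the data source write path described above.
spark.conf.set("spark.sql.hive.convertMetastoreCtas", "false")
spark.conf.set("spark.sql.hive.convertMetastoreInsertDir", "false")
{code}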






[jira] [Created] (SPARK-38759) Add StreamingQueryListener support in PySpark

2022-04-01 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38759:


 Summary: Add StreamingQueryListener support in PySpark
 Key: SPARK-38759
 URL: https://issues.apache.org/jira/browse/SPARK-38759
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Structured Streaming
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


PySpark currently does not support {{StreamingQueryListener}}, whereas DStream 
has it. This feature is important, especially together with 
{{Dataset.observe}}, so users can monitor what's going on in their queries.
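
A minimal sketch of what the Python side could look like, mirroring the Scala 
StreamingQueryListener (the class shape and the registration call are 
assumptions, not the final API):

{code:python}
# Hypothetical listener mirroring Scala's StreamingQueryListener callbacks.
class PrintingListener:
    def onQueryStarted(self, event):
        print("query started:", event.id)

    def onQueryProgress(self, event):
        print("rows/sec:", event.progress.processedRowsPerSecond)

    def onQueryTerminated(self, event):
        print("query terminated:", event.id)

# Registration, assuming it mirrors the Scala API:
# spark.streams.addListener(PrintingListener())
{code}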






[jira] [Created] (SPARK-38760) Implement DataFrame.observe in PySpark

2022-04-01 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38760:


 Summary: Implement DataFrame.observe in PySpark
 Key: SPARK-38760
 URL: https://issues.apache.org/jira/browse/SPARK-38760
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Structured Streaming
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


This JIRA is blocked by SPARK-38759.

We should support DataFrame.observe for PySpark Structured Streaming users so 
they can monitor their queries.
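
A sketch of the intended usage, mirroring the existing Scala Dataset.observe 
(the Python signature and the DataFrame {{df}} are assumptions):

{code:python}
from pyspark.sql import functions as F

# Attach named metrics to a DataFrame; Spark reports them per action/batch,
# e.g. through a StreamingQueryListener (SPARK-38759).
observed = df.observe(
    "metrics",
    F.count(F.lit(1)).alias("cnt"),
    F.max("id").alias("max_id"),
)
observed.collect()
{code}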






[jira] [Resolved] (SPARK-38684) Stream-stream outer join has a possible correctness issue due to weakly read consistent on outer iterators

2022-04-01 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-38684.
--
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 36002
[https://github.com/apache/spark/pull/36002]

> Stream-stream outer join has a possible correctness issue due to weakly read 
> consistent on outer iterators
> --
>
> Key: SPARK-38684
> URL: https://issues.apache.org/jira/browse/SPARK-38684
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.3.0, 3.2.2
>
>
> We found that stream-stream join has the same issue as SPARK-38320 on the 
> appended iterators. Since the root cause is the same as SPARK-38320, this is 
> only reproducible with the RocksDB state store provider; but even with the 
> HDFS-backed state store provider, the behavior is not guaranteed by the 
> interface contract, and hence may depend on the JVM vendor, version, etc.
> I can easily construct a "data loss" scenario in the state store.
> The conditions are:
>  * Use a stream-stream time-interval outer join
>  ** a left outer join has the issue on the left side, a right outer join on 
> the right side, and a full outer join on both sides
>  * At batch N, produce non-late row(s) on the problematic side
>  * In the same batch (batch N), some row(s) on the problematic side are 
> evicted by the watermark condition
> When these conditions are fulfilled, keyToNumValues goes out of sync between 
> the state and the iterator in the eviction phase. If eviction happens for 
> the grouping key (updating keyToNumValues), the eviction phase "overwrites" 
> keyToNumValues in the state with the value it calculates.
> Since the eviction phase "does not know" about the new rows (its 
> keyToNumValues is out of sync), this effectively discards all rows added to 
> the state in batch N.
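
For reference, a sketch of a query shape matching the conditions above 
(illustrative only; the sources, column names, and intervals are assumptions):

{code:python}
from pyspark.sql.functions import expr

left = (spark.readStream.format("rate").load()
        .selectExpr("value AS leftId", "timestamp AS leftTime")
        .withWatermark("leftTime", "10 seconds"))
right = (spark.readStream.format("rate").load()
         .selectExpr("value AS rightId", "timestamp AS rightTime")
         .withWatermark("rightTime", "10 seconds"))

# Stream-stream time-interval left outer join: per the conditions above,
# the left (outer) side is the problematic one here.
joined = left.join(
    right,
    expr("leftId = rightId AND "
         "rightTime BETWEEN leftTime AND leftTime + interval 5 seconds"),
    "leftOuter",
)
{code}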






[jira] [Assigned] (SPARK-38684) Stream-stream outer join has a possible correctness issue due to weakly read consistent on outer iterators

2022-04-01 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-38684:


Assignee: Jungtaek Lim

> Stream-stream outer join has a possible correctness issue due to weakly read 
> consistent on outer iterators
> --
>
> Key: SPARK-38684
> URL: https://issues.apache.org/jira/browse/SPARK-38684
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
>
> We found that stream-stream join has the same issue as SPARK-38320 on the 
> appended iterators. Since the root cause is the same as SPARK-38320, this is 
> only reproducible with the RocksDB state store provider; but even with the 
> HDFS-backed state store provider, the behavior is not guaranteed by the 
> interface contract, and hence may depend on the JVM vendor, version, etc.
> I can easily construct a "data loss" scenario in the state store.
> The conditions are:
>  * Use a stream-stream time-interval outer join
>  ** a left outer join has the issue on the left side, a right outer join on 
> the right side, and a full outer join on both sides
>  * At batch N, produce non-late row(s) on the problematic side
>  * In the same batch (batch N), some row(s) on the problematic side are 
> evicted by the watermark condition
> When these conditions are fulfilled, keyToNumValues goes out of sync between 
> the state and the iterator in the eviction phase. If eviction happens for 
> the grouping key (updating keyToNumValues), the eviction phase "overwrites" 
> keyToNumValues in the state with the value it calculates.
> Since the eviction phase "does not know" about the new rows (its 
> keyToNumValues is out of sync), this effectively discards all rows added to 
> the state in batch N.






[jira] [Created] (SPARK-38761) DS V2 supports push down misc non-aggregate functions

2022-04-01 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-38761:
--

 Summary: DS V2 supports push down misc non-aggregate functions
 Key: SPARK-38761
 URL: https://issues.apache.org/jira/browse/SPARK-38761
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Currently, Spark has a number of miscellaneous non-aggregate functions from 
the ANSI standard:
abs,
coalesce,
nullif,
when

DS V2 should support pushing these misc non-aggregate functions down.
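
A hedged illustration of the kind of query this would affect (the table and 
column names are hypothetical; whether a given expression is pushed down 
depends on the connector):

{code:python}
# Once supported, filters like these could be evaluated by the DS V2 source
# (e.g. a JDBC database) instead of inside Spark.
spark.sql("SELECT * FROM jdbc_table WHERE abs(dept_id) > 10")
spark.sql("SELECT * FROM jdbc_table WHERE coalesce(bonus, salary) > 1000")
{code}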






[jira] [Commented] (SPARK-38759) Add StreamingQueryListener support in PySpark

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515862#comment-17515862
 ] 

Apache Spark commented on SPARK-38759:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36038

> Add StreamingQueryListener support in PySpark
> -
>
> Key: SPARK-38759
> URL: https://issues.apache.org/jira/browse/SPARK-38759
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> PySpark currently does not support {{StreamingQueryListener}}, whereas 
> DStream has it. This feature is important, especially together with 
> {{Dataset.observe}}, so users can monitor what's going on in their queries.






[jira] [Commented] (SPARK-38761) DS V2 supports push down misc non-aggregate functions

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515861#comment-17515861
 ] 

Apache Spark commented on SPARK-38761:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/36039

> DS V2 supports push down misc non-aggregate functions
> -
>
> Key: SPARK-38761
> URL: https://issues.apache.org/jira/browse/SPARK-38761
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark has a number of miscellaneous non-aggregate functions from 
> the ANSI standard:
> abs,
> coalesce,
> nullif,
> when
> DS V2 should support pushing these misc non-aggregate functions down.






[jira] [Assigned] (SPARK-38759) Add StreamingQueryListener support in PySpark

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38759:


Assignee: Apache Spark

> Add StreamingQueryListener support in PySpark
> -
>
> Key: SPARK-38759
> URL: https://issues.apache.org/jira/browse/SPARK-38759
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> PySpark currently does not support {{StreamingQueryListener}}, whereas 
> DStream has it. This feature is important, especially together with 
> {{Dataset.observe}}, so users can monitor what's going on in their queries.






[jira] [Assigned] (SPARK-38761) DS V2 supports push down misc non-aggregate functions

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38761:


Assignee: (was: Apache Spark)

> DS V2 supports push down misc non-aggregate functions
> -
>
> Key: SPARK-38761
> URL: https://issues.apache.org/jira/browse/SPARK-38761
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark has a number of miscellaneous non-aggregate functions from 
> the ANSI standard:
> abs,
> coalesce,
> nullif,
> when
> DS V2 should support pushing these misc non-aggregate functions down.






[jira] [Assigned] (SPARK-38759) Add StreamingQueryListener support in PySpark

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38759:


Assignee: (was: Apache Spark)

> Add StreamingQueryListener support in PySpark
> -
>
> Key: SPARK-38759
> URL: https://issues.apache.org/jira/browse/SPARK-38759
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> PySpark currently does not support {{StreamingQueryListener}}, whereas 
> DStream has it. This feature is important, especially together with 
> {{Dataset.observe}}, so users can monitor what's going on in their queries.






[jira] [Assigned] (SPARK-38761) DS V2 supports push down misc non-aggregate functions

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38761:


Assignee: Apache Spark

> DS V2 supports push down misc non-aggregate functions
> -
>
> Key: SPARK-38761
> URL: https://issues.apache.org/jira/browse/SPARK-38761
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark has a number of miscellaneous non-aggregate functions from 
> the ANSI standard:
> abs,
> coalesce,
> nullif,
> when
> DS V2 should support pushing these misc non-aggregate functions down.






[jira] [Assigned] (SPARK-38758) Web UI add heap dump

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38758:


Assignee: (was: Apache Spark)

> Web UI add heap dump 
> -
>
> Key: SPARK-38758
> URL: https://issues.apache.org/jira/browse/SPARK-38758
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Jinpeng Chi
>Priority: Major
>
> The current Web UI can take thread dumps, so I want to add a heap (memory) dump as well.






[jira] [Assigned] (SPARK-38758) Web UI add heap dump

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38758:


Assignee: Apache Spark

> Web UI add heap dump 
> -
>
> Key: SPARK-38758
> URL: https://issues.apache.org/jira/browse/SPARK-38758
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Jinpeng Chi
>Assignee: Apache Spark
>Priority: Major
>
> The current Web UI can take thread dumps, so I want to add a heap (memory) dump as well.






[jira] [Commented] (SPARK-38758) Web UI add heap dump

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515865#comment-17515865
 ] 

Apache Spark commented on SPARK-38758:
--

User 'cutiechi' has created a pull request for this issue:
https://github.com/apache/spark/pull/36037

> Web UI add heap dump 
> -
>
> Key: SPARK-38758
> URL: https://issues.apache.org/jira/browse/SPARK-38758
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Jinpeng Chi
>Priority: Major
>
> The current Web UI can take thread dumps, so I want to add a heap (memory) dump as well.






[jira] [Created] (SPARK-38762) Provide query context in Decimal overflow errors

2022-04-01 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38762:
--

 Summary: Provide query context in Decimal overflow errors
 Key: SPARK-38762
 URL: https://issues.apache.org/jira/browse/SPARK-38762
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang









[jira] [Assigned] (SPARK-38762) Provide query context in Decimal overflow errors

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38762:


Assignee: Apache Spark  (was: Gengliang Wang)

> Provide query context in Decimal overflow errors
> 
>
> Key: SPARK-38762
> URL: https://issues.apache.org/jira/browse/SPARK-38762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-38762) Provide query context in Decimal overflow errors

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515890#comment-17515890
 ] 

Apache Spark commented on SPARK-38762:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36040

> Provide query context in Decimal overflow errors
> 
>
> Key: SPARK-38762
> URL: https://issues.apache.org/jira/browse/SPARK-38762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-38762) Provide query context in Decimal overflow errors

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515889#comment-17515889
 ] 

Apache Spark commented on SPARK-38762:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36040

> Provide query context in Decimal overflow errors
> 
>
> Key: SPARK-38762
> URL: https://issues.apache.org/jira/browse/SPARK-38762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-38762) Provide query context in Decimal overflow errors

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38762:


Assignee: Gengliang Wang  (was: Apache Spark)

> Provide query context in Decimal overflow errors
> 
>
> Key: SPARK-38762
> URL: https://issues.apache.org/jira/browse/SPARK-38762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-38703) High GC and memory footprint after switch to ZSTD

2022-04-01 Thread Cheng Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515907#comment-17515907
 ] 

Cheng Pan commented on SPARK-38703:
---

SPARK-34390 may help; our benchmark of 1T TPC-DS shows the benefits 
(zstd compression in shuffle, not Parquet).

{code:bash}
+------------------+----------------------+----------------------+----------------+
| lz4              | sum(task_cpu_time_s) | sum(task_run_time_s) | sum(gc_time_s) |
+------------------+----------------------+----------------------+----------------+
| lz4              | 1871242.5            | 3861923.8            | 197151.5       |
| zstd             | 1989641.6            | 3326399.8            | 244333.2       |
| zstd_buffer_pool | 1912032.0            | 3342339.4            | 187262.3       |
+------------------+----------------------+----------------------+----------------+
{code}
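
For completeness, a hedged sketch of enabling the buffer-pool variant from 
SPARK-34390 (config key taken from that ticket; verify its availability and 
default in your Spark version):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.io.compression.codec", "zstd")
         .config("spark.io.compression.zstd.bufferPool.enabled", "true")
         .getOrCreate())
{code}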





> High GC and memory footprint after switch to ZSTD
> -
>
> Key: SPARK-38703
> URL: https://issues.apache.org/jira/browse/SPARK-38703
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 3.1.2
>Reporter: Michael Taranov
>Priority: Major
>
> Hi All,
> We started to switch our Spark pipelines to read Parquet with ZSTD 
> compression.
> After the switch we see that the memory footprint is much larger than it 
> previously was with SNAPPY.
> Additionally, the GC stats of the jobs are much higher compared to SNAPPY 
> with the same workload as before.
> Are there any configurations relevant to the read path that may help in such 
> cases?






[jira] [Comment Edited] (SPARK-38703) High GC and memory footprint after switch to ZSTD

2022-04-01 Thread Cheng Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515907#comment-17515907
 ] 

Cheng Pan edited comment on SPARK-38703 at 4/1/22 12:55 PM:


SPARK-34390 may help; our benchmark of 1T TPC-DS shows the benefits 
(zstd compression in shuffle, not Parquet).

{code:bash}
+------------------+----------------------+----------------------+----------------+
| lz4              | sum(task_cpu_time_s) | sum(task_run_time_s) | sum(gc_time_s) |
+------------------+----------------------+----------------------+----------------+
| lz4              | 1871242.5            | 3861923.8            | 197151.5       |
| zstd             | 1989641.6            | 3326399.8            | 244333.2       |
| zstd_buffer_pool | 1912032.0            | 3342339.4            | 187262.3       |
+------------------+----------------------+----------------------+----------------+
{code}






was (Author: pan3793):
SPARK-34390 may helps, our benchmark of 1T TPC-DS shows the benefits. 
(compression using zstd in shuffle, not parquet)

{code:bash}
+------------------+----------------------+----------------------+----------------+
| lz4              | sum(task_cpu_time_s) | sum(task_run_time_s) | sum(gc_time_s) |
+------------------+----------------------+----------------------+----------------+
| lz4              | 1871242.5            | 3861923.8            | 197151.5       |
| zstd             | 1989641.6            | 3326399.8            | 244333.2       |
| zstd_buffer_pool | 1912032.0            | 3342339.4            | 187262.3       |
+------------------+----------------------+----------------------+----------------+
{code}





> High GC and memory footprint after switch to ZSTD
> -
>
> Key: SPARK-38703
> URL: https://issues.apache.org/jira/browse/SPARK-38703
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 3.1.2
>Reporter: Michael Taranov
>Priority: Major
>
> Hi All,
> We started to switch our Spark pipelines to read Parquet with ZSTD 
> compression.
> After the switch we see that the memory footprint is much larger than it 
> previously was with SNAPPY.
> Additionally, the GC stats of the jobs are much higher compared to SNAPPY 
> with the same workload as before.
> Are there any configurations relevant to the read path that may help in such 
> cases?






[jira] [Comment Edited] (SPARK-38703) High GC and memory footprint after switch to ZSTD

2022-04-01 Thread Cheng Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515907#comment-17515907
 ] 

Cheng Pan edited comment on SPARK-38703 at 4/1/22 12:56 PM:


SPARK-34390 may help; our benchmark of 1T TPC-DS shows the benefits 
(zstd compression in shuffle, not Parquet).

{code:bash}
+------------------+----------------------+----------------------+----------------+
| compression      | sum(task_cpu_time_s) | sum(task_run_time_s) | sum(gc_time_s) |
+------------------+----------------------+----------------------+----------------+
| lz4              | 1871242.5            | 3861923.8            | 197151.5       |
| zstd             | 1989641.6            | 3326399.8            | 244333.2       |
| zstd_buffer_pool | 1912032.0            | 3342339.4            | 187262.3       |
+------------------+----------------------+----------------------+----------------+
{code}






was (Author: pan3793):
SPARK-34390 may help, our benchmark of 1T TPC-DS shows the benefits. 
(compression using zstd in shuffle, not parquet)

{code:bash}
+------------------+----------------------+----------------------+----------------+
| lz4              | sum(task_cpu_time_s) | sum(task_run_time_s) | sum(gc_time_s) |
+------------------+----------------------+----------------------+----------------+
| lz4              | 1871242.5            | 3861923.8            | 197151.5       |
| zstd             | 1989641.6            | 3326399.8            | 244333.2       |
| zstd_buffer_pool | 1912032.0            | 3342339.4            | 187262.3       |
+------------------+----------------------+----------------------+----------------+
{code}





> High GC and memory footprint after switch to ZSTD
> -
>
> Key: SPARK-38703
> URL: https://issues.apache.org/jira/browse/SPARK-38703
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 3.1.2
>Reporter: Michael Taranov
>Priority: Major
>
> Hi All,
> We started to switch our Spark pipelines to read Parquet with ZSTD 
> compression.
> After the switch we see that the memory footprint is much larger than it 
> previously was with SNAPPY.
> Additionally, the GC stats of the jobs are much higher compared to SNAPPY 
> with the same workload as before.
> Are there any configurations relevant to the read path that may help in such 
> cases?






[jira] [Updated] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results

2022-04-01 Thread ZygD (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZygD updated SPARK-38614:
-
Component/s: SQL

> After Spark update, df.show() shows incorrect F.percent_rank results
> 
>
> Key: SPARK-38614
> URL: https://issues.apache.org/jira/browse/SPARK-38614
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: ZygD
>Priority: Major
>  Labels: correctness
>
> The expected result is obtained with Spark 3.1.2, but not with 3.2.0 or 3.2.1.
> *Minimal reproducible example*
> {code:java}
> from pyspark.sql import SparkSession, functions as F, Window as W
> spark = SparkSession.builder.getOrCreate()
>  
> df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
> df.show(3)
> df.show(5)
> {code}
> *Expected result*
> {code:java}
> +---+----+
> | id|  pr|
> +---+----+
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> +---+----+
> only showing top 3 rows
> +---+----+
> | id|  pr|
> +---+----+
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> |  3|0.03|
> |  4|0.04|
> +---+----+
> only showing top 5 rows
> {code}
> *Actual result*
> {code:java}
> +---+--+
> | id|pr|
> +---+--+
> |  0|   0.0|
> |  1|0.|
> |  2|0.|
> +---+--+
> only showing top 3 rows
> +---+---+
> | id| pr|
> +---+---+
> |  0|0.0|
> |  1|0.2|
> |  2|0.4|
> |  3|0.6|
> |  4|0.8|
> +---+---+
> only showing top 5 rows
> {code}






[jira] [Created] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Jira
Bjørn Jørgensen created SPARK-38763:
---

 Summary: Pandas API on Spark can't apply lambda to columns.
 Key: SPARK-38763
 URL: https://issues.apache.org/jira/browse/SPARK-38763
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0, 3.4.0
Reporter: Bjørn Jørgensen


With a Spark master build from 08 November 2021, I could use this code to 
rename columns:

{code:java}
pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
{code}

But now I get this error when I use this code:

---
ValueErrorTraceback (most recent call last)
Input In [5], in ()
> 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
  2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
  3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))

File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, 
mapper, index, columns, axis, inplace, level, errors)
  10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = 
gen_mapper_fn(
  10633 index
  10634 )
  10635 if columns:
> 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
  10638 if not index and not columns:
  10639 raise ValueError("Either `index` or `columns` should be provided.")

File /opt/spark/python/pyspark/pandas/frame.py:10603, in 
DataFrame.rename.<locals>.gen_mapper_fn(mapper)
  10601 elif callable(mapper):
  10602 mapper_callable = cast(Callable, mapper)
> 10603 return_type = cast(ScalarType, infer_return_type(mapper))
  10604 dtype = return_type.dtype
  10605 spark_return_type = return_type.spark_type

File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in 
infer_return_type(f)
560 tpe = get_type_hints(f).get("return", None)
562 if tpe is None:
--> 563 raise ValueError("A return value is required for the input 
function")
565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, 
SeriesType):
566 tpe = tpe.__args__[0]

ValueError: A return value is required for the input function









[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515991#comment-17515991
 ] 

Bjørn Jørgensen commented on SPARK-38763:
-

[~XinrongM]

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> With a Spark master build from 08 November 2021, I could use this code to 
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I use this code:
> ---
> ValueErrorTraceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', 
> x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in 
> DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = 
> gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be 
> provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in 
> DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in 
> infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input 
> function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, 
> SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function






[jira] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Jira


[ https://issues.apache.org/jira/browse/SPARK-38763 ]


Bjørn Jørgensen deleted comment on SPARK-38763:
-

was (Author: bjornjorgensen):
[~XinrongM]

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> With a Spark master build from 08 November 2021, I could use this code to 
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I use this code:
> ---
> ValueErrorTraceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', 
> x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in 
> DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = 
> gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be 
> provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in 
> DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in 
> infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input 
> function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, 
> SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function






[jira] [Commented] (SPARK-37903) Replace string_typehints with get_type_hints.

2022-04-01 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-37903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516012#comment-17516012
 ] 

Bjørn Jørgensen commented on SPARK-37903:
-

I now have some problems when I use a lambda on columns: 
[SPARK-38763|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-38763]

> Replace string_typehints with get_type_hints.
> -
>
> Key: SPARK-37903
> URL: https://issues.apache.org/jira/browse/SPARK-37903
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently we have a hacky way to resolve type hints written as strings, but 
> we can use {{get_type_hints}} instead.
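
A minimal illustration of the replacement (standard-library behavior, shown 
here for context):

{code:python}
from typing import get_type_hints

# String annotations are resolved to real types by get_type_hints,
# removing the need for a custom string-typehint parser.
def f(x: "int") -> "str":
    return str(x)

print(get_type_hints(f))  # {'x': <class 'int'>, 'return': <class 'str'>}
{code}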






[jira] [Created] (SPARK-38764) spark thrift server issue: Length field is empty for varchar fields

2022-04-01 Thread Ayan Ray (Jira)
Ayan Ray created SPARK-38764:


 Summary: spark thrift server issue: Length field is empty for 
varchar fields
 Key: SPARK-38764
 URL: https://issues.apache.org/jira/browse/SPARK-38764
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Ayan Ray


I am trying to read data from Spark Thrift Server using SAS. In the table 
definition shown through DBeaver, I see that the *Length* field is empty only 
for fields with the *VARCHAR* data type. I can see the length in the Data Type 
field as {*}varchar(32){*}, but that doesn't serve my purpose, as the SAS 
application taps into the Length field. Since this field is not populated, SAS 
defaults to the maximum size and as a result becomes extremely slow. The 
Length field is populated in Hive.






[jira] [Created] (SPARK-38765) Implement `inplace` parameter of `Series.clip`

2022-04-01 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38765:


 Summary: Implement `inplace` parameter of `Series.clip`
 Key: SPARK-38765
 URL: https://issues.apache.org/jira/browse/SPARK-38765
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `inplace` parameter of `Series.clip`
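
A sketch of the intended semantics, following pandas.Series.clip (the 
pandas-on-Spark call shown is the proposal, not yet implemented):

{code:python}
import pyspark.pandas as ps

s = ps.Series([0, 5, 10])
# With inplace=True, clip would mutate s to [1, 5, 8] and return None,
# matching pandas semantics.
s.clip(1, 8, inplace=True)
{code}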






[jira] [Commented] (SPARK-38765) Implement `inplace` parameter of `Series.clip`

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516074#comment-17516074
 ] 

Apache Spark commented on SPARK-38765:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36041

> Implement `inplace` parameter of `Series.clip`
> --
>
> Key: SPARK-38765
> URL: https://issues.apache.org/jira/browse/SPARK-38765
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `inplace` parameter of `Series.clip`






[jira] [Assigned] (SPARK-38765) Implement `inplace` parameter of `Series.clip`

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38765:


Assignee: (was: Apache Spark)

> Implement `inplace` parameter of `Series.clip`
> --
>
> Key: SPARK-38765
> URL: https://issues.apache.org/jira/browse/SPARK-38765
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `inplace` parameter of `Series.clip`






[jira] [Assigned] (SPARK-38765) Implement `inplace` parameter of `Series.clip`

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38765:


Assignee: Apache Spark

> Implement `inplace` parameter of `Series.clip`
> --
>
> Key: SPARK-38765
> URL: https://issues.apache.org/jira/browse/SPARK-38765
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Implement `inplace` parameter of `Series.clip`






[jira] [Commented] (SPARK-38765) Implement `inplace` parameter of `Series.clip`

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516075#comment-17516075
 ] 

Apache Spark commented on SPARK-38765:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36041

> Implement `inplace` parameter of `Series.clip`
> --
>
> Key: SPARK-38765
> URL: https://issues.apache.org/jira/browse/SPARK-38765
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `inplace` parameter of `Series.clip`






[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516087#comment-17516087
 ] 

Bjørn Jørgensen commented on SPARK-38763:
-

This error comes from https://github.com/apache/spark/pull/35236,

in typehints.py, line 562:

{code:java}
if tpe is None:
raise ValueError("A return value is required for the input function")
{code}


> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> With a Spark master build from 08 November 2021, I could use this code to 
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I use this code:
> ---
> ValueErrorTraceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', 
> x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in 
> DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = 
> gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be 
> provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in 
> DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in 
> infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input 
> function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, 
> SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function






[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516172#comment-17516172
 ] 

Xinrong Meng commented on SPARK-38763:
--

Hi [~bjornjorgensen], thanks for raising that!

The workaround is to use a function with a return type rather than a lambda.

I am fixing this in https://issues.apache.org/jira/browse/SPARK-38766.
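
A sketch of the workaround ({{pf05}} and the pattern come from the report 
above):

{code:python}
import re

# A named function with an explicit return type hint lets infer_return_type
# resolve the return type, unlike a bare lambda.
def strip_prefix(x) -> str:
    return re.sub('DOFFIN_ESENDERS:', '', x)

pf05 = pf05.rename(columns=strip_prefix)
{code}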



> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> With a Spark master build from 08 November 2021, I could use this code to 
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I use this code:
> ---
> ValueErrorTraceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', 
> x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in 
> DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = 
> gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be 
> provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in 
> DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in 
> infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input 
> function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, 
> SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function






[jira] [Commented] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`

2022-04-01 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516173#comment-17516173
 ] 

Xinrong Meng commented on SPARK-38766:
--

I am working on that.

> Support lambda `column` parameter of `DataFrame.rename`
> ---
>
> Key: SPARK-38766
> URL: https://issues.apache.org/jira/browse/SPARK-38766
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support lambda `column` parameter of `DataFrame.rename`.
> The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.






[jira] [Created] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`

2022-04-01 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38766:


 Summary: Support lambda `column` parameter of `DataFrame.rename`
 Key: SPARK-38766
 URL: https://issues.apache.org/jira/browse/SPARK-38766
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Support lambda `column` parameter of `DataFrame.rename`.

The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38766) Support lambda `column` parameter of `DataFrame.rename`

2022-04-01 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-38766.
--
Resolution: Duplicate

> Support lambda `column` parameter of `DataFrame.rename`
> ---
>
> Key: SPARK-38766
> URL: https://issues.apache.org/jira/browse/SPARK-38766
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Support lambda `column` parameter of `DataFrame.rename`.
> The issue was detected in https://issues.apache.org/jira/browse/SPARK-38763.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516172#comment-17516172
 ] 

Xinrong Meng edited comment on SPARK-38763 at 4/1/22 11:56 PM:
---

Hi [~bjornjorgensen], thanks for raising that!

The workaround is to use a function with a return type annotation rather than a lambda.

I am fixing this now.




was (Author: xinrongm):
Hi [~bjornjorgensen], thanks for raising that!

The workaround is to use a function with a return type annotation rather than a lambda.

I am fixing this in https://issues.apache.org/jira/browse/SPARK-38766.



> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> When I use a Spark master build from 08 November 2021 I can use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError  Traceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516175#comment-17516175
 ] 

Apache Spark commented on SPARK-38763:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36042

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> When I use a Spark master build from 08 November 2021 I can use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError  Traceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38763:


Assignee: Apache Spark

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Major
>
> When I use a Spark master build from 08 November 2021 I can use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError  Traceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38763:


Assignee: (was: Apache Spark)

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> When I use a Spark master build from 08 November 2021 I can use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError  Traceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516176#comment-17516176
 ] 

Apache Spark commented on SPARK-38763:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36042

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> When I use a Spark master build from 08 November 2021 I can use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError  Traceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38763) Pandas API on Spark can't apply lambda to columns.

2022-04-01 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516177#comment-17516177
 ] 

Xinrong Meng commented on SPARK-38763:
--

I will backport the fix after it is approved and merged.

> Pandas API on Spark can't apply lambda to columns.
> ---
>
> Key: SPARK-38763
> URL: https://issues.apache.org/jira/browse/SPARK-38763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> When I use a Spark master build from 08 November 2021 I can use this code to
> rename columns:
> {code:java}
> pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
> pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> {code}
> But now I get this error when I run the same code:
> ---
> ValueError  Traceback (most recent call last)
> Input In [5], in ()
> > 1 pf05 = pf05.rename(columns=lambda x: re.sub('DOFFIN_ESENDERS:', '', x))
>   2 pf05 = pf05.rename(columns=lambda x: re.sub('FORM_SECTION:', '', x))
>   3 pf05 = pf05.rename(columns=lambda x: re.sub('F05_2014:', '', x))
> File /opt/spark/python/pyspark/pandas/frame.py:10636, in DataFrame.rename(self, mapper, index, columns, axis, inplace, level, errors)
>   10632 index_mapper_fn, index_mapper_ret_dtype, index_mapper_ret_stype = gen_mapper_fn(
>   10633 index
>   10634 )
>   10635 if columns:
> > 10636 columns_mapper_fn, _, _ = gen_mapper_fn(columns)
>   10638 if not index and not columns:
>   10639 raise ValueError("Either `index` or `columns` should be provided.")
> File /opt/spark/python/pyspark/pandas/frame.py:10603, in DataFrame.rename.<locals>.gen_mapper_fn(mapper)
>   10601 elif callable(mapper):
>   10602 mapper_callable = cast(Callable, mapper)
> > 10603 return_type = cast(ScalarType, infer_return_type(mapper))
>   10604 dtype = return_type.dtype
>   10605 spark_return_type = return_type.spark_type
> File /opt/spark/python/pyspark/pandas/typedef/typehints.py:563, in infer_return_type(f)
> 560 tpe = get_type_hints(f).get("return", None)
> 562 if tpe is None:
> --> 563 raise ValueError("A return value is required for the input function")
> 565 if hasattr(tpe, "__origin__") and issubclass(tpe.__origin__, SeriesType):
> 566 tpe = tpe.__args__[0]
> ValueError: A return value is required for the input function



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options

2022-04-01 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-38767:
---

 Summary: Support ignoreCorruptFiles and ignoreMissingFiles in Data 
Source options
 Key: SPARK-38767
 URL: https://issues.apache.org/jira/browse/SPARK-38767
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: Yaohua Zhao
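
For context, a sketch of the existing session-wide configs next to the proposed per-read options (the option names are an assumption that they mirror the spark.sql.files.* configs, not the final API):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Existing behavior: session-wide SQL configs.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

# Proposed: the same switches as per-read data source options
# (hypothetical option names until the change lands).
df = (spark.read
      .option("ignoreCorruptFiles", "true")
      .option("ignoreMissingFiles", "true")
      .parquet("/data/events"))  # placeholder path
{code}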






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38620) Replace `value.formatted(formatString)` with `formatString.format(value)` to clean up compilation warning

2022-04-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38620.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35930
[https://github.com/apache/spark/pull/35930]

> Replace `value.formatted(formatString)` with `formatString.format(value)` to 
> clean up compilation warning
> -
>
> Key: SPARK-38620
> URL: https://issues.apache.org/jira/browse/SPARK-38620
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> There are compile warnings as follows:
> {code:java}
> [WARNING] 
> /spark-source/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala:67:
>  [deprecation @ 
> org.apache.spark.streaming.ui.RecordRateUIData.formattedAvg.$anonfun | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters.
> [WARNING] 
> /spark-source/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryPage.scala:201:
>  [deprecation @ 
> org.apache.spark.sql.streaming.ui.StreamingQueryPagedTable.row | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters.
> [WARNING] 
> /spark-source/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryPage.scala:202:
>  [deprecation @ 
> org.apache.spark.sql.streaming.ui.StreamingQueryPagedTable.row | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters. {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38620) Replace `value.formatted(formatString)` with `formatString.format(value)` to clean up compilation warning

2022-04-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-38620:
-
Priority: Trivial  (was: Minor)

> Replace `value.formatted(formatString)` with `formatString.format(value)` to 
> clean up compilation warning
> -
>
> Key: SPARK-38620
> URL: https://issues.apache.org/jira/browse/SPARK-38620
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.4.0
>
>
> There are compile warnings as follows:
> {code:java}
> [WARNING] 
> /spark-source/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala:67:
>  [deprecation @ 
> org.apache.spark.streaming.ui.RecordRateUIData.formattedAvg.$anonfun | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters.
> [WARNING] 
> /spark-source/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryPage.scala:201:
>  [deprecation @ 
> org.apache.spark.sql.streaming.ui.StreamingQueryPagedTable.row | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters.
> [WARNING] 
> /spark-source/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryPage.scala:202:
>  [deprecation @ 
> org.apache.spark.sql.streaming.ui.StreamingQueryPagedTable.row | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters. {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38620) Replace `value.formatted(formatString)` with `formatString.format(value)` to clean up compilation warning

2022-04-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-38620:


Assignee: Yang Jie

> Replace `value.formatted(formatString)` with `formatString.format(value)` to 
> clean up compilation warning
> -
>
> Key: SPARK-38620
> URL: https://issues.apache.org/jira/browse/SPARK-38620
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> There are compile warnings as follows:
> {code:java}
> [WARNING] 
> /spark-source/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala:67:
>  [deprecation @ 
> org.apache.spark.streaming.ui.RecordRateUIData.formattedAvg.$anonfun | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters.
> [WARNING] 
> /spark-source/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryPage.scala:201:
>  [deprecation @ 
> org.apache.spark.sql.streaming.ui.StreamingQueryPagedTable.row | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters.
> [WARNING] 
> /spark-source/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryPage.scala:202:
>  [deprecation @ 
> org.apache.spark.sql.streaming.ui.StreamingQueryPagedTable.row | 
> origin=scala.Predef.StringFormat.formatted | version=2.12.16] method 
> formatted in class StringFormat is deprecated (since 2.12.16): Use 
> `formatString.format(value)` instead of `value.formatted(formatString)`,
> or use the `f""` string interpolator. In Java 15 and later, `formatted` 
> resolves to the new method in String which has reversed parameters. {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34863) Support nested column in Spark Parquet vectorized readers

2022-04-01 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34863.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34659
[https://github.com/apache/spark/pull/34659]

> Support nested column in Spark Parquet vectorized readers
> -
>
> Key: SPARK-34863
> URL: https://issues.apache.org/jira/browse/SPARK-34863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> The task is to support nested column types in the Spark Parquet vectorized reader. 
> Currently the Parquet vectorized reader does not support nested column types 
> (struct, array and map). We implemented a nested column vectorized reader for 
> FB-ORC in our internal fork of Spark. We are seeing performance improvements 
> compared to the non-vectorized reader when reading nested columns. In addition, 
> this can also help improve non-nested column performance when reading 
> non-nested and nested columns together in one query.
>  
> Parquet: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L173]
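
A sketch of the kind of read this covers (the config name below is an assumption about the flag gating the new nested-column path):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: flag enabling the nested-column vectorized Parquet reader.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

# Write a file with struct, array and map columns.
spark.range(10).selectExpr(
    "named_struct('a', id, 'b', cast(id AS string)) AS s",
    "array(id, id + 1) AS arr",
    "map('k', id) AS m",
).write.mode("overwrite").parquet("/tmp/nested_demo")  # placeholder path

# Nested-column reads can now stay on the vectorized path.
spark.read.parquet("/tmp/nested_demo").select("s.a", "arr", "m").show()
{code}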



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38768) If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again

2022-04-01 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-38768:
--

 Summary: If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again
 Key: SPARK-38768
 URL: https://issues.apache.org/jira/browse/SPARK-38768
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng
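
To illustrate the redundancy (a sketch, assuming a DS V2 JDBC source with limit pushdown enabled; the URL and option usage are placeholders, not a verified setup):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/shop")  # placeholder URL
      .option("dbtable", "orders")
      .option("pushDownLimit", "true")  # assumes the JDBC limit-pushdown option
      .load())

# With LIMIT 10 pushed to the database and only one input partition, the
# source already returns at most 10 rows, so the extra Spark-side limit
# is the redundant step this issue proposes to remove.
df.limit(10).show()
{code}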






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0

2022-04-01 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38769:

Summary: [SQL] behavior of schema_of_json not same with 2.4.0  (was: [SQL] 
behavior schema_of_json not same with 2.4.0)

> [SQL] behavior of schema_of_json not same with 2.4.0
> 
>
> Key: SPARK-38769
> URL: https://issues.apache.org/jira/browse/SPARK-38769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: gabrywu
>Priority: Minor
>
> When I switched from Spark 2.4.0 to Spark 3.1.1, I found a built-in function that now throws an error:
> == Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2;
> But schema_of_json worked well in 2.4.0. So is this a bug, or a new behavior that no longer supports non-literal expressions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38769) [SQL] behavior schema_of_json not same with 2.4.0

2022-04-01 Thread gabrywu (Jira)
gabrywu created SPARK-38769:
---

 Summary: [SQL] behavior schema_of_json not same with 2.4.0
 Key: SPARK-38769
 URL: https://issues.apache.org/jira/browse/SPARK-38769
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: gabrywu


When I switched from Spark 2.4.0 to Spark 3.1.1, I found a built-in function that now throws an error:

== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2;

But schema_of_json worked well in 2.4.0. So is this a bug, or a new behavior that no longer supports non-literal expressions?
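
A minimal repro sketch (the DataFrame content is a stand-in for the reporter's data):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('{"all_model_scores": {"m1": 0.9}}',)], ["adtnl_info_txt"])

# Works on 3.x: the argument is a foldable string literal.
df.select(F.schema_of_json(F.lit('{"m1": 0.9}'))).show(truncate=False)

# Raises the AnalysisException above on 3.1.1, while the reporter saw the
# same non-literal argument work on 2.4.0.
df.select(F.schema_of_json(F.get_json_object("adtnl_info_txt", "$.all_model_scores"))).show()
{code}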



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38768) If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38768:


Assignee: Apache Spark

> If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again
> ---
>
> Key: SPARK-38768
> URL: https://issues.apache.org/jira/browse/SPARK-38768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38768) If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516210#comment-17516210
 ] 

Apache Spark commented on SPARK-38768:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/36043

> If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again
> ---
>
> Key: SPARK-38768
> URL: https://issues.apache.org/jira/browse/SPARK-38768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38768) If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38768:


Assignee: Apache Spark

> If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again
> ---
>
> Key: SPARK-38768
> URL: https://issues.apache.org/jira/browse/SPARK-38768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38768) If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516209#comment-17516209
 ] 

Apache Spark commented on SPARK-38768:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/36043

> If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again
> ---
>
> Key: SPARK-38768
> URL: https://issues.apache.org/jira/browse/SPARK-38768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38768) If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38768:


Assignee: (was: Apache Spark)

> If limit can be pushed down and the data source has only one partition, DS V2 should not do limit again
> ---
>
> Key: SPARK-38768
> URL: https://issues.apache.org/jira/browse/SPARK-38768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0

2022-04-01 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516219#comment-17516219
 ] 

gabrywu commented on SPARK-38769:
-

[~maxgekk] 

> [SQL] behavior of schema_of_json not same with 2.4.0
> 
>
> Key: SPARK-38769
> URL: https://issues.apache.org/jira/browse/SPARK-38769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: gabrywu
>Priority: Minor
>
> When I switched from Spark 2.4.0 to Spark 3.1.1, I found a built-in function that now throws an error:
> == Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2;
> But schema_of_json worked well in 2.4.0. So is this a bug, or a new behavior that no longer supports non-literal expressions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38770) Simplify steps to rewrite primary resource in k8s Spark application

2022-04-01 Thread qian (Jira)
qian created SPARK-38770:


 Summary: Simplify steps to rewrite primary resource in k8s Spark application
 Key: SPARK-38770
 URL: https://issues.apache.org/jira/browse/SPARK-38770
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0
Reporter: qian
 Fix For: 3.3.0


Rewriting the primary resource calls the renameMainAppResource method twice, and the second call has no effect. So, simplify the steps that rewrite the primary resource in a k8s Spark application.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38770) Simplify steps to rewrite primary resource in k8s Spark application

2022-04-01 Thread qian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qian updated SPARK-38770:
-
Fix Version/s: 3.4.0
   (was: 3.3.0)

> Simplify steps to rewrite primary resource in k8s Spark application
> --
>
> Key: SPARK-38770
> URL: https://issues.apache.org/jira/browse/SPARK-38770
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1
>Reporter: qian
>Priority: Major
> Fix For: 3.4.0
>
>
> Rewriting the primary resource calls the renameMainAppResource method twice, and the second call has no effect. So, simplify the steps that rewrite the primary resource in a k8s Spark application.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38770) Simplify steps to rewrite primary resource in k8s Spark application

2022-04-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516228#comment-17516228
 ] 

Apache Spark commented on SPARK-38770:
--

User 'dcoliversun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36044

> Simplify steps to rewrite primary resource in k8s Spark application
> --
>
> Key: SPARK-38770
> URL: https://issues.apache.org/jira/browse/SPARK-38770
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1
>Reporter: qian
>Priority: Major
> Fix For: 3.4.0
>
>
> Rewriting the primary resource calls the renameMainAppResource method twice, and the second call has no effect. So, simplify the steps that rewrite the primary resource in a k8s Spark application.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38770) Simplify steps to rewrite primary resource in k8s Spark application

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38770:


Assignee: Apache Spark

> Simplify steps to rewrite primary resource in k8s Spark application
> --
>
> Key: SPARK-38770
> URL: https://issues.apache.org/jira/browse/SPARK-38770
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1
>Reporter: qian
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> Rewriting the primary resource calls the renameMainAppResource method twice, and the second call has no effect. So, simplify the steps that rewrite the primary resource in a k8s Spark application.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38770) Simplify steps to rewrite primary resource in k8s Spark application

2022-04-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38770:


Assignee: (was: Apache Spark)

> Simplify steps to rewrite primary resource in k8s Spark application
> --
>
> Key: SPARK-38770
> URL: https://issues.apache.org/jira/browse/SPARK-38770
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1
>Reporter: qian
>Priority: Major
> Fix For: 3.4.0
>
>
> Rewriting the primary resource calls the renameMainAppResource method twice, and the second call has no effect. So, simplify the steps that rewrite the primary resource in a k8s Spark application.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org