[jira] [Created] (SPARK-45027) Hide internal functions/variables in `pyspark.sql.functions` from auto-completion

2023-08-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45027:
-

 Summary: Hide internal functions/variables in 
`pyspark.sql.functions` from auto-completion
 Key: SPARK-45027
 URL: https://issues.apache.org/jira/browse/SPARK-45027
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Assigned] (SPARK-45024) Filter out some configs in Session Creation

2023-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-45024:
-

Assignee: Ruifeng Zheng

> Filter out some configs in Session Creation
> ---
>
> Key: SPARK-45024
> URL: https://issues.apache.org/jira/browse/SPARK-45024
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-45024) Filter out some configs in Session Creation

2023-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-45024.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42741
[https://github.com/apache/spark/pull/42741]

> Filter out some configs in Session Creation
> ---
>
> Key: SPARK-45024
> URL: https://issues.apache.org/jira/browse/SPARK-45024
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Commented] (SPARK-45024) Filter out some configs in Session Creation

2023-08-30 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760710#comment-17760710
 ] 

Snoot.io commented on SPARK-45024:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42741

> Filter out some configs in Session Creation
> ---
>
> Key: SPARK-45024
> URL: https://issues.apache.org/jira/browse/SPARK-45024
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-45024) Filter out some configs in Session Creation

2023-08-30 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760709#comment-17760709
 ] 

Snoot.io commented on SPARK-45024:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42741

> Filter out some configs in Session Creation
> ---
>
> Key: SPARK-45024
> URL: https://issues.apache.org/jira/browse/SPARK-45024
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-08-30 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760708#comment-17760708
 ] 

Snoot.io commented on SPARK-44940:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42667

> Improve performance of JSON parsing when 
> "spark.sql.json.enablePartialResults" is enabled
> -
>
> Key: SPARK-44940
> URL: https://issues.apache.org/jira/browse/SPARK-44940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Ivan Sadikov
>Priority: Major
>
> Follow-up on https://issues.apache.org/jira/browse/SPARK-40646.
> I found that JSON parsing is significantly slower due to exception creation 
> in control flow. Also, some fields are not parsed correctly, and an exception 
> is thrown in certain cases: 
> {code:java}
> Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct(rows.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct$(rows.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getStruct(rows.scala:195)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:590)
>   ... 39 more
> {code}
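A minimal sketch of exercising the flag in question (the schema and input path are hypothetical; this illustrates the feature being tuned, not the fix itself):

{code:java}
// Hedged sketch: with partial results enabled, a malformed nested field
// should yield a partially parsed row instead of an all-null row.
spark.conf.set("spark.sql.json.enablePartialResults", "true")

val df = spark.read
  .schema("a STRUCT<x: LONG>, b ARRAY<STRING>")
  .json("/path/to/mixed.json")

df.show(truncate = false)
{code}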






[jira] [Commented] (SPARK-45018) Add CalendarIntervalType to Python Client

2023-08-30 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760704#comment-17760704
 ] 

Snoot.io commented on SPARK-45018:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42743

> Add CalendarIntervalType to Python Client
> -
>
> Key: SPARK-45018
> URL: https://issues.apache.org/jira/browse/SPARK-45018
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-30 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760700#comment-17760700
 ] 

Varun Nalla commented on SPARK-44900:
-

[~yxzhang] / [~yao] any update for us?

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Varun Nalla
>Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups happen by joining 
> another DataFrame that is cached with the MEMORY_AND_DISK strategy.
> However, the size of the cached DataFrame keeps growing with every micro-batch 
> the streaming application processes, which is visible under the Storage tab.
> A similar Stack Overflow thread was already raised:
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing
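A common mitigation, sketched below, is to periodically unpersist and re-cache the lookup DataFrame so stale cached blocks are released between micro-batches; this is a hedged workaround sketch (table name hypothetical), not a confirmed fix for this ticket.

{code:java}
// Hedged sketch: refresh the cached lookup DataFrame periodically instead of
// keeping one long-lived cache across all micro-batches.
import org.apache.spark.sql.DataFrame

var lookupDF: DataFrame = spark.read.table("lookup_table").cache()

def refreshLookup(): Unit = {
  lookupDF.unpersist(blocking = true) // release the old cached blocks
  lookupDF = spark.read.table("lookup_table").cache()
}
{code}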






[jira] [Created] (SPARK-45026) non-command spark.sql should support datatypes not compatible with arrow

2023-08-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45026:
-

 Summary: non-command spark.sql should support datatypes not 
compatible with arrow
 Key: SPARK-45026
 URL: https://issues.apache.org/jira/browse/SPARK-45026
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Updated] (SPARK-45026) spark.sql should support datatypes not compatible with arrow

2023-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-45026:
--
Summary: spark.sql should support datatypes not compatible with arrow  
(was: non-command spark.sql should support datatypes not compatible with arrow)

> spark.sql should support datatypes not compatible with arrow
> 
>
> Key: SPARK-45026
> URL: https://issues.apache.org/jira/browse/SPARK-45026
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-45012) CheckAnalysis should throw inlined plan in AnalysisException

2023-08-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45012.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42729
[https://github.com/apache/spark/pull/42729]

> CheckAnalysis should throw inlined plan in AnalysisException
> 
>
> Key: SPARK-45012
> URL: https://issues.apache.org/jira/browse/SPARK-45012
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Commented] (SPARK-45025) Block manager write to memory store iterator should process thread interrupt

2023-08-30 Thread Anish Shrigondekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760677#comment-17760677
 ] 

Anish Shrigondekar commented on SPARK-45025:


cc - [~kabhwan] - PR here - [https://github.com/apache/spark/pull/42742]

Thx

> Block manager write to memory store iterator should process thread interrupt
> 
>
> Key: SPARK-45025
> URL: https://issues.apache.org/jira/browse/SPARK-45025
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Priority: Major
>
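The ticket carries no description beyond its title; below is a minimal sketch of the general pattern the title names (all names are hypothetical, not Spark's actual BlockManager internals):

{code:java}
// Hedged sketch: check the thread's interrupt flag while draining an
// iterator into a store, so a killed task stops writing promptly.
def writeAll[T](values: Iterator[T])(store: T => Unit): Unit = {
  while (values.hasNext) {
    if (Thread.currentThread().isInterrupted) {
      throw new InterruptedException("interrupted while writing to the memory store")
    }
    store(values.next())
  }
}
{code}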







[jira] [Created] (SPARK-45025) Block manager write to memory store iterator should process thread interrupt

2023-08-30 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-45025:
--

 Summary: Block manager write to memory store iterator should 
process thread interrupt
 Key: SPARK-45025
 URL: https://issues.apache.org/jira/browse/SPARK-45025
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar









[jira] [Created] (SPARK-45024) Filter out some configs in Session Creation

2023-08-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45024:
-

 Summary: Filter out some configs in Session Creation
 Key: SPARK-45024
 URL: https://issues.apache.org/jira/browse/SPARK-45024
 Project: Spark
  Issue Type: New Feature
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Assigned] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`

2023-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-45015:
-

Assignee: Ruifeng Zheng

> Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
> --
>
> Key: SPARK-45015
> URL: https://issues.apache.org/jira/browse/SPARK-45015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`

2023-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-45015.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42735
[https://github.com/apache/spark/pull/42735]

> Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
> --
>
> Key: SPARK-45015
> URL: https://issues.apache.org/jira/browse/SPARK-45015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-44971) [BUG Fix] PySpark StreamingQueryProgress fromJson

2023-08-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44971.
--
Fix Version/s: 3.5.1
 Assignee: Wei Liu
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/42686

> [BUG Fix] PySpark StreamingQueryProgress fromJson 
> -
>
> Key: SPARK-44971
> URL: https://issues.apache.org/jira/browse/SPARK-44971
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
> Fix For: 3.5.1
>
>







[jira] [Commented] (SPARK-45014) Clean up fileserver when cleaning up files, jars and archives in SparkContext

2023-08-30 Thread GridGain Integration (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760658#comment-17760658
 ] 

GridGain Integration commented on SPARK-45014:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42731

> Clean up fileserver when cleaning up files, jars and archives in SparkContext
> -
>
> Key: SPARK-45014
> URL: https://issues.apache.org/jira/browse/SPARK-45014
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In SPARK-44348, we clean up SparkContext's added files, but we don't clean up 
> the ones in the file server.
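A small illustration of the two sides involved (file path hypothetical): a file registered via addFile is tracked by the SparkContext and also served by the driver's file server, so both need cleanup.

{code:java}
// Hedged illustration: an added file shows up both in SparkContext
// bookkeeping and as a URL served by the driver's file server.
sc.addFile("/tmp/lookup.csv")
sc.listFiles().foreach(println) // prints the file-server URL for lookup.csv
{code}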






[jira] [Commented] (SPARK-45014) Clean up fileserver when cleaning up files, jars and archives in SparkContext

2023-08-30 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760657#comment-17760657
 ] 

Ignite TC Bot commented on SPARK-45014:
---

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42731

> Clean up fileserver when cleaning up files, jars and archives in SparkContext
> -
>
> Key: SPARK-45014
> URL: https://issues.apache.org/jira/browse/SPARK-45014
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In SPARK-44348, we clean up SparkContext's added files, but we don't clean up 
> the ones in the file server.






[jira] [Updated] (SPARK-45023) SPIP: Python Stored Procedures

2023-08-30 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45023:
-
Shepherd: Hyukjin Kwon

> SPIP: Python Stored Procedures
> --
>
> Key: SPARK-45023
> URL: https://issues.apache.org/jira/browse/SPARK-45023
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Stored procedures are an extension of the ANSI SQL standard. They play a 
> crucial role in improving the capabilities of SQL by encapsulating complex 
> logic into reusable routines. 
> This proposal aims to extend Spark SQL by introducing support for stored 
> procedures, starting with Python as the procedural language. This addition 
> will allow users to execute procedural programs, leveraging programming 
> constructs of Python to perform tasks with complex logic. Additionally, users 
> can persist these procedural routines in catalogs such as HMS for future 
> reuse. By providing this functionality, we intend to empower Spark users to 
> seamlessly integrate Python routines into their SQL workflows.
> {*}SPIP{*}: 
> [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]
>  






[jira] [Created] (SPARK-45023) SPIP: Python Stored Procedures

2023-08-30 Thread Allison Wang (Jira)
Allison Wang created SPARK-45023:


 Summary: SPIP: Python Stored Procedures
 Key: SPARK-45023
 URL: https://issues.apache.org/jira/browse/SPARK-45023
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 4.0.0
Reporter: Allison Wang


Stored procedures are an extension of the ANSI SQL standard. They play a 
crucial role in improving the capabilities of SQL by encapsulating complex 
logic into reusable routines. 

This proposal aims to extend Spark SQL by introducing support for stored 
procedures, starting with Python as the procedural language. This addition will 
allow users to execute procedural programs, leveraging programming constructs 
of Python to perform tasks with complex logic. Additionally, users can persist 
these procedural routines in catalogs such as HMS for future reuse. By 
providing this functionality, we intend to empower Spark users to seamlessly 
integrate Python routines into their SQL workflows.

{*}SPIP{*}: 
[https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]

 






[jira] [Commented] (SPARK-43299) JVM Client throw StreamingQueryException when error handling is implemented

2023-08-30 Thread Yihong He (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760620#comment-17760620
 ] 

Yihong He commented on SPARK-43299:
---

[~hvanhovell] Thanks for the reminder! I will make sure it works.

> JVM Client throw StreamingQueryException when error handling is implemented
> ---
>
> Key: SPARK-43299
> URL: https://issues.apache.org/jira/browse/SPARK-43299
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Priority: Major
>
> Currently the awaitTermination() method of the Connect JVM client's 
> StreamingQuery won't throw an error when there is an exception. 
>  
> In Python Connect this is handled directly by the Python client's 
> error-handling framework, but no such framework exists in the JVM client 
> right now.
>  
> We should verify that the exception is surfaced once the JVM client adds 
> error handling.
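A sketch of the behavior being verified, assuming the standard StreamingQueryException from org.apache.spark.sql.streaming and an already-started query:

{code:java}
// Hedged sketch of the expected contract once error handling lands in the
// JVM client: a failed query surfaces as StreamingQueryException from
// awaitTermination() instead of returning silently.
import org.apache.spark.sql.streaming.StreamingQueryException

try {
  query.awaitTermination() // `query` is an already-started StreamingQuery
} catch {
  case e: StreamingQueryException =>
    println(s"streaming query failed: ${e.getMessage}")
}
{code}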






[jira] [Updated] (SPARK-44991) Spark json schema inference and fromJson api having inconsistent behavior

2023-08-30 Thread nirav patel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-44991:

Description: 
The Spark JSON reader can infer the datatype of a field. I am ingesting millions 
of datapoints and generating a `DataFrameA`. What I notice is that schema 
inference marks the datatype of a field containing tons of integers and empty 
strings as a Long. That is okay behavior, as I don't set `primitivesAsString` 
because I do want primitive type inference. I store `DataFrameA` into `TableA`.

Now, this inference behavior is not respected by the `fromJson` / `from_json` 
API when I am trying to write new data to `TableA`. That is, if I read a chunk 
of new input data using 
`spark.read.schema(fromJson(getStruct(TableA)).json('/path/to/more/data')`, the 
reader complains that an empty string cannot be cast to Long.

ps - `getStruct(TableA)` is a pseudo-method that returns the `struct` of 
TableA's schema somehow, and `/path/to/more/data` is a new dataset in which 
some records have an empty string as the value for this field.

I think if the reader doesn't complain about empty strings during schema 
inference, it shouldn't complain on reading without inference either. Maybe 
treat empty as null, just as during schema inference. An empty string is a 
legal value for a String-type field but not for number-type fields, so I don't 
see any reason not to treat it as null. Another option is to add a reader 
option - treatEmptyAsNull - so it's more explicit?

ps - I marked this as a bug, but it could be better suited as an improvement.

  was:
The Spark JSON reader can infer the datatype of a field. I am ingesting millions 
of datapoints and generating a `DataFrameA`. What I notice is that schema 
inference marks the datatype of a field containing tons of integers and empty 
strings as a Long. That is okay behavior, as I don't set `primitivesAsString` 
because I do want primitive type inference. I store `DataFrameA` into `TableA`.

Now, this inference behavior is not respected by the `fromJson` / `from_json` 
API when I am trying to write new data to `TableA`. That is, if I read a chunk 
of input data using 
`spark.read.schema(fromJson(getStruct(TableA)).json('/path/to/more/data')` the 
reader complains that an empty string cannot be cast to Long. `getStruct(TableA)` 
is a pseudo-method that returns the `struct` of TableA's schema somehow, and 
`/path/to/more/data` has some records with an empty string as the value for 
this field.

I think if the reader doesn't complain about empty strings during schema 
inference, it shouldn't complain on reading without inference either. Maybe 
treat empty as null just like during schema inference, or at least give an 
additional option - treatEmptyAsNull - so it's more explicit for application 
users?

ps - I marked this as a bug, but it could be better suited as an improvement.


> Spark json schema inference and fromJson api having inconsistent behavior
> -
>
> Key: SPARK-44991
> URL: https://issues.apache.org/jira/browse/SPARK-44991
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: nirav patel
>Priority: Major
>
> The Spark JSON reader can infer the datatype of a field. I am ingesting 
> millions of datapoints and generating a `DataFrameA`. What I notice is that 
> schema inference marks the datatype of a field containing tons of integers 
> and empty strings as a Long. That is okay behavior, as I don't set 
> `primitivesAsString` because I do want primitive type inference. I store 
> `DataFrameA` into `TableA`.
> Now, this inference behavior is not respected by the `fromJson` / `from_json` 
> API when I am trying to write new data to `TableA`. That is, if I read a 
> chunk of new input data using 
> `spark.read.schema(fromJson(getStruct(TableA)).json('/path/to/more/data')`, 
> the reader complains that an empty string cannot be cast to Long.
> ps - `getStruct(TableA)` is a pseudo-method that returns the `struct` of 
> TableA's schema somehow, and `/path/to/more/data` is a new dataset in which 
> some records have an empty string as the value for this field.
> I think if the reader doesn't complain about empty strings during schema 
> inference, it shouldn't complain on reading without inference either. Maybe 
> treat empty as null, just as during schema inference. An empty string is a 
> legal value for a String-type field but not for number-type fields, so I 
> don't see any reason not to treat it as null. Another option is to add a 
> reader option - treatEmptyAsNull - so it's more explicit?
> ps - I marked this as a bug, but it could be better suited as an improvement.
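A sketch of the failing read path described above (the schema JSON literal and path are hypothetical stand-ins for the reporter's pseudo-code):

{code:java}
// Hedged sketch of the reported mismatch: rebuild TableA's schema from its
// JSON form, then read new data with it. Per the report, a record like
// {"num": ""} passes schema inference but fails this schema'd read.
import org.apache.spark.sql.types.{DataType, StructType}

val schemaJson =
  """{"type":"struct","fields":[
    |  {"name":"num","type":"long","nullable":true,"metadata":{}}]}""".stripMargin
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

val df = spark.read.schema(schema).json("/path/to/more/data")
{code}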






[jira] [Updated] (SPARK-45012) CheckAnalysis should throw inlined plan in AnalysisException

2023-08-30 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-45012:
-
Affects Version/s: 4.0.0
   (was: 3.5.0)

> CheckAnalysis should throw inlined plan in AnalysisException
> 
>
> Key: SPARK-45012
> URL: https://issues.apache.org/jira/browse/SPARK-45012
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-45016) Add missing `try_remote_functions` annotations

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45016:
-

Assignee: Ruifeng Zheng

> Add missing `try_remote_functions` annotations
> --
>
> Key: SPARK-45016
> URL: https://issues.apache.org/jira/browse/SPARK-45016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-45016) Add missing `try_remote_functions` annotations

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45016.
---
Fix Version/s: 3.5.1
   Resolution: Fixed

Issue resolved by pull request 42734
[https://github.com/apache/spark/pull/42734]

> Add missing `try_remote_functions` annotations
> --
>
> Key: SPARK-45016
> URL: https://issues.apache.org/jira/browse/SPARK-45016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.1
>
>







[jira] [Assigned] (SPARK-45017) Add CalendarIntervalType to PySpark

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45017:
-

Assignee: Ruifeng Zheng

> Add CalendarIntervalType to PySpark
> ---
>
> Key: SPARK-45017
> URL: https://issues.apache.org/jira/browse/SPARK-45017
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-45017) Add CalendarIntervalType to PySpark

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45017.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42736
[https://github.com/apache/spark/pull/42736]

> Add CalendarIntervalType to PySpark
> ---
>
> Key: SPARK-45017
> URL: https://issues.apache.org/jira/browse/SPARK-45017
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-44997) Align example order (Python -> Scala/Java -> R) in all Spark Doc Content

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44997.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42712
[https://github.com/apache/spark/pull/42712]

> Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
> 
>
> Key: SPARK-44997
> URL: https://issues.apache.org/jira/browse/SPARK-44997
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-44997) Align example order (Python -> Scala/Java -> R) in all Spark Doc Content

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44997:
-

Assignee: BingKun Pan

> Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
> 
>
> Key: SPARK-44997
> URL: https://issues.apache.org/jira/browse/SPARK-44997
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-42304) Assign name to _LEGACY_ERROR_TEMP_2189

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42304:
-

Assignee: Valentin

> Assign name to _LEGACY_ERROR_TEMP_2189
> --
>
> Key: SPARK-42304
> URL: https://issues.apache.org/jira/browse/SPARK-42304
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Valentin
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-45005) Reducing the CI time for slow pyspark-pandas-connect tests

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45005:
--
Issue Type: Test  (was: Bug)

> Reducing the CI time for slow pyspark-pandas-connect tests
> --
>
> Key: SPARK-45005
> URL: https://issues.apache.org/jira/browse/SPARK-45005
> Project: Spark
>  Issue Type: Test
>  Components: Connect, Pandas API on Spark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> The pyspark-pandas-connect tests take more than 3 hours in GitHub Actions, so 
> we might need to reduce the execution time. See 
> https://github.com/apache/spark/actions/runs/5989124806/job/16245001034






[jira] [Updated] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode is dynamic

2023-08-30 Thread Dipayan Dev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dipayan Dev updated SPARK-44884:

Priority: Minor  (was: Critical)

> Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode 
> is dynamic
> 
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dipayan Dev
>Priority: Minor
> Attachments: image-2023-08-20-18-46-53-342.png, 
> image-2023-08-25-13-01-42-137.png
>
>
> The issue does not happen in Spark 2.x (I am using 2.4.0), only in 3.3.0 
> (tested with 3.4.1 as well).
> Code to reproduce the issue:
>  
> {code:java}
> scala> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") 
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", 
> "gs://test_bucket/table").mode("overwrite").partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1")
>  {code}
>  
> The above code succeeds and creates the external Hive table, but {*}no 
> SUCCESS file is generated{*}.
> Below is the content of the bucket after table creation:
> !image-2023-08-25-13-01-42-137.png|width=500,height=130!
> The same code, when run with Spark 2.4.0 (with or without an external path), 
> generates the SUCCESS file.
> {code:java}
> scala> 
> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1"){code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!
>  
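For reference, the _SUCCESS marker is written by Hadoop's FileOutputCommitter and controlled by a standard Hadoop flag; the check below is a hedged sketch for ruling out a plain config difference, not a confirmed explanation of the 3.3.0 change:

{code:java}
// _SUCCESS is emitted when mapreduce.fileoutputcommitter.marksuccessfuljobs
// is true (the default), so verifying the effective value is a first step.
val hadoopConf = spark.sparkContext.hadoopConfiguration
println(hadoopConf.get("mapreduce.fileoutputcommitter.marksuccessfuljobs", "true"))
{code}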






[jira] [Updated] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode is dynamic

2023-08-30 Thread Dipayan Dev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dipayan Dev updated SPARK-44884:

Priority: Major  (was: Minor)

> Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode 
> is dynamic
> 
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dipayan Dev
>Priority: Major
> Attachments: image-2023-08-20-18-46-53-342.png, 
> image-2023-08-25-13-01-42-137.png
>
>
> The issue does not happen in Spark 2.x (I am using 2.4.0), only in 3.3.0 
> (tested with 3.4.1 as well).
> Code to reproduce the issue:
>  
> {code:java}
> scala> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") 
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", 
> "gs://test_bucket/table").mode("overwrite").partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1")
>  {code}
>  
> The above code succeeds and creates the external Hive table, but {*}no 
> SUCCESS file is generated{*}.
> Below is the content of the bucket after table creation:
> !image-2023-08-25-13-01-42-137.png|width=500,height=130!
> The same code, when run with Spark 2.4.0 (with or without an external path), 
> generates the SUCCESS file.
> {code:java}
> scala> 
> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1"){code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!
>  






[jira] [Resolved] (SPARK-45005) Reducing the CI time for slow pyspark-pandas-connect tests

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45005.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42719
[https://github.com/apache/spark/pull/42719]

> Reducing the CI time for slow pyspark-pandas-connect tests
> --
>
> Key: SPARK-45005
> URL: https://issues.apache.org/jira/browse/SPARK-45005
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Pandas API on Spark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> The pyspark-pandas-connect tests take more than 3 hours in GitHub Actions, so 
> we might need to reduce the execution time. See 
> https://github.com/apache/spark/actions/runs/5989124806/job/16245001034






[jira] [Assigned] (SPARK-45005) Reducing the CI time for slow pyspark-pandas-connect tests

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45005:
-

Assignee: Haejoon Lee

> Reducing the CI time for slow pyspark-pandas-connect tests
> --
>
> Key: SPARK-45005
> URL: https://issues.apache.org/jira/browse/SPARK-45005
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Pandas API on Spark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> The pyspark-pandas-connect tests take more than 3 hours in GitHub Actions, so 
> we might need to reduce the execution time. See 
> https://github.com/apache/spark/actions/runs/5989124806/job/16245001034






[jira] [Resolved] (SPARK-45021) Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45021.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 42739
[https://github.com/apache/spark/pull/42739]

> Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`
> --
>
> Key: SPARK-45021
> URL: https://issues.apache.org/jira/browse/SPARK-45021
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> SPARK-44475 already moved the relevant configuration to `sql/api/pom.xml`; 
> the configuration in the catalyst module is now unused.






[jira] [Assigned] (SPARK-45021) Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`

2023-08-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45021:
-

Assignee: Yang Jie

> Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`
> --
>
> Key: SPARK-45021
> URL: https://issues.apache.org/jira/browse/SPARK-45021
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> SPARK-44475 already moved the relevant configuration to `sql/api/pom.xml`; 
> the configuration in the catalyst module is now unused.






[jira] [Created] (SPARK-45022) Provide context for dataset API errors

2023-08-30 Thread Peter Toth (Jira)
Peter Toth created SPARK-45022:
--

 Summary: Provide context for dataset API errors
 Key: SPARK-45022
 URL: https://issues.apache.org/jira/browse/SPARK-45022
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Peter Toth









[jira] [Resolved] (SPARK-44239) Free memory allocated by large vectors when vectors are reset

2023-08-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44239.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41782
[https://github.com/apache/spark/pull/41782]

> Free memory allocated by large vectors when vectors are reset
> -
>
> Key: SPARK-44239
> URL: https://issues.apache.org/jira/browse/SPARK-44239
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Wan Kun
>Assignee: Wan Kun
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: image-2023-06-29-12-58-12-256.png, 
> image-2023-06-29-13-03-15-470.png
>
>
> When Spark reads a data file into a WritableColumnVector, the memory 
> allocated by the WritableColumnVectors is not freed until the 
> VectorizedColumnReader completes.
> Reusing the allocated array objects saves memory allocation time, but it 
> also holds on to too much unused memory after the current large vector 
> batch has been read.
> Add a memory reserve policy for this scenario that reuses the allocated 
> array object for small column vectors and frees the memory for huge column 
> vectors.
> !image-2023-06-29-12-58-12-256.png!!image-2023-06-29-13-03-15-470.png!
>  
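A minimal sketch of such a reserve policy (the threshold name and shape are assumptions, not the patch's actual code):

{code:java}
// Hedged sketch: keep small backing arrays across resets, release oversized
// ones so a single huge batch does not pin memory for the rest of the scan.
class VectorReservePolicy(freeThresholdBytes: Long) {
  /** Returns true if the vector's buffer should be freed on reset(). */
  def shouldFreeOnReset(allocatedBytes: Long): Boolean =
    allocatedBytes > freeThresholdBytes
}

val policy = new VectorReservePolicy(freeThresholdBytes = 16L * 1024 * 1024)
assert(!policy.shouldFreeOnReset(4096))              // small vector: reuse
assert(policy.shouldFreeOnReset(64L * 1024 * 1024))  // huge vector: free
{code}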






[jira] [Assigned] (SPARK-44239) Free memory allocated by large vectors when vectors are reset

2023-08-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44239:
---

Assignee: Wan Kun

> Free memory allocated by large vectors when vectors are reset
> -
>
> Key: SPARK-44239
> URL: https://issues.apache.org/jira/browse/SPARK-44239
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Wan Kun
>Assignee: Wan Kun
>Priority: Major
> Attachments: image-2023-06-29-12-58-12-256.png, 
> image-2023-06-29-13-03-15-470.png
>
>
> When Spark reads a data file into a WritableColumnVector, the memory 
> allocated by the WritableColumnVectors is not freed until the 
> VectorizedColumnReader completes.
> Reusing the allocated array objects saves memory allocation time, but it 
> also holds on to too much unused memory after the current large vector 
> batch has been read.
> Add a memory reserve policy for this scenario that reuses the allocated 
> array object for small column vectors and frees the memory for huge column 
> vectors.
> !image-2023-06-29-12-58-12-256.png!!image-2023-06-29-13-03-15-470.png!
>  






[jira] [Commented] (SPARK-45021) Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`

2023-08-30 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760404#comment-17760404
 ] 

Ignite TC Bot commented on SPARK-45021:
---

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/42739

> Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`
> --
>
> Key: SPARK-45021
> URL: https://issues.apache.org/jira/browse/SPARK-45021
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>
> SPARK-44475 already moved the relevant configuration to `sql/api/pom.xml`; 
> the configuration in the catalyst module is now unused.






[jira] [Created] (SPARK-45021) Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`

2023-08-30 Thread Yang Jie (Jira)
Yang Jie created SPARK-45021:


 Summary: Remove `antlr4-maven-plugin` configuration from 
`sql/catalyst/pom.xml`
 Key: SPARK-45021
 URL: https://issues.apache.org/jira/browse/SPARK-45021
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie


SPARK-44475 already moved the relevant configuration to `sql/api/pom.xml`; 
the configuration in the catalyst module is now unused.






[jira] [Commented] (SPARK-45019) Make workflow scala213 on container & clean env

2023-08-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760363#comment-17760363
 ] 

Hudson commented on SPARK-45019:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/42733

> Make workflow scala213 on container & clean env
> ---
>
> Key: SPARK-45019
> URL: https://issues.apache.org/jira/browse/SPARK-45019
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-45020) org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'default' not found (state=08S01,code=0)

2023-08-30 Thread Sruthi Mooriyathvariam (Jira)
Sruthi Mooriyathvariam created SPARK-45020:
--

 Summary: 
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 
'default' not found (state=08S01,code=0)
 Key: SPARK-45020
 URL: https://issues.apache.org/jira/browse/SPARK-45020
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Sruthi Mooriyathvariam
 Fix For: 3.1.0


An alert fires when a Spark 3.1 cluster is created using a metastore shared 
with Spark 2.4. The alert says the default database does not exist. This is 
misleading, so we need to suppress it.
In SessionCatalog.scala, the method requireDbExists() does not handle the case 
where db is the default database. Handling that case would suppress this 
misleading alert.
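A sketch of the guard the reporter proposes (simplified and self-contained; not Spark's actual SessionCatalog source — `databaseExists` stands in for the real metastore lookup):

{code:java}
// Hedged sketch of the proposed special case: skip the existence check for
// the default database so the misleading alert is never raised for it.
class NoSuchDatabaseException(db: String)
  extends Exception(s"Database '$db' not found")

def requireDbExists(db: String, databaseExists: String => Boolean): Unit = {
  if (db != "default" && !databaseExists(db)) {
    throw new NoSuchDatabaseException(db)
  }
}
{code}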






[jira] [Created] (SPARK-45019) Make workflow scala213 on container & clean env

2023-08-30 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-45019:
---

 Summary: Make workflow scala213 on container & clean env
 Key: SPARK-45019
 URL: https://issues.apache.org/jira/browse/SPARK-45019
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Created] (SPARK-45018) Add CalendarIntervalType to Python Client

2023-08-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45018:
-

 Summary: Add CalendarIntervalType to Python Client
 Key: SPARK-45018
 URL: https://issues.apache.org/jira/browse/SPARK-45018
 Project: Spark
  Issue Type: New Feature
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Created] (SPARK-45017) Add CalendarIntervalType to PySpark

2023-08-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45017:
-

 Summary: Add CalendarIntervalType to PySpark
 Key: SPARK-45017
 URL: https://issues.apache.org/jira/browse/SPARK-45017
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Commented] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`

2023-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760302#comment-17760302
 ] 

ASF GitHub Bot commented on SPARK-45015:


User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42735

> Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
> --
>
> Key: SPARK-45015
> URL: https://issues.apache.org/jira/browse/SPARK-45015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`

2023-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760301#comment-17760301
 ] 

ASF GitHub Bot commented on SPARK-45015:


User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42735

> Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
> --
>
> Key: SPARK-45015
> URL: https://issues.apache.org/jira/browse/SPARK-45015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Updated] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`

2023-08-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-45015:
--
Summary: Refine DocStrings of `try_{add, subtract, multiply, divide, avg, 
sum}`  (was: Refine DocString of `try_*` functions)

> Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
> --
>
> Key: SPARK-45015
> URL: https://issues.apache.org/jira/browse/SPARK-45015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-45016) Add missing `try_remote_functions` annotations

2023-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760298#comment-17760298
 ] 

ASF GitHub Bot commented on SPARK-45016:


User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42734

> Add missing `try_remote_functions` annotations
> --
>
> Key: SPARK-45016
> URL: https://issues.apache.org/jira/browse/SPARK-45016
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-45005) Reducing the CI time for slow pyspark-pandas-connect tests

2023-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760291#comment-17760291
 ] 

ASF GitHub Bot commented on SPARK-45005:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/42719

> Reducing the CI time for slow pyspark-pandas-connect tests
> --
>
> Key: SPARK-45005
> URL: https://issues.apache.org/jira/browse/SPARK-45005
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Pandas API on Spark, Tests
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The pyspark-pandas-connect tests take more than 3 hours in GitHub Actions, so 
> we might need to reduce the execution time. See 
> https://github.com/apache/spark/actions/runs/5989124806/job/16245001034






[jira] [Created] (SPARK-45016) Add missing `try_remote_functions` annotations

2023-08-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45016:
-

 Summary: Add missing `try_remote_functions` annotations
 Key: SPARK-45016
 URL: https://issues.apache.org/jira/browse/SPARK-45016
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0, 4.0.0
Reporter: Ruifeng Zheng









[jira] [Comment Edited] (SPARK-33628) Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the HiveClientImpl

2023-08-30 Thread Maxim Martynov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760281#comment-17760281
 ] 

Maxim Martynov edited comment on SPARK-33628 at 8/30/23 8:53 AM:
-

Fixed in SPARK-42480; this issue can be closed.


was (Author: JIRAUSER283764):
Fixed in SPARK-42480

> Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the 
> HiveClientImpl
> 
>
> Key: SPARK-33628
> URL: https://issues.apache.org/jira/browse/SPARK-33628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-12-02-16-57-43-619.png, 
> image-2020-12-03-14-38-19-221.png
>
>
> When partitions are tracked by the catalog, Spark computes all custom 
> partition locations, especially with dynamic partitions, when the field 
> staticPartitions is empty.
> The poor performance of the method listPartitions results in a long period 
> of no response at the driver.
> When reading 12253 partitions, the method getPartitionsByNames takes 2 
> seconds, while getPartitions takes 457 seconds, nearly 8 minutes.
> !image-2020-12-02-16-57-43-619.png|width=783,height=54!
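A sketch of the two Hive client calls being compared, assuming the org.apache.hadoop.hive.ql.metadata.Hive API named in the title (database and table names are hypothetical):

{code:java}
// Hedged sketch: list partition names cheaply, then fetch only those,
// instead of materializing every Partition object via getPartitions.
import org.apache.hadoop.hive.ql.metadata.Hive

val hive = Hive.get() // uses the HiveConf on the classpath
val table = hive.getTable("mydb", "mytable")

// Slow path reported here: ~457 seconds for 12253 partitions.
val all = hive.getPartitions(table)

// Fast path: ~2 seconds for the same partitions.
val limit: Short = -1 // no limit on the number of names returned
val names = hive.getPartitionNames("mydb", "mytable", limit)
val byNames = hive.getPartitionsByNames(table, names)
{code}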






[jira] (SPARK-33628) Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the HiveClientImpl

2023-08-30 Thread Maxim Martynov (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33628 ]


Maxim Martynov deleted comment on SPARK-33628:


was (Author: JIRAUSER283764):
Can anyone review this pull request?

> Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the 
> HiveClientImpl
> 
>
> Key: SPARK-33628
> URL: https://issues.apache.org/jira/browse/SPARK-33628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-12-02-16-57-43-619.png, 
> image-2020-12-03-14-38-19-221.png
>
>
> When partitions are tracked by the catalog, Spark computes all custom 
> partition locations, especially with dynamic partitions, when the field 
> staticPartitions is empty.
> The poor performance of the method listPartitions results in a long period 
> of no response at the driver.
> When reading 12253 partitions, the method getPartitionsByNames takes 2 
> seconds, while getPartitions takes 457 seconds, nearly 8 minutes.
> !image-2020-12-02-16-57-43-619.png|width=783,height=54!






[jira] [Commented] (SPARK-33628) Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the HiveClientImpl

2023-08-30 Thread Maxim Martynov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760281#comment-17760281
 ] 

Maxim Martynov commented on SPARK-33628:


Fixed in SPARK-42480

> Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the 
> HiveClientImpl
> 
>
> Key: SPARK-33628
> URL: https://issues.apache.org/jira/browse/SPARK-33628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-12-02-16-57-43-619.png, 
> image-2020-12-03-14-38-19-221.png
>
>
> When partitions are tracked by the catalog, Spark computes all custom 
> partition locations, especially with dynamic partitions, when the field 
> staticPartitions is empty.
> The poor performance of the method listPartitions results in a long period 
> of no response at the driver.
> When reading 12253 partitions, the method getPartitionsByNames takes 2 
> seconds, while getPartitions takes 457 seconds, nearly 8 minutes.
> !image-2020-12-02-16-57-43-619.png|width=783,height=54!






[jira] [Created] (SPARK-45015) Refine DocString of `try_*` functions

2023-08-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45015:
-

 Summary: Refine DocString of `try_*` functions
 Key: SPARK-45015
 URL: https://issues.apache.org/jira/browse/SPARK-45015
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng





