[jira] [Created] (SPARK-45027) Hide internal functions/variables in `pyspark.sql.functions` from auto-completion
Ruifeng Zheng created SPARK-45027:
----------------------------------

Summary: Hide internal functions/variables in `pyspark.sql.functions` from auto-completion
Key: SPARK-45027
URL: https://issues.apache.org/jira/browse/SPARK-45027
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
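As a minimal sketch of the mechanism such a ticket can use (this is an illustration only, not the patch that landed in `pyspark.sql.functions`): PEP 562 lets a module define `__dir__` so that underscore-prefixed internals stop showing up in auto-completion. The `_make_public_dir` helper and the sample namespace below are hypothetical.

```python
# Hypothetical sketch: hide internal names from auto-completion by
# filtering them out of a module-level __dir__ (PEP 562). Not Spark's
# actual implementation; it only illustrates the mechanism.

def _make_public_dir(namespace):
    """Return a __dir__ callable that lists only public names."""
    def __dir__():
        return sorted(name for name in namespace if not name.startswith("_"))
    return __dir__

# Example module namespace mixing public functions and internal helpers.
_namespace = {
    "col": lambda name: name,        # public API
    "lit": lambda value: value,      # public API
    "_invoke_function": None,        # internal helper, should stay hidden
    "_to_java_column": None,         # internal helper, should stay hidden
}

module_dir = _make_public_dir(_namespace)
print(module_dir())  # only the public names survive: ['col', 'lit']
```

In a real module, assigning such a function to the module's `__dir__` attribute is enough for IPython and most IDEs to stop offering the internal names.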
[jira] [Assigned] (SPARK-45024) Filter out some configs in Session Creation
[ https://issues.apache.org/jira/browse/SPARK-45024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-45024:
-------------------------------------

Assignee: Ruifeng Zheng

> Filter out some configs in Session Creation
> -------------------------------------------
>
> Key: SPARK-45024
> URL: https://issues.apache.org/jira/browse/SPARK-45024
> Project: Spark
> Issue Type: New Feature
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
[jira] [Resolved] (SPARK-45024) Filter out some configs in Session Creation
[ https://issues.apache.org/jira/browse/SPARK-45024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-45024.
-----------------------------------

Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 42741
[https://github.com/apache/spark/pull/42741]

> Filter out some configs in Session Creation
> -------------------------------------------
>
> Key: SPARK-45024
> URL: https://issues.apache.org/jira/browse/SPARK-45024
> Project: Spark
> Issue Type: New Feature
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 4.0.0
[jira] [Commented] (SPARK-45024) Filter out some configs in Session Creation
[ https://issues.apache.org/jira/browse/SPARK-45024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760710#comment-17760710 ]

Snoot.io commented on SPARK-45024:
----------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42741

> Filter out some configs in Session Creation
> -------------------------------------------
>
> Key: SPARK-45024
> URL: https://issues.apache.org/jira/browse/SPARK-45024
> Project: Spark
> Issue Type: New Feature
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Commented] (SPARK-45024) Filter out some configs in Session Creation
[ https://issues.apache.org/jira/browse/SPARK-45024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760709#comment-17760709 ]

Snoot.io commented on SPARK-45024:
----------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42741

> Filter out some configs in Session Creation
> -------------------------------------------
>
> Key: SPARK-45024
> URL: https://issues.apache.org/jira/browse/SPARK-45024
> Project: Spark
> Issue Type: New Feature
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760708#comment-17760708 ]

Snoot.io commented on SPARK-44940:
----------------------------------

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42667

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-44940
> URL: https://issues.apache.org/jira/browse/SPARK-44940
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0, 3.5.0, 4.0.0
> Reporter: Ivan Sadikov
> Priority: Major
>
> Follow-up on https://issues.apache.org/jira/browse/SPARK-40646.
> I found that JSON parsing is significantly slower due to exception creation in control flow. Also, some fields are not parsed correctly and an exception is thrown in certain cases:
> {code:java}
> Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>     at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct(rows.scala:51)
>     at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct$(rows.scala:51)
>     at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getStruct(rows.scala:195)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:590)
>     ... 39 more
> {code}
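The performance claim above ("slower due to exception creation in control flow") can be illustrated with a small self-contained micro-benchmark. This is not Spark code; it only demonstrates why raising an exception per malformed field on a hot path is costlier than checking first.

```python
# Illustrative micro-benchmark (not Spark code): exception-driven parsing
# of malformed values vs. a check-first approach. The Jira issue reports
# that partial-results JSON parsing slowed down because an exception was
# created for every bad field.
import timeit

# Empty strings stand in for "malformed" numeric fields.
values = ["123", "", "456", "", "789"] * 1000

def parse_with_exceptions():
    out = []
    for v in values:
        try:
            out.append(int(v))
        except ValueError:  # an exception object is built on every bad value
            out.append(None)
    return out

def parse_with_check():
    out = []
    for v in values:
        out.append(int(v) if v.isdigit() else None)  # no exception created
    return out

# Both strategies produce identical results; only failure handling differs.
assert parse_with_exceptions() == parse_with_check()

slow = timeit.timeit(parse_with_exceptions, number=50)
fast = timeit.timeit(parse_with_check, number=50)
print(f"exception-driven: {slow:.3f}s, check-first: {fast:.3f}s")
```

On typical CPython builds the exception-driven loop is measurably slower; the exact ratio depends on the fraction of malformed values.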
[jira] [Commented] (SPARK-45018) Add CalendarIntervalType to Python Client
[ https://issues.apache.org/jira/browse/SPARK-45018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760704#comment-17760704 ]

Snoot.io commented on SPARK-45018:
----------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/42743

> Add CalendarIntervalType to Python Client
> -----------------------------------------
>
> Key: SPARK-45018
> URL: https://issues.apache.org/jira/browse/SPARK-45018
> Project: Spark
> Issue Type: New Feature
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing
[ https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760700#comment-17760700 ]

Varun Nalla commented on SPARK-44900:
-------------------------------------

[~yxzhang] / [~yao] any update for us?

> Cached DataFrame keeps growing
> ------------------------------
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Varun Nalla
> Priority: Blocker
>
> Scenario:
> We have a Kafka streaming application where data lookups are performed by joining against another DataFrame that is cached; the caching strategy is MEMORY_AND_DISK.
> However, the size of the cached DataFrame keeps growing with every micro-batch the streaming application processes, and that is visible under the Storage tab.
> A similar Stack Overflow thread was already raised:
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing
[jira] [Created] (SPARK-45026) non-command spark.sql should support datatypes not compatible with arrow
Ruifeng Zheng created SPARK-45026:
----------------------------------

Summary: non-command spark.sql should support datatypes not compatible with arrow
Key: SPARK-45026
URL: https://issues.apache.org/jira/browse/SPARK-45026
Project: Spark
Issue Type: New Feature
Components: Connect
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-45026) spark.sql should support datatypes not compatible with arrow
[ https://issues.apache.org/jira/browse/SPARK-45026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-45026:
----------------------------------

Summary: spark.sql should support datatypes not compatible with arrow
(was: non-command spark.sql should support datatypes not compatible with arrow)

> spark.sql should support datatypes not compatible with arrow
> ------------------------------------------------------------
>
> Key: SPARK-45026
> URL: https://issues.apache.org/jira/browse/SPARK-45026
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Resolved] (SPARK-45012) CheckAnalysis should throw inlined plan in AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-45012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-45012.
---------------------------------

Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 42729
[https://github.com/apache/spark/pull/42729]

> CheckAnalysis should throw inlined plan in AnalysisException
> ------------------------------------------------------------
>
> Key: SPARK-45012
> URL: https://issues.apache.org/jira/browse/SPARK-45012
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Fix For: 4.0.0
[jira] [Commented] (SPARK-45025) Block manager write to memory store iterator should process thread interrupt
[ https://issues.apache.org/jira/browse/SPARK-45025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760677#comment-17760677 ]

Anish Shrigondekar commented on SPARK-45025:
--------------------------------------------

cc [~kabhwan] - PR here: [https://github.com/apache/spark/pull/42742] Thx

> Block manager write to memory store iterator should process thread interrupt
> ----------------------------------------------------------------------------
>
> Key: SPARK-45025
> URL: https://issues.apache.org/jira/browse/SPARK-45025
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Anish Shrigondekar
> Priority: Major
[jira] [Created] (SPARK-45025) Block manager write to memory store iterator should process thread interrupt
Anish Shrigondekar created SPARK-45025:
---------------------------------------

Summary: Block manager write to memory store iterator should process thread interrupt
Key: SPARK-45025
URL: https://issues.apache.org/jira/browse/SPARK-45025
Project: Spark
Issue Type: Task
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar
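The idea behind this ticket is that a long-running write loop consuming an iterator should periodically check whether the task has been interrupted, so a kill request actually takes effect. The sketch below is a Python analogy (the JVM equivalent is `Thread.interrupted()`); `write_to_store`, `TaskInterrupted`, and the `check_every` parameter are hypothetical names, not Spark's API.

```python
# Conceptual sketch of cooperative cancellation in a store-write loop.
# A threading.Event stands in for the JVM thread-interrupt flag.
import threading

class TaskInterrupted(Exception):
    """Raised when the cancel flag is observed mid-write."""

def write_to_store(values, store, cancel, check_every=100):
    """Append values to `store`, polling the cancel flag periodically."""
    for i, value in enumerate(values):
        if i % check_every == 0 and cancel.is_set():
            raise TaskInterrupted(f"interrupted after {i} records")
        store.append(value)
    return len(store)

store, cancel = [], threading.Event()
write_to_store(range(1000), store, cancel)
print(len(store))  # 1000, since no interrupt was requested

cancel.set()  # simulate a task-kill request
try:
    write_to_store(range(1000), [], cancel)
except TaskInterrupted as exc:
    print(exc)  # the write stops at the first flag check
```

Polling every N records rather than every record keeps the overhead of the check negligible while still bounding how long a kill request can be ignored.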
[jira] [Created] (SPARK-45024) Filter out some configs in Session Creation
Ruifeng Zheng created SPARK-45024:
----------------------------------

Summary: Filter out some configs in Session Creation
Key: SPARK-45024
URL: https://issues.apache.org/jira/browse/SPARK-45024
Project: Spark
Issue Type: New Feature
Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng
[jira] [Assigned] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
[ https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-45015:
-------------------------------------

Assignee: Ruifeng Zheng

> Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
> ----------------------------------------------------------------------
>
> Key: SPARK-45015
> URL: https://issues.apache.org/jira/browse/SPARK-45015
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
[jira] [Resolved] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
[ https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-45015.
-----------------------------------

Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 42735
[https://github.com/apache/spark/pull/42735]

> Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
> ----------------------------------------------------------------------
>
> Key: SPARK-45015
> URL: https://issues.apache.org/jira/browse/SPARK-45015
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-44971) [BUG Fix] PySpark StreamingQueryProgress fromJson
[ https://issues.apache.org/jira/browse/SPARK-44971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-44971.
----------------------------------

Fix Version/s: 3.5.1
Assignee: Wei Liu
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/42686

> [BUG Fix] PySpark StreamingQueryProgress fromJson
> --------------------------------------------------
>
> Key: SPARK-44971
> URL: https://issues.apache.org/jira/browse/SPARK-44971
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.5.0, 3.5.1
> Reporter: Wei Liu
> Assignee: Wei Liu
> Priority: Major
> Fix For: 3.5.1
[jira] [Commented] (SPARK-45014) Clean up fileserver when cleaning up files, jars and archives in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-45014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760658#comment-17760658 ]

GridGain Integration commented on SPARK-45014:
----------------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42731

> Clean up fileserver when cleaning up files, jars and archives in SparkContext
> -----------------------------------------------------------------------------
>
> Key: SPARK-45014
> URL: https://issues.apache.org/jira/browse/SPARK-45014
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> In SPARK-44348, we clean up SparkContext's added files, but we don't clean up the ones in the fileserver.
[jira] [Commented] (SPARK-45014) Clean up fileserver when cleaning up files, jars and archives in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-45014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760657#comment-17760657 ]

Ignite TC Bot commented on SPARK-45014:
---------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42731

> Clean up fileserver when cleaning up files, jars and archives in SparkContext
> -----------------------------------------------------------------------------
>
> Key: SPARK-45014
> URL: https://issues.apache.org/jira/browse/SPARK-45014
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> In SPARK-44348, we clean up SparkContext's added files, but we don't clean up the ones in the fileserver.
[jira] [Updated] (SPARK-45023) SPIP: Python Stored Procedures
[ https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allison Wang updated SPARK-45023:
---------------------------------

Shepherd: Hyukjin Kwon

> SPIP: Python Stored Procedures
> ------------------------------
>
> Key: SPARK-45023
> URL: https://issues.apache.org/jira/browse/SPARK-45023
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 4.0.0
> Reporter: Allison Wang
> Priority: Major
>
> Stored procedures are an extension of the ANSI SQL standard. They play a crucial role in improving the capabilities of SQL by encapsulating complex logic into reusable routines.
> This proposal aims to extend Spark SQL by introducing support for stored procedures, starting with Python as the procedural language. This addition will allow users to execute procedural programs, leveraging the programming constructs of Python to perform tasks with complex logic. Additionally, users can persist these procedural routines in catalogs such as HMS for future reuse. By providing this functionality, we intend to empower Spark users to seamlessly integrate Python routines into their SQL workflows.
> {*}SPIP{*}: [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]
[jira] [Created] (SPARK-45023) SPIP: Python Stored Procedures
Allison Wang created SPARK-45023:
---------------------------------

Summary: SPIP: Python Stored Procedures
Key: SPARK-45023
URL: https://issues.apache.org/jira/browse/SPARK-45023
Project: Spark
Issue Type: Improvement
Components: PySpark, SQL
Affects Versions: 4.0.0
Reporter: Allison Wang

Stored procedures are an extension of the ANSI SQL standard. They play a crucial role in improving the capabilities of SQL by encapsulating complex logic into reusable routines.

This proposal aims to extend Spark SQL by introducing support for stored procedures, starting with Python as the procedural language. This addition will allow users to execute procedural programs, leveraging the programming constructs of Python to perform tasks with complex logic. Additionally, users can persist these procedural routines in catalogs such as HMS for future reuse. By providing this functionality, we intend to empower Spark users to seamlessly integrate Python routines into their SQL workflows.

{*}SPIP{*}: [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]
[jira] [Commented] (SPARK-43299) JVM Client throw StreamingQueryException when error handling is implemented
[ https://issues.apache.org/jira/browse/SPARK-43299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760620#comment-17760620 ]

Yihong He commented on SPARK-43299:
-----------------------------------

[~hvanhovell] Thanks for the reminder! I will make sure it works.

> JVM Client throw StreamingQueryException when error handling is implemented
> ---------------------------------------------------------------------------
>
> Key: SPARK-43299
> URL: https://issues.apache.org/jira/browse/SPARK-43299
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
>
> Currently, the awaitTermination() method of the Connect JVM client's StreamingQuery won't throw an error when there is an exception.
> In the Python client this is handled directly by the client's error-handling framework, but no such framework exists in the JVM client right now.
> We should verify that this works once the JVM client adds one.
[jira] [Updated] (SPARK-44991) Spark json schema inference and fromJson api having inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nirav patel updated SPARK-44991:
--------------------------------

Description:
The Spark JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing many integers and empty strings as Long. That is okay behavior, since I don't set `primitivesAsString`, as I do want primitive type inference. I store `DataFrameA` into `TableA`.

Now, this inference behavior is not respected by the `fromJson` / `from_json` API when I am trying to write new data to `TableA`. That is, if I read a chunk of new input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to Long.

ps - `getStruct(TableA)` is a pseudo method that somehow returns the `struct` of TableA's schema, and `/path/to/more/data` is a new dataset in which some records have an empty string as the value for this field.

I think if the reader doesn't complain about empty strings during schema inference, it shouldn't complain on reading without inference either. Maybe treat empty as null, just as during schema inference. An empty string is a legal value for a String-type field but not for number-type fields, so I don't see any reason not to treat it as null. Another option is to add a reader option - treatEmptyAsNull - so it's more explicit?

ps - I marked this as a bug, but it may be better suited as an improvement.

was:
Spark json reader can infer datatype of a fields. I am ingesting millions of datapoints and generating a `DataFrameA`. what i notice that Schema inference mark datatype of a field with tons of Integers and Empty Strings as a Long. That is an okay behavior as I don't set `primitivesAsString` cause I do want primitive type inference. I store `DataFrameA` into `TableA` Now, this inference behavior is not respected by `fromJson` of `from_json` api when I am trying to write new data on `TableA`. Means, if I read a chunk of input data into using `spark.read.schema(fromJson(getStruct(TableA)).json('/path/to/more/data')` reader complains that EmptyString cannot be cast to Long . `getStruct(TableA)` is psuedo method that returns `struct` of TableA schema somehow. and `/path/to/more/data` have some value for this fields as an empty string. I think if reader doesnt complain about Empty string during schema inference it shouldn't complain either on reading without inference. May be treat Empty as Null just like during schema inference or at least give an additional option - treatEmptyAsNull so it's more explicit for application users? ps - i marked it as bug but could be more suited as improvements.

> Spark json schema inference and fromJson api having inconsistent behavior
> --------------------------------------------------------------------------
>
> Key: SPARK-44991
> URL: https://issues.apache.org/jira/browse/SPARK-44991
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2
> Reporter: nirav patel
> Priority: Major
>
> The Spark JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing many integers and empty strings as Long. That is okay behavior, since I don't set `primitivesAsString`, as I do want primitive type inference. I store `DataFrameA` into `TableA`.
> Now, this inference behavior is not respected by the `fromJson` / `from_json` API when I am trying to write new data to `TableA`. That is, if I read a chunk of new input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to Long.
> ps - `getStruct(TableA)` is a pseudo method that somehow returns the `struct` of TableA's schema, and `/path/to/more/data` is a new dataset in which some records have an empty string as the value for this field.
> I think if the reader doesn't complain about empty strings during schema inference, it shouldn't complain on reading without inference either. Maybe treat empty as null, just as during schema inference. An empty string is a legal value for a String-type field but not for number-type fields, so I don't see any reason not to treat it as null. Another option is to add a reader option - treatEmptyAsNull - so it's more explicit?
> ps - I marked this as a bug, but it may be better suited as an improvement.
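The behavior the reporter asks for can be sketched in plain Python (this is not an existing Spark option; `cast_long` and `treat_empty_as_null` are hypothetical names mirroring the proposed `treatEmptyAsNull` reader option): when a field is typed as a number, treat an empty string as null instead of failing the cast, just as inference implicitly did when it chose Long for the column.

```python
# Pure-Python sketch of "treat empty as null" for a Long-typed field.
# Not Spark code; it only illustrates the lenient vs. strict behaviors
# described in the ticket.
import json

def cast_long(raw, treat_empty_as_null=True):
    """Cast a raw JSON value to a long-like int, or None when lenient."""
    if raw == "":
        if treat_empty_as_null:
            return None  # lenient: matches what schema inference tolerated
        raise ValueError("empty string cannot be cast to Long")  # the reported failure
    return int(raw)

records = [json.loads(line) for line in ['{"num": 123}', '{"num": ""}']]

# Lenient read succeeds, consistent with inference:
assert [cast_long(r["num"]) for r in records] == [123, None]

# Strict read reproduces the reported error on the empty string:
try:
    [cast_long(r["num"], treat_empty_as_null=False) for r in records]
except ValueError as exc:
    print(exc)
```

The point of the sketch is simply that inference and subsequent schema-driven reads should apply the same casting rule, whichever rule is chosen.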
[jira] [Updated] (SPARK-45012) CheckAnalysis should throw inlined plan in AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-45012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Wang updated SPARK-45012:
-----------------------------

Affects Version/s: 4.0.0 (was: 3.5.0)

> CheckAnalysis should throw inlined plan in AnalysisException
> ------------------------------------------------------------
>
> Key: SPARK-45012
> URL: https://issues.apache.org/jira/browse/SPARK-45012
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
[jira] [Assigned] (SPARK-45016) Add missing `try_remote_functions` annotations
[ https://issues.apache.org/jira/browse/SPARK-45016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-45016:
-------------------------------------

Assignee: Ruifeng Zheng

> Add missing `try_remote_functions` annotations
> ----------------------------------------------
>
> Key: SPARK-45016
> URL: https://issues.apache.org/jira/browse/SPARK-45016
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
[jira] [Resolved] (SPARK-45016) Add missing `try_remote_functions` annotations
[ https://issues.apache.org/jira/browse/SPARK-45016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-45016.
-----------------------------------

Fix Version/s: 3.5.1
Resolution: Fixed

Issue resolved by pull request 42734
[https://github.com/apache/spark/pull/42734]

> Add missing `try_remote_functions` annotations
> ----------------------------------------------
>
> Key: SPARK-45016
> URL: https://issues.apache.org/jira/browse/SPARK-45016
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.5.1
[jira] [Assigned] (SPARK-45017) Add CalendarIntervalType to PySpark
[ https://issues.apache.org/jira/browse/SPARK-45017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-45017:
-------------------------------------

Assignee: Ruifeng Zheng

> Add CalendarIntervalType to PySpark
> -----------------------------------
>
> Key: SPARK-45017
> URL: https://issues.apache.org/jira/browse/SPARK-45017
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
[jira] [Resolved] (SPARK-45017) Add CalendarIntervalType to PySpark
[ https://issues.apache.org/jira/browse/SPARK-45017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-45017.
-----------------------------------

Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 42736
[https://github.com/apache/spark/pull/42736]

> Add CalendarIntervalType to PySpark
> -----------------------------------
>
> Key: SPARK-45017
> URL: https://issues.apache.org/jira/browse/SPARK-45017
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-44997) Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
[ https://issues.apache.org/jira/browse/SPARK-44997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-44997.
-----------------------------------

Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 42712
[https://github.com/apache/spark/pull/42712]

> Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
> ------------------------------------------------------------------------
>
> Key: SPARK-44997
> URL: https://issues.apache.org/jira/browse/SPARK-44997
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-44997) Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
[ https://issues.apache.org/jira/browse/SPARK-44997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-44997:
-------------------------------------

Assignee: BingKun Pan

> Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
> ------------------------------------------------------------------------
>
> Key: SPARK-44997
> URL: https://issues.apache.org/jira/browse/SPARK-44997
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
[jira] [Assigned] (SPARK-42304) Assign name to _LEGACY_ERROR_TEMP_2189
[ https://issues.apache.org/jira/browse/SPARK-42304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-42304:
-------------------------------------

Assignee: Valentin

> Assign name to _LEGACY_ERROR_TEMP_2189
> --------------------------------------
>
> Key: SPARK-42304
> URL: https://issues.apache.org/jira/browse/SPARK-42304
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Valentin
> Priority: Major
> Fix For: 4.0.0
[jira] [Updated] (SPARK-45005) Reducing the CI time for slow pyspark-pandas-connect tests
[ https://issues.apache.org/jira/browse/SPARK-45005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45005:
----------------------------------

Issue Type: Test (was: Bug)

> Reducing the CI time for slow pyspark-pandas-connect tests
> ----------------------------------------------------------
>
> Key: SPARK-45005
> URL: https://issues.apache.org/jira/browse/SPARK-45005
> Project: Spark
> Issue Type: Test
> Components: Connect, Pandas API on Spark, Tests
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Fix For: 4.0.0
>
> The pyspark-pandas-connect test takes more than 3 hours in GitHub Actions, so we might need to reduce the execution time. See https://github.com/apache/spark/actions/runs/5989124806/job/16245001034
[jira] [Updated] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dipayan Dev updated SPARK-44884: Priority: Minor (was: Critical) > Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode > is dynamic > > > Key: SPARK-44884 > URL: https://issues.apache.org/jira/browse/SPARK-44884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Dipayan Dev >Priority: Minor > Attachments: image-2023-08-20-18-46-53-342.png, > image-2023-08-25-13-01-42-137.png > > > The issue does not happen in Spark 2.x (I am using 2.4.0), but only in 3.3.0 > (tested with 3.4.1 as well). > Code to reproduce the issue: > > {code:java} > scala> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") > scala> val DF = Seq(("test1", 123)).toDF("name", "num") > scala> DF.write.option("path", > "gs://test_bucket/table").mode("overwrite").partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1") > {code} > > The above code succeeds and creates an external Hive table, but {*}there is no > SUCCESS file generated{*}. > Below is the content of the bucket after table creation: > !image-2023-08-25-13-01-42-137.png|width=500,height=130! > The same code, when run with Spark 2.4.0 (with or without an external path), > generates the SUCCESS file. > {code:java} > scala> > DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1"){code} > !image-2023-08-20-18-46-53-342.png|width=465,height=166! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44884) Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dipayan Dev updated SPARK-44884: Priority: Major (was: Minor) > Spark doesn't create SUCCESS file in Spark 3.3.0+ when partitionOverwriteMode > is dynamic > > > Key: SPARK-44884 > URL: https://issues.apache.org/jira/browse/SPARK-44884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Dipayan Dev >Priority: Major > Attachments: image-2023-08-20-18-46-53-342.png, > image-2023-08-25-13-01-42-137.png > > > The issue does not happen in Spark 2.x (I am using 2.4.0), but only in 3.3.0 > (tested with 3.4.1 as well). > Code to reproduce the issue: > > {code:java} > scala> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") > scala> val DF = Seq(("test1", 123)).toDF("name", "num") > scala> DF.write.option("path", > "gs://test_bucket/table").mode("overwrite").partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1") > {code} > > The above code succeeds and creates an external Hive table, but {*}there is no > SUCCESS file generated{*}. > Below is the content of the bucket after table creation: > !image-2023-08-25-13-01-42-137.png|width=500,height=130! > The same code, when run with Spark 2.4.0 (with or without an external path), > generates the SUCCESS file. > {code:java} > scala> > DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.test_tb1"){code} > !image-2023-08-20-18-46-53-342.png|width=465,height=166! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
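Until the committer behaviour is aligned across versions, a job that depends on the marker can write it itself after `saveAsTable` returns. A minimal sketch in plain Python, assuming a locally accessible table path; the helper name is illustrative and not part of Spark or Hadoop:

```python
import os

def write_success_marker(table_path: str) -> str:
    """Create the empty _SUCCESS marker that Hadoop's FileOutputCommitter
    normally writes at the table root after a successful job."""
    marker = os.path.join(table_path, "_SUCCESS")
    with open(marker, "w"):
        pass  # the marker protocol only requires an empty file to exist
    return marker
```

For an object store such as GCS the same idea applies, but the empty object would be created through the store's client rather than the local filesystem.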
[jira] [Resolved] (SPARK-45005) Reducing the CI time for slow pyspark-pandas-connect tests
[ https://issues.apache.org/jira/browse/SPARK-45005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45005. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42719 [https://github.com/apache/spark/pull/42719] > Reducing the CI time for slow pyspark-pandas-connect tests > -- > > Key: SPARK-45005 > URL: https://issues.apache.org/jira/browse/SPARK-45005 > Project: Spark > Issue Type: Bug > Components: Connect, Pandas API on Spark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 4.0.0 > > > pyspark-pandas-connect test takes more than 3 hours in Github Actions, so we > might need to reduce the execution time. See > https://github.com/apache/spark/actions/runs/5989124806/job/16245001034 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45005) Reducing the CI time for slow pyspark-pandas-connect tests
[ https://issues.apache.org/jira/browse/SPARK-45005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45005: - Assignee: Haejoon Lee > Reducing the CI time for slow pyspark-pandas-connect tests > -- > > Key: SPARK-45005 > URL: https://issues.apache.org/jira/browse/SPARK-45005 > Project: Spark > Issue Type: Bug > Components: Connect, Pandas API on Spark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > pyspark-pandas-connect test takes more than 3 hours in Github Actions, so we > might need to reduce the execution time. See > https://github.com/apache/spark/actions/runs/5989124806/job/16245001034 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45021) Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`
[ https://issues.apache.org/jira/browse/SPARK-45021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45021. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42739 [https://github.com/apache/spark/pull/42739] > Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml` > -- > > Key: SPARK-45021 > URL: https://issues.apache.org/jira/browse/SPARK-45021 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > SPARK-44475 has already moved the relevant configuration to > `sql/api/pom.xml`, the configuration in the catalyst module is unused now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45021) Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`
[ https://issues.apache.org/jira/browse/SPARK-45021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45021: - Assignee: Yang Jie > Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml` > -- > > Key: SPARK-45021 > URL: https://issues.apache.org/jira/browse/SPARK-45021 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > SPARK-44475 has already moved the relevant configuration to > `sql/api/pom.xml`, the configuration in the catalyst module is unused now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45022) Provide context for dataset API errors
Peter Toth created SPARK-45022: -- Summary: Provide context for dataset API errors Key: SPARK-45022 URL: https://issues.apache.org/jira/browse/SPARK-45022 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Peter Toth -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44239) Free memory allocated by large vectors when vectors are reset
[ https://issues.apache.org/jira/browse/SPARK-44239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44239. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 41782 [https://github.com/apache/spark/pull/41782] > Free memory allocated by large vectors when vectors are reset > - > > Key: SPARK-44239 > URL: https://issues.apache.org/jira/browse/SPARK-44239 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Wan Kun >Assignee: Wan Kun >Priority: Major > Fix For: 4.0.0 > > Attachments: image-2023-06-29-12-58-12-256.png, > image-2023-06-29-13-03-15-470.png > > > When Spark reads a data file into a WritableColumnVector, the memory > allocated by the WritableColumnVectors is not freed until the > VectorizedColumnReader completes. > Reusing the allocated array objects saves memory allocation time, > but it also holds on to too much unused memory after the current large vector > batch has been read. > Add a memory reserve policy for this scenario that reuses the allocated > array object for small column vectors and frees the memory for huge column > vectors. > !image-2023-06-29-12-58-12-256.png!!image-2023-06-29-13-03-15-470.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44239) Free memory allocated by large vectors when vectors are reset
[ https://issues.apache.org/jira/browse/SPARK-44239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44239: --- Assignee: Wan Kun > Free memory allocated by large vectors when vectors are reset > - > > Key: SPARK-44239 > URL: https://issues.apache.org/jira/browse/SPARK-44239 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Wan Kun >Assignee: Wan Kun >Priority: Major > Attachments: image-2023-06-29-12-58-12-256.png, > image-2023-06-29-13-03-15-470.png > > > When Spark reads a data file into a WritableColumnVector, the memory > allocated by the WritableColumnVectors is not freed until the > VectorizedColumnReader completes. > Reusing the allocated array objects saves memory allocation time, > but it also holds on to too much unused memory after the current large vector > batch has been read. > Add a memory reserve policy for this scenario that reuses the allocated > array object for small column vectors and frees the memory for huge column > vectors. > !image-2023-06-29-12-58-12-256.png!!image-2023-06-29-13-03-15-470.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
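The reserve policy described in SPARK-44239 can be sketched as follows: on reset, keep the backing buffer for a small vector so it is reused, and drop it for a huge vector so the memory can be reclaimed. This is a toy Python model under assumed names, not Spark's actual `WritableColumnVector` API, and the threshold value is arbitrary:

```python
class ReusableVector:
    """Toy column vector modelling the reserve policy: on reset, a small
    buffer is kept and reused, while a huge one is dropped so its memory
    can be reclaimed by the allocator."""

    def __init__(self, huge_threshold: int = 4096):
        self.huge_threshold = huge_threshold  # hypothetical cutoff, in elements
        self.data = []

    def append(self, value) -> None:
        self.data.append(value)

    def reset(self) -> None:
        if len(self.data) > self.huge_threshold:
            # Huge batch: drop the buffer entirely instead of keeping capacity.
            self.data = []
        else:
            # Small batch: keep the same buffer object and clear its contents,
            # standing in for reusing the already-allocated array.
            self.data.clear()
```

A caller can observe the difference: after a small batch, `reset()` leaves the same list object in place; after a huge batch, a fresh list replaces it.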
[jira] [Commented] (SPARK-45021) Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`
[ https://issues.apache.org/jira/browse/SPARK-45021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760404#comment-17760404 ] Ignite TC Bot commented on SPARK-45021: --- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/42739 > Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml` > -- > > Key: SPARK-45021 > URL: https://issues.apache.org/jira/browse/SPARK-45021 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > SPARK-44475 has already moved the relevant configuration to > `sql/api/pom.xml`, the configuration in the catalyst module is unused now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45021) Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml`
Yang Jie created SPARK-45021: Summary: Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml` Key: SPARK-45021 URL: https://issues.apache.org/jira/browse/SPARK-45021 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie SPARK-44475 has already moved the relevant configuration to `sql/api/pom.xml`, the configuration in the catalyst module is unused now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45019) Make workflow scala213 on container & clean env
[ https://issues.apache.org/jira/browse/SPARK-45019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760363#comment-17760363 ] Hudson commented on SPARK-45019: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42733 > Make workflow scala213 on container & clean env > --- > > Key: SPARK-45019 > URL: https://issues.apache.org/jira/browse/SPARK-45019 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45020) org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'default' not found (state=08S01,code=0)
Sruthi Mooriyathvariam created SPARK-45020: -- Summary: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'default' not found (state=08S01,code=0) Key: SPARK-45020 URL: https://issues.apache.org/jira/browse/SPARK-45020 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.0 Reporter: Sruthi Mooriyathvariam Fix For: 3.1.0 An alert fires when a Spark 3.1 cluster is created using a metastore shared with Spark 2.4. The alert says the default database does not exist; this is misleading, and we need to suppress it. In SessionCatalog.scala, the method requireDbExists() does not handle the case where db is the default database; handling that case there suppresses the misleading alert. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
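The fix this issue asks for amounts to a guard in the existence check. The names below (`require_db_exists`, the exception class, the set of databases) are illustrative stand-ins for the Scala code in SessionCatalog, not the real API:

```python
DEFAULT_DATABASE = "default"

class NoSuchDatabaseException(Exception):
    """Stand-in for Spark's NoSuchDatabaseException."""

def require_db_exists(db: str, existing_dbs: set) -> None:
    """Raise unless the database exists; the default database is always
    treated as present, so the misleading alert never fires for it."""
    if db == DEFAULT_DATABASE:
        return  # the guard proposed for the shared-metastore case
    if db not in existing_dbs:
        raise NoSuchDatabaseException(f"Database '{db}' not found")
```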
[jira] [Created] (SPARK-45019) Make workflow scala213 on container & clean env
BingKun Pan created SPARK-45019: --- Summary: Make workflow scala213 on container & clean env Key: SPARK-45019 URL: https://issues.apache.org/jira/browse/SPARK-45019 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45018) Add CalendarIntervalType to Python Client
Ruifeng Zheng created SPARK-45018: - Summary: Add CalendarIntervalType to Python Client Key: SPARK-45018 URL: https://issues.apache.org/jira/browse/SPARK-45018 Project: Spark Issue Type: New Feature Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45017) Add CalendarIntervalType to PySpark
Ruifeng Zheng created SPARK-45017: - Summary: Add CalendarIntervalType to PySpark Key: SPARK-45017 URL: https://issues.apache.org/jira/browse/SPARK-45017 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
[ https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760302#comment-17760302 ] ASF GitHub Bot commented on SPARK-45015: User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/42735 > Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}` > -- > > Key: SPARK-45015 > URL: https://issues.apache.org/jira/browse/SPARK-45015 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45015) Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}`
[ https://issues.apache.org/jira/browse/SPARK-45015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-45015: -- Summary: Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}` (was: Refine DocString of `try_*` functions) > Refine DocStrings of `try_{add, subtract, multiply, divide, avg, sum}` > -- > > Key: SPARK-45015 > URL: https://issues.apache.org/jira/browse/SPARK-45015 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45016) Add missing `try_remote_functions` annotations
[ https://issues.apache.org/jira/browse/SPARK-45016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760298#comment-17760298 ] ASF GitHub Bot commented on SPARK-45016: User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/42734 > Add missing `try_remote_functions` annotations > -- > > Key: SPARK-45016 > URL: https://issues.apache.org/jira/browse/SPARK-45016 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45005) Reducing the CI time for slow pyspark-pandas-connect tests
[ https://issues.apache.org/jira/browse/SPARK-45005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760291#comment-17760291 ] ASF GitHub Bot commented on SPARK-45005: User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/42719 > Reducing the CI time for slow pyspark-pandas-connect tests > -- > > Key: SPARK-45005 > URL: https://issues.apache.org/jira/browse/SPARK-45005 > Project: Spark > Issue Type: Bug > Components: Connect, Pandas API on Spark, Tests >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > pyspark-pandas-connect test takes more than 3 hours in Github Actions, so we > might need to reduce the execution time. See > https://github.com/apache/spark/actions/runs/5989124806/job/16245001034 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45016) Add missing `try_remote_functions` annotations
Ruifeng Zheng created SPARK-45016: - Summary: Add missing `try_remote_functions` annotations Key: SPARK-45016 URL: https://issues.apache.org/jira/browse/SPARK-45016 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0, 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33628) Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760281#comment-17760281 ] Maxim Martynov edited comment on SPARK-33628 at 8/30/23 8:53 AM: - Fixed in SPARK-42480, issue can be closed was (Author: JIRAUSER283764): Fixed in SPARK-42480 > Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the > HiveClientImpl > > > Key: SPARK-33628 > URL: https://issues.apache.org/jira/browse/SPARK-33628 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: jinhai >Priority: Major > Attachments: image-2020-12-02-16-57-43-619.png, > image-2020-12-03-14-38-19-221.png > > > When partitions are tracked by the catalog, all custom > partition locations are computed, especially with dynamic partitions, when the field > staticPartitions is empty. > The poor performance of the method listPartitions results in a long period > of no response at the Driver. > When reading 12253 partitions, the method getPartitionsByNames takes 2 seconds, > while getPartitions takes 457 seconds, nearly 8 minutes. > !image-2020-12-02-16-57-43-619.png|width=783,height=54! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-33628) Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33628 ] Maxim Martynov deleted comment on SPARK-33628: was (Author: JIRAUSER283764): Can anyone review this pull request? > Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the > HiveClientImpl > > > Key: SPARK-33628 > URL: https://issues.apache.org/jira/browse/SPARK-33628 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: jinhai >Priority: Major > Attachments: image-2020-12-02-16-57-43-619.png, > image-2020-12-03-14-38-19-221.png > > > When partitions are tracked by the catalog, all custom > partition locations are computed, especially with dynamic partitions, when the field > staticPartitions is empty. > The poor performance of the method listPartitions results in a long period > of no response at the Driver. > When reading 12253 partitions, the method getPartitionsByNames takes 2 seconds, > while getPartitions takes 457 seconds, nearly 8 minutes. > !image-2020-12-02-16-57-43-619.png|width=783,height=54! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33628) Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760281#comment-17760281 ] Maxim Martynov commented on SPARK-33628: Fixed in SPARK-42480 > Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the > HiveClientImpl > > > Key: SPARK-33628 > URL: https://issues.apache.org/jira/browse/SPARK-33628 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: jinhai >Priority: Major > Attachments: image-2020-12-02-16-57-43-619.png, > image-2020-12-03-14-38-19-221.png > > > When partitions are tracked by the catalog, all custom > partition locations are computed, especially with dynamic partitions, when the field > staticPartitions is empty. > The poor performance of the method listPartitions results in a long period > of no response at the Driver. > When reading 12253 partitions, the method getPartitionsByNames takes 2 seconds, > while getPartitions takes 457 seconds, nearly 8 minutes. > !image-2020-12-02-16-57-43-619.png|width=783,height=54! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
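The speedup SPARK-33628 reports comes from materialising only the partitions a query names instead of the whole listing. A toy Python model of the two paths, with a counter standing in for metastore work (`FakeMetastore` is hypothetical, not the Hive client API):

```python
class FakeMetastore:
    """Toy metastore that counts how many partition objects it materialises,
    mimicking the cost difference between Hive.getPartitions and
    Hive.getPartitionsByNames."""

    def __init__(self, partitions: dict):
        self.partitions = partitions  # partition name -> location
        self.materialised = 0

    def get_partitions(self) -> dict:
        # Old path: materialise every partition of the table.
        self.materialised += len(self.partitions)
        return dict(self.partitions)

    def get_partitions_by_names(self, names) -> dict:
        # New path: materialise only the partitions the query needs.
        found = {n: self.partitions[n] for n in names if n in self.partitions}
        self.materialised += len(found)
        return found
```

With thousands of partitions tracked but only a handful referenced by a query, the by-names path touches a few objects where the listing path touches them all, which matches the 2 s versus 457 s measurement quoted above.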
[jira] [Created] (SPARK-45015) Refine DocString of `try_*` functions
Ruifeng Zheng created SPARK-45015: - Summary: Refine DocString of `try_*` functions Key: SPARK-45015 URL: https://issues.apache.org/jira/browse/SPARK-45015 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org