[jira] [Created] (SPARK-40513) SPIP: Support Docker Official Image for Spark
Yikun Jiang created SPARK-40513: --- Summary: SPIP: Support Docker Official Image for Spark Key: SPARK-40513 URL: https://issues.apache.org/jira/browse/SPARK-40513 Project: Spark Issue Type: Bug Components: Kubernetes, PySpark, SparkR Affects Versions: 3.4.0 Reporter: Yikun Jiang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40512) Upgrade pandas to 1.5.0
[ https://issues.apache.org/jira/browse/SPARK-40512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607527#comment-17607527 ] Apache Spark commented on SPARK-40512: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/37955 > Upgrade pandas to 1.5.0 > --- > > Key: SPARK-40512 > URL: https://issues.apache.org/jira/browse/SPARK-40512 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > pandas 1.5.0 was released on Sep 19, 2022. > > We should update our infra and docs to support it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805 ] Joost Farla deleted comment on SPARK-34805: - was (Author: JIRAUSER295969): [~cloud_fan] I was running into the exact same issue using Spark v3.3.0. It looks like the fix was merged into the 3.3 branch (on March 21st), but was not yet released as part of v3.3. It is also not mentioned in the release notes. Is that possible? Thanks in advance! > PySpark loses metadata in DataFrame fields when selecting nested columns > > > Key: SPARK-34805 > URL: https://issues.apache.org/jira/browse/SPARK-34805 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1, 3.1.1 >Reporter: Mark Ressler >Priority: Major > Fix For: 3.3.0 > > Attachments: jsonMetadataTest.py, nested_columns_metadata.scala > > > For a DataFrame schema with nested StructTypes, where metadata is set for > fields in the schema, that metadata is lost when a DataFrame selects nested > fields. For example, suppose > {code:java} > df.schema.fields[0].dataType.fields[0].metadata > {code} > returns a non-empty dictionary, then > {code:java} > df.select('Field0.SubField0').schema.fields[0].metadata{code} > returns an empty dictionary, where "Field0" is the name of the first field in > the DataFrame and "SubField0" is the name of the first nested field under > "Field0". > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
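The metadata-dropping behavior described in SPARK-34805 can be sketched without a Spark cluster. The dict-based schema model and helper names below are hypothetical stand-ins for PySpark's StructType/StructField, illustrating what the 3.3.0 fix changes: when a nested field is selected up to the top level, its metadata should be carried over rather than reset to an empty dictionary.

```python
# PySpark-free sketch of SPARK-34805. Schemas are modeled as plain dicts;
# all names here are illustrative stand-ins, not PySpark's actual API.

def make_field(name, dtype, metadata=None, fields=None):
    return {"name": name, "type": dtype,
            "metadata": metadata or {}, "fields": fields or []}

def select_nested(schema, path, preserve_metadata=True):
    """Resolve 'Field0.SubField0' against a schema, mimicking df.select()."""
    parts = path.split(".")
    field = next(f for f in schema if f["name"] == parts[0])
    for part in parts[1:]:
        field = next(f for f in field["fields"] if f["name"] == part)
    # The buggy behavior dropped metadata when flattening a nested field;
    # the fix carries it over to the new top-level field.
    meta = field["metadata"] if preserve_metadata else {}
    return make_field(parts[-1], field["type"], meta)

schema = [make_field("Field0", "struct", fields=[
    make_field("SubField0", "string", metadata={"comment": "important"})])]

buggy = select_nested(schema, "Field0.SubField0", preserve_metadata=False)
fixed = select_nested(schema, "Field0.SubField0", preserve_metadata=True)
```

With the fix, `fixed["metadata"]` retains the original `{"comment": "important"}` dictionary, whereas the buggy path returns an empty one.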
[jira] [Assigned] (SPARK-40512) Upgrade pandas to 1.5.0
[ https://issues.apache.org/jira/browse/SPARK-40512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40512: Assignee: Apache Spark > Upgrade pandas to 1.5.0 > --- > > Key: SPARK-40512 > URL: https://issues.apache.org/jira/browse/SPARK-40512 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > pandas 1.5.0 was released on Sep 19, 2022. > > We should update our infra and docs to support it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40512) Upgrade pandas to 1.5.0
[ https://issues.apache.org/jira/browse/SPARK-40512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40512: Assignee: (was: Apache Spark) > Upgrade pandas to 1.5.0 > --- > > Key: SPARK-40512 > URL: https://issues.apache.org/jira/browse/SPARK-40512 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > pandas 1.5.0 was released on Sep 19, 2022. > > We should update our infra and docs to support it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607523#comment-17607523 ] CaoYu edited comment on SPARK-40502 at 9/21/22 6:07 AM: I am a teacher. I recently designed a basic Python-language course with a big-data track. PySpark is one of its practical cases, but the course only uses simple RDD code to do basic data processing, and reading from a JDBC data source is part of the course. Because the course is very basic, simple RDD code is suitable as an example; using DataFrames would require explaining much more, which is not friendly to novice students. DataFrames (Spark SQL) will be used in the advanced courses designed later. So I hope JDBC data extraction can be done through the RDD API. was (Author: javacaoyu): I am a teacher. I recently designed a basic Python-language course with a big-data track. PySpark is one of its practical cases, but the course only uses simple RDD code to do basic data processing, and reading from a JDBC data source is part of the course. DataFrames (Spark SQL) will be used in the advanced courses designed later. So I hope the RDD (datastream) API can have JDBC data source capability. > Support dataframe API use jdbc data source in PySpark > - > > Key: SPARK-40502 > URL: https://issues.apache.org/jira/browse/SPARK-40502 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.3.0 >Reporter: CaoYu >Priority: Major > > When using PySpark, I want to get data from a MySQL database, so I want to use JDBCRDD as in Java/Scala. > But that is not supported in PySpark. > > For some reasons, I can't use the DataFrame API, only the RDD (datastream) API, even though I know DataFrames can read from JDBC sources fairly well. > > So I want to implement functionality that lets PySpark use an RDD to get data from a JDBC source. > > *But I don't know if that is necessary for PySpark, so we can discuss it.* > > *If it is necessary for PySpark, I want to contribute it to Spark.* > *I hope this Jira task can be assigned to me, so I can start working to implement it.* > > *If not, please close this Jira task.* > > > *Thanks a lot.* > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
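For reference, the partitioned-read behavior an RDD-based JDBC reader would need is the same one Spark's DataFrame JDBC source derives from its `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options. The sketch below approximates how such bounds can be turned into one WHERE clause per partition; it is an illustration of the idea, not Spark's actual JDBCRelation code.

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) on `column` into per-partition WHERE clauses,
    roughly like Spark's JDBC column partitioning. Assumes integer bounds."""
    if num_partitions <= 1:
        return ["1=1"]  # a single partition scans everything
    stride = (upper - lower) // num_partitions or 1
    preds, current = [], lower
    for i in range(num_partitions):
        if i == 0:
            # first partition also picks up NULLs and anything below lower
            preds.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last partition is open-ended so no rows are lost
            preds.append(f"{column} >= {current}")
        else:
            preds.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return preds
```

Each predicate would then be appended to the user's query and executed by one partition's worker, e.g. `SELECT ... WHERE id >= 25 AND id < 50`.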
[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607524#comment-17607524 ] CaoYu commented on SPARK-40502: --- When I designed a Python Flink course, I found that PyFlink did not have the sum/min/minBy/max/maxBy operators, so I submitted PRs to the Flink community providing the Python implementations of these operators (FLINK-26609, FLINK-26728). So, again, if a JDBC data source for the RDD API is what PySpark needs, I'd love to implement it and have the time to do so. > Support dataframe API use jdbc data source in PySpark > - > > Key: SPARK-40502 > URL: https://issues.apache.org/jira/browse/SPARK-40502 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.3.0 >Reporter: CaoYu >Priority: Major > > When using PySpark, I want to get data from a MySQL database, so I want to use JDBCRDD as in Java/Scala. > But that is not supported in PySpark. > > For some reasons, I can't use the DataFrame API, only the RDD (datastream) API, even though I know DataFrames can read from JDBC sources fairly well. > > So I want to implement functionality that lets PySpark use an RDD to get data from a JDBC source. > > *But I don't know if that is necessary for PySpark, so we can discuss it.* > > *If it is necessary for PySpark, I want to contribute it to Spark.* > *I hope this Jira task can be assigned to me, so I can start working to implement it.* > > *If not, please close this Jira task.* > > > *Thanks a lot.* > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40512) Upgrade pandas to 1.5.0
Haejoon Lee created SPARK-40512: --- Summary: Upgrade pandas to 1.5.0 Key: SPARK-40512 URL: https://issues.apache.org/jira/browse/SPARK-40512 Project: Spark Issue Type: Improvement Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee pandas 1.5.0 was released on Sep 19, 2022. We should update our infra and docs to support it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607523#comment-17607523 ] CaoYu commented on SPARK-40502: --- I am a teacher. I recently designed a basic Python-language course with a big-data track. PySpark is one of its practical cases, but the course only uses simple RDD code to do basic data processing, and reading from a JDBC data source is part of the course. DataFrames (Spark SQL) will be used in the advanced courses designed later. So I hope the RDD (datastream) API can have JDBC data source capability. > Support dataframe API use jdbc data source in PySpark > - > > Key: SPARK-40502 > URL: https://issues.apache.org/jira/browse/SPARK-40502 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.3.0 >Reporter: CaoYu >Priority: Major > > When using PySpark, I want to get data from a MySQL database, so I want to use JDBCRDD as in Java/Scala. > But that is not supported in PySpark. > > For some reasons, I can't use the DataFrame API, only the RDD (datastream) API, even though I know DataFrames can read from JDBC sources fairly well. > > So I want to implement functionality that lets PySpark use an RDD to get data from a JDBC source. > > *But I don't know if that is necessary for PySpark, so we can discuss it.* > > *If it is necessary for PySpark, I want to contribute it to Spark.* > *I hope this Jira task can be assigned to me, so I can start working to implement it.* > > *If not, please close this Jira task.* > > > *Thanks a lot.* > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40511) Upgrade slf4j to 2.x
[ https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40511: Assignee: Apache Spark > Upgrade slf4j to 2.x > > > Key: SPARK-40511 > URL: https://issues.apache.org/jira/browse/SPARK-40511 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > https://www.slf4j.org/news.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40511) Upgrade slf4j to 2.x
[ https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40511: Assignee: (was: Apache Spark) > Upgrade slf4j to 2.x > > > Key: SPARK-40511 > URL: https://issues.apache.org/jira/browse/SPARK-40511 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40511) Upgrade slf4j to 2.x
[ https://issues.apache.org/jira/browse/SPARK-40511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607522#comment-17607522 ] Apache Spark commented on SPARK-40511: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37844 > Upgrade slf4j to 2.x > > > Key: SPARK-40511 > URL: https://issues.apache.org/jira/browse/SPARK-40511 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40511) Upgrade slf4j to 2.x
Yang Jie created SPARK-40511: Summary: Upgrade slf4j to 2.x Key: SPARK-40511 URL: https://issues.apache.org/jira/browse/SPARK-40511 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 3.4.0 Reporter: Yang Jie https://www.slf4j.org/news.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
[ https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40496: --- Assignee: Ivan Sadikov > Configs to control "enableDateTimeParsingFallback" are incorrectly swapped > -- > > Key: SPARK-40496 > URL: https://issues.apache.org/jira/browse/SPARK-40496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40496) Configs to control "enableDateTimeParsingFallback" are incorrectly swapped
[ https://issues.apache.org/jira/browse/SPARK-40496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40496. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37942 [https://github.com/apache/spark/pull/37942] > Configs to control "enableDateTimeParsingFallback" are incorrectly swapped > -- > > Key: SPARK-40496 > URL: https://issues.apache.org/jira/browse/SPARK-40496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
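The class of bug behind SPARK-40496 is easy to state in miniature: two per-format flags are read, but each reader is handed the other format's value. A hypothetical sketch follows; the config key names are assumptions inferred from the issue title, not verified Spark configuration names.

```python
# Assumed key names, for illustration only.
CSV_KEY = "spark.sql.legacy.csv.enableDateTimeParsingFallback"
JSON_KEY = "spark.sql.legacy.json.enableDateTimeParsingFallback"

def fallback_flags(conf, swapped=False):
    """Return the fallback flag each reader should see.

    swapped=True reproduces the bug: the CSV reader receives the JSON
    config's value and vice versa."""
    csv_flag = conf.get(CSV_KEY, True)
    json_flag = conf.get(JSON_KEY, True)
    if swapped:
        return {"csv": json_flag, "json": csv_flag}
    return {"csv": csv_flag, "json": json_flag}
```

The symptom of the swap: disabling the CSV fallback silently disables the JSON one instead, which is exactly why such wiring bugs are worth a regression test per format.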
[jira] [Updated] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 王俊博 updated SPARK-40506: Description: The Spark StreamingSource metrics sourceName is inappropriate. The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; the Spark app name is not needed. This makes it hard to use metrics for different Spark applications over time, and it makes the metrics sourceName convention inconsistent. {code:java} // code placeholder private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And, for example, other metrics sourceNames don't include the appName. {code:java} // code placeholder private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} was: The Spark StreamingSource metrics sourceName is inappropriate. The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; the Spark app name is not needed. This makes it hard to use metrics for different Spark applications over time, and it makes the metrics sourceName convention inconsistent. {code:java} // code placeholder private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And, for example, other metrics sourceNames don't include the appName. {code:java} // code placeholder private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > The Spark StreamingSource metrics sourceName is inappropriate. The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime` > instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; > the Spark app name is not needed. > This makes it hard to use metrics for different Spark applications over time, > and it makes the metrics sourceName convention inconsistent. > {code:java} > // code placeholder > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And, for example, other metrics sourceNames don't include the appName. > {code:java} > // code placeholder > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
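The naming difference being proposed can be shown with plain string formatting. The full metric-name layout below (`<appId>.driver.<sourceName>.<metricName>`) is an approximation of how Spark's MetricsSystem composes registry names, used here only to show that embedding the app name in sourceName makes the label vary per application:

```python
def source_name(app_name, include_app_name):
    """Current behavior embeds the app name; the proposal drops it."""
    return f"{app_name}.StreamingMetrics" if include_app_name else "StreamingMetrics"

def full_metric_name(app_id, source, metric):
    # Approximation of Spark's <appId>.driver.<sourceName>.<metricName> layout.
    return f"{app_id}.driver.{source}.{metric}"

metric = "streaming.lastCompletedBatch_processingEndTime"
current = full_metric_name("application_x", source_name("NetworkWordCount", True), metric)
proposed = full_metric_name("application_x", source_name("NetworkWordCount", False), metric)
```

The `proposed` name is stable across differently named applications, which is what makes dashboards and alerting rules reusable over time.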
[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607489#comment-17607489 ] Apache Spark commented on SPARK-40332: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37954 > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40332) Implement `GroupBy.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607488#comment-17607488 ] Apache Spark commented on SPARK-40332: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37954 > Implement `GroupBy.quantile`. > - > > Key: SPARK-40332 > URL: https://issues.apache.org/jira/browse/SPARK-40332 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `GroupBy.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40501) Enhance 'SpecialLimits' to support project(..., limit(...))
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40501: Summary: Enhance 'SpecialLimits' to support project(..., limit(...)) (was: Add PushProjectionThroughLimit for Optimizer) > Enhance 'SpecialLimits' to support project(..., limit(...)) > --- > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > h4. The query takes a very long time to return; it was still running after 20 minutes... > when running the following code in spark-shell: > spark.sql("select * from xxx where event_day = '20220919' limit 1").show() > [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] > [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40510) Implement `ddof` in `Series.cov`
[ https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607483#comment-17607483 ] Apache Spark commented on SPARK-40510: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37953 > Implement `ddof` in `Series.cov` > > > Key: SPARK-40510 > URL: https://issues.apache.org/jira/browse/SPARK-40510 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40510) Implement `ddof` in `Series.cov`
[ https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40510: Assignee: Apache Spark > Implement `ddof` in `Series.cov` > > > Key: SPARK-40510 > URL: https://issues.apache.org/jira/browse/SPARK-40510 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40510) Implement `ddof` in `Series.cov`
[ https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607481#comment-17607481 ] Apache Spark commented on SPARK-40510: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37953 > Implement `ddof` in `Series.cov` > > > Key: SPARK-40510 > URL: https://issues.apache.org/jira/browse/SPARK-40510 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40510) Implement `ddof` in `Series.cov`
[ https://issues.apache.org/jira/browse/SPARK-40510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40510: Assignee: (was: Apache Spark) > Implement `ddof` in `Series.cov` > > > Key: SPARK-40510 > URL: https://issues.apache.org/jira/browse/SPARK-40510 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40510) Implement `ddof` in `Series.cov`
Ruifeng Zheng created SPARK-40510: - Summary: Implement `ddof` in `Series.cov` Key: SPARK-40510 URL: https://issues.apache.org/jira/browse/SPARK-40510 Project: Spark Issue Type: Sub-task Components: ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40491. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37937 [https://github.com/apache/spark/pull/37937] > Remove too old TODO for JdbcRDD > --- > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Trivial > Fix For: 3.4.0 > > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit if this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create jdbc RDD. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40491: Assignee: jiaan.geng > Remove too old TODO for JdbcRDD > --- > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Trivial > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit if this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create jdbc RDD. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40500. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37947 [https://github.com/apache/spark/pull/37947] > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
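The rename behind SPARK-40500 is mechanical: `iteritems` was deprecated in pandas 1.5.0 in favor of the long-available `items`, which yields the same (column name, Series) pairs. A minimal before/after, assuming pandas is installed:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Old spelling: df.iteritems() — deprecated in pandas 1.5.0.
# New spelling, available long before the deprecation:
col_sums = {name: series.sum() for name, series in df.items()}
```

Since `items()` predates the deprecation by many releases, the swap keeps the codebase compatible with both old and new pandas.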
[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40500: Assignee: Ruifeng Zheng > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40499: - Priority: Major (was: Blocker) > Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop: 3.0.0 > spark: 2.4.0 / 3.2.1 > shuffle: spark 2.4.0 >Reporter: xuanzhiang >Priority: Major > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used Spark 2.4.0, aggregation used objHashAggregator and stage 2 pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old shuffle, 140M of shuffle data takes 3 hours. > * If we upgrade the shuffle, will we get a performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
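For intuition about what the query above computes, the exact percentiles can be reproduced with the Python standard library. Spark's PERCENTILE_APPROX trades this exactness for a bounded-memory sketch, which is why the aggregation strategy the planner picks matters so much at scale. A small illustration with made-up cost values:

```python
from statistics import quantiles

costs = [1, 2, 5, 7, 9, 12, 40, 41, 43, 100]

# Exact counterparts of cost_p50 / cost_p90 from the query; Spark's
# PERCENTILE_APPROX would return approximations of these computed from a
# compact summary structure instead of the full sorted data.
pct = quantiles(costs, n=100, method="inclusive")
cost_p50, cost_p90 = pct[49], pct[89]
```

`method="inclusive"` interpolates over the sorted values, matching the usual "percentile of observed data" definition.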
[jira] [Commented] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607447#comment-17607447 ] Hyukjin Kwon commented on SPARK-40502: -- {quote} For some reasons, i can't using DataFrame API, only can use RDD(datastream) API. {quote} What's the reason? > Support dataframe API use jdbc data source in PySpark > - > > Key: SPARK-40502 > URL: https://issues.apache.org/jira/browse/SPARK-40502 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.3.0 >Reporter: CaoYu >Priority: Major > > When i using pyspark, i wanna get data from mysql database. so i want use > JDBCRDD like java\scala. > But that is not be supported in PySpark. > > For some reasons, i can't using DataFrame API, only can use RDD(datastream) > API. Even i know the DataFrame can get data from jdbc source fairly well. > > So i want to implement functionality that can use rdd to get data from jdbc > source for PySpark. > > *But i don't know if that are necessary for PySpark. so we can discuss it.* > > {*}If it is necessary for PySpark{*}{*}, i want to contribute to Spark.{*} > *i hope this Jira task can assigned to me, so i can start working to > implement it.* > > *if not, please close this Jira task.* > > > *thanks a lot.* > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40509) Construct an example of applyInPandasWithState in examples directory
Jungtaek Lim created SPARK-40509: Summary: Construct an example of applyInPandasWithState in examples directory Key: SPARK-40509 URL: https://issues.apache.org/jira/browse/SPARK-40509 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim Since we are introducing a new API (applyInPandasWithState) in PySpark, it is worth having a separate, full example of the API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning
[ https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607384#comment-17607384 ] Apache Spark commented on SPARK-40508: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/37952 > Treat unknown partitioning as UnknownPartitioning > - > > Key: SPARK-40508 > URL: https://issues.apache.org/jira/browse/SPARK-40508 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ted Yu >Priority: Major > > When running spark application against spark 3.3, I see the following : > {code} > java.lang.IllegalArgumentException: Unsupported data source V2 partitioning > type: CustomPartitioning > at > org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46) > at > org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > {code} > The CustomPartitioning works fine with Spark 3.2.1 > This PR proposes to relax the code and treat all unknown partitioning the > same way as that for UnknownPartitioning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning
[ https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40508: Assignee: (was: Apache Spark) > Treat unknown partitioning as UnknownPartitioning > - > > Key: SPARK-40508 > URL: https://issues.apache.org/jira/browse/SPARK-40508 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ted Yu >Priority: Major > > When running spark application against spark 3.3, I see the following : > {code} > java.lang.IllegalArgumentException: Unsupported data source V2 partitioning > type: CustomPartitioning > at > org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46) > at > org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > {code} > The CustomPartitioning works fine with Spark 3.2.1 > This PR proposes to relax the code and treat all unknown partitioning the > same way as that for UnknownPartitioning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning
[ https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40508: Assignee: Apache Spark > Treat unknown partitioning as UnknownPartitioning > - > > Key: SPARK-40508 > URL: https://issues.apache.org/jira/browse/SPARK-40508 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Major > > When running spark application against spark 3.3, I see the following : > {code} > java.lang.IllegalArgumentException: Unsupported data source V2 partitioning > type: CustomPartitioning > at > org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46) > at > org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > {code} > The CustomPartitioning works fine with Spark 3.2.1 > This PR proposes to relax the code and treat all unknown partitioning the > same way as that for UnknownPartitioning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning
[ https://issues.apache.org/jira/browse/SPARK-40508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-40508: --- Description: When running spark application against spark 3.3, I see the following : {code} java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) {code} The CustomPartitioning works fine with Spark 3.2.1 This PR proposes to relax the code and treat all unknown partitioning the same way as that for UnknownPartitioning. was: When running spark application against spark 3.3, I see the following : ``` java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) ``` The CustomPartitioning works fine with Spark 3.2.1 This PR proposes to relax the code and treat all unknown partitioning the same way as that for UnknownPartitioning. 
> Treat unknown partitioning as UnknownPartitioning > - > > Key: SPARK-40508 > URL: https://issues.apache.org/jira/browse/SPARK-40508 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ted Yu >Priority: Major > > When running spark application against spark 3.3, I see the following : > {code} > java.lang.IllegalArgumentException: Unsupported data source V2 partitioning > type: CustomPartitioning > at > org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46) > at > org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > {code} > The CustomPartitioning works fine with Spark 3.2.1 > This PR proposes to relax the code and treat all unknown partitioning the > same way as that for UnknownPartitioning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40508) Treat unknown partitioning as UnknownPartitioning
Ted Yu created SPARK-40508: -- Summary: Treat unknown partitioning as UnknownPartitioning Key: SPARK-40508 URL: https://issues.apache.org/jira/browse/SPARK-40508 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Ted Yu When running spark application against spark 3.3, I see the following : ``` java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) ``` The CustomPartitioning works fine with Spark 3.2.1 This PR proposes to relax the code and treat all unknown partitioning the same way as that for UnknownPartitioning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
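The relaxation this issue proposes can be illustrated with a small stand-alone sketch. The class names below are Python stand-ins for the Scala types involved in V2ScanPartitioning, not Spark's actual API: the point is that an unrecognized connector-defined Partitioning subclass falls back to UnknownPartitioning instead of raising, which is where the IllegalArgumentException in the stack trace came from:

```python
class Partitioning:                        # stand-in for the DSv2 interface
    pass

class KeyGroupedPartitioning(Partitioning):  # a kind the rule understands
    def __init__(self, keys):
        self.keys = keys

class CustomPartitioning(Partitioning):      # a connector-defined subclass
    pass

class UnknownPartitioning:                   # catalyst-side fallback
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

def convert(partitioning, num_partitions):
    """Before the fix the final branch raised for any unrecognized
    Partitioning subclass; after it, everything unknown maps to
    UnknownPartitioning and planning proceeds without the optimization."""
    if isinstance(partitioning, KeyGroupedPartitioning):
        return ("key_grouped", partitioning.keys)
    return UnknownPartitioning(num_partitions)

result = convert(CustomPartitioning(), 8)
```

The cost of the fallback is only that Spark cannot exploit the custom partitioning for shuffle avoidance; the query still runs, which matches the Spark 3.2.1 behavior the reporter relied on.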
[jira] [Resolved] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`
[ https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura resolved SPARK-40477. --- Resolution: Won't Fix Gave it another thought and decided to close this one as Won't Fix. There is no natural code path that calls ColumnarBatchRow.get() for NullType columns; in particular, NullType cannot be stored as a partition in a columnar format like Parquet. > Support `NullType` in `ColumnarBatchRow` > > > Key: SPARK-40477 > URL: https://issues.apache.org/jira/browse/SPARK-40477 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > `ColumnarBatchRow.get()` does not support `NullType` currently. Support > `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column > type. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40416) Add error classes for subquery expression CheckAnalysis failures
[ https://issues.apache.org/jira/browse/SPARK-40416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-40416. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37840 [https://github.com/apache/spark/pull/37840] > Add error classes for subquery expression CheckAnalysis failures > > > Key: SPARK-40416 > URL: https://issues.apache.org/jira/browse/SPARK-40416 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40416) Add error classes for subquery expression CheckAnalysis failures
[ https://issues.apache.org/jira/browse/SPARK-40416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-40416: -- Assignee: Daniel > Add error classes for subquery expression CheckAnalysis failures > > > Key: SPARK-40416 > URL: https://issues.apache.org/jira/browse/SPARK-40416 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40507) Spark creates an optional columns in hive table for fields that are not null
[ https://issues.apache.org/jira/browse/SPARK-40507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anil Dasari updated SPARK-40507: Description: Dataframe saveAsTable sets all columns as optional/nullable while creating the table here [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531] (`outputColumns.toStructType.asNullable`) This makes the source Parquet schema and the Hive table schema not match, which is problematic when a large-DataFrame pipeline uses Hive as temporary storage to avoid memory pressure. Hive 3.x supports NOT NULL constraints on table columns. Please add support for NOT NULL constraints on Spark SQL Hive tables. was: Dataframe saveAsTable sets all columns as optional/nullable while creating the table here [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531] (`outputColumns.toStructType.asNullable`) This makes source parquet schema and hive table schema doesn't match and is problematic when large dataframe(s) process uses hive as temporary storage to avoid the memory pressure. Hive 3.x supports non null constraints on table columns. Please add support non null constraints on Spark sql hive table.
> Spark creates an optional columns in hive table for fields that are not null > > > Key: SPARK-40507 > URL: https://issues.apache.org/jira/browse/SPARK-40507 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Anil Dasari >Priority: Major > > Dataframe saveAsTable sets all columns as optional/nullable while creating > the table here > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531] > (`outputColumns.toStructType.asNullable`) > This makes the source Parquet schema and the Hive table schema not match, > which is problematic when a large-DataFrame pipeline uses Hive as temporary > storage to avoid memory pressure. > Hive 3.x supports NOT NULL constraints on table columns. Please add support > for NOT NULL constraints on Spark SQL Hive tables. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
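The nullability flattening the report points at can be modeled without Spark at all. In the sketch below, a schema is just a list of (name, dtype, nullable) tuples, with a nested list standing in for a StructType; the function loosely mirrors what StructType.asNullable does to the output columns in saveAsTable (names and representation are purely illustrative):

```python
def as_nullable(fields):
    """Recursively mark every field nullable, discarding the original
    nullability flags, like StructType.asNullable does."""
    result = []
    for name, dtype, nullable in fields:
        if isinstance(dtype, list):          # nested struct: recurse
            dtype = as_nullable(dtype)
        result.append((name, dtype, True))   # original `nullable` is dropped here
    return result

# A schema whose fields are all declared NOT NULL...
schema = [("id", "long", False),
          ("payload", [("k", "string", False)], False)]
# ...comes out fully nullable, which is the mismatch the report describes.
nullable_schema = as_nullable(schema)
```

This is why the written Hive table cannot carry the NOT NULL constraints of the source Parquet schema, regardless of what the DataFrame declared.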
[jira] [Created] (SPARK-40507) Spark creates an optional columns in hive table for fields that are not null
Anil Dasari created SPARK-40507: --- Summary: Spark creates an optional columns in hive table for fields that are not null Key: SPARK-40507 URL: https://issues.apache.org/jira/browse/SPARK-40507 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Anil Dasari Dataframe saveAsTable sets all columns as optional/nullable while creating the table here [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531] (`outputColumns.toStructType.asNullable`) This makes source parquet schema and hive table schema doesn't match and is problematic when large dataframe(s) process uses hive as temporary storage to avoid the memory pressure. Hive 3.x supports non null constraints on table columns. Please add support non null constraints on Spark sql hive table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:23 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could be non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought nullOnOverflow was controlled by {{spark.sql.ansi.enabled}}. I tried to achieve the desired behavior by altering it (but to no avail). Could we mention setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} in the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that nullOnOverflow is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. 
However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Steps to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), execute the following using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since 
hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: In
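The overflow itself is easy to reproduce with Python's stdlib decimal module. The sketch below loosely models the Decimal.toPrecision logic cited above, covering both observed behaviors: return None (the DataFrame path, which stored NULL) or raise (the spark-sql path). The wide value is a made-up 21-digit example, since the actual value is truncated in this report:

```python
from decimal import Decimal, ROUND_HALF_UP

def to_precision(value, precision, scale, null_on_overflow):
    """Loose Python model of Spark's Decimal.toPrecision: round to the target
    scale, then signal overflow if the result needs more total digits than
    the declared precision allows."""
    rounded = value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    if len(rounded.as_tuple().digits) > precision:
        if null_on_overflow:
            return None                      # DataFrame path: stores NULL
        raise ArithmeticError(               # spark-sql path: fails the insert
            f"{value} cannot be represented as Decimal({precision}, {scale})")
    return rounded

# 11 integer digits + 10 fraction digits = 21 digits: one more than
# DECIMAL(20,10) allows, so this value overflows.
wide = Decimal("12345678901.2222222222")
```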
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:23 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought nullOnOverflow is controlled by \{{spark.sql.ansi.enabled.}} I tried to achieve the desired behavior by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought nullOnOverflow is controlled by {{spark.sql.ansi.enabled. }}I tried to achieve the desired behavior by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. 
However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since 
hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, >
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:22 PM: --- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that nullOnOverflow is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? was (Author: JIRAUSER288838): [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to LEGACY works. However, I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy }}would work. For instance, after inspecting the code, I thought that nullOnOverflow is controlled by {{spark.sql.ansi.enabled. I}} tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. 
However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. Step to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > Execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22" > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at :27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > ++ > | c1| > ++ > |null| > ++ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since 
hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precisi
[jira] [Comment Edited] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607314#comment-17607314 ] xsys edited comment on SPARK-40439 at 9/20/22 5:22 PM:
---
[~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} works. However, I believe it could be non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} is the relevant setting. For instance, after inspecting the code, I thought that {{nullOnOverflow}} was controlled by {{spark.sql.ansi.enabled}}, and I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the suggestion of setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message?

> DECIMAL value with more precision than what is defined in the schema raises
> exception in SparkSQL but evaluates to NULL for DataFrame
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-40439
>                 URL: https://issues.apache.org/jira/browse/SPARK-40439
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>
> h3. Describe the bug
> We are trying to store a DECIMAL value {{333.22}} with more precision than what is defined in the schema: {{DECIMAL(20,10)}}. This leads to a {{NULL}} value being stored if the table is created using DataFrames via {{spark-shell}}. However, it leads to the following exception if the table is created via {{spark-sql}}:
> {code:java}
> Failed in [insert into decimal_extra_precision select 333.22]
> java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) cannot be represented as Decimal(20, 10){code}
> h3. Steps to reproduce
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}, execute the following:
> {code:java}
> create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
> insert into decimal_extra_precision select 333.22;{code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type & input combination ({{DECIMAL(20,10)}} and {{333.22}}).
> Here is a simplified example in {{spark-shell}}, where insertion of the aforementioned decimal value evaluates to a {{NULL}}:
> {code:java}
> scala> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.{Row, SparkSession}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"))))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:27
> scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), true))
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,DecimalType(20,10),true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> scala> df.show()
> +----+
> |  c1|
> +----+
> |null|
> +----+
> scala> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
> 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> scala> spark.sql("select * from decimal_extra_precision;")
> res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
> {code}
> h3. Root Cause
> The exception is raised from [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] ({{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled}} in [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551]):
> {code:java}
> private[sql] def toPrecision(
>     precision: Int,
>     scale: Int,
>     roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
>     nullOnOverflow: Boolean = true,
>     context: SQLQueryContext = null): Decimal = {
>   val copy = clone()
>   if (copy.changePrecision(precision, scale, roundMode)) {
>     copy
>   } else {
>     if (nullOnOverflow) {
>       null
>     } else {
>       throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
>         this, precision, scale, context)
>     }
>   }
> }{code}
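Based on the workaround discussed in the comment above, the pre-ANSI overflow-to-NULL behaviour can be restored for a spark-sql session by switching the store-assignment policy. A hedged sketch (the config key and the {{LEGACY}} value come from the comment; that the default policy in Spark 3.x is ANSI is an assumption here):

```sql
-- In spark-sql: revert to the legacy store-assignment behaviour for this
-- session, so the overflowing literal is stored as NULL instead of raising.
SET spark.sql.storeAssignmentPolicy=LEGACY;
insert into decimal_extra_precision select 333.22;
```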
[jira] [Commented] (SPARK-31404) file source backward compatibility after calendar switch
[ https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607326#comment-17607326 ] Sachit commented on SPARK-31404:
---
Hi [~cloud_fan], could you please confirm whether we need to set the properties below to ensure that Spark 3.x can read data written by Spark 2.4.x?

spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")

Regards, Sachit

> file source backward compatibility after calendar switch
> ---------------------------------------------------------
>
>                 Key: SPARK-31404
>                 URL: https://issues.apache.org/jira/browse/SPARK-31404
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Blocker
>             Fix For: 3.0.0
>
>      Attachments: Switch to Java 8 time API in Spark 3.0.pdf
>
> In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java 8 datetime APIs. This makes Spark follow the ISO and SQL standards, but introduces some backward compatibility problems:
> 1. Spark 3.0 may read wrong data from data files written by Spark 2.4
> 2. it may have a performance regression

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
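The compatibility problem in the issue above comes from the calendar switch itself: before the October 1582 cutover, the hybrid Julian/Gregorian calendar used by Spark 2.4 (via the legacy java.sql types) and the proleptic Gregorian calendar used by Spark 3.0 assign the same nominal date to different physical days, which is why old Parquet INT96 timestamps may need rebasing on read. A small illustration using the standard Julian day number (JDN) formulas — this is general calendar arithmetic, not Spark code:

```python
def gregorian_jdn(y: int, m: int, d: int) -> int:
    # Julian day number of a proleptic Gregorian calendar date.
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - y2 // 100 + y2 // 400 - 32045

def julian_jdn(y: int, m: int, d: int) -> int:
    # Julian day number of a Julian calendar date.
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083

# The 1582 cutover: Julian 1582-10-04 was immediately followed by
# Gregorian 1582-10-15 -- the same uninterrupted sequence of physical days.
assert gregorian_jdn(1582, 10, 15) == julian_jdn(1582, 10, 4) + 1

# Before the cutover, the same nominal date names different physical days,
# so reinterpreting a hybrid-calendar date as proleptic Gregorian shifts it:
print(julian_jdn(1200, 1, 1) - gregorian_jdn(1200, 1, 1))  # prints 7
```

The `CORRECTED` rebase modes tell Spark 3.x to skip this shift and read/write the stored values as-is.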
[jira] [Commented] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607314#comment-17607314 ] xsys commented on SPARK-40439: -- [~hyukjin.kwon]: Thank you for your response! Setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY }}works. I believe it could get non-trivial for users to discover that {{spark.sql.storeAssignmentPolicy}} would work. For instance, after inspecting the code, I thought that {{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled and}} I tried to achieve the desired behaviour by altering it (but to no avail). Could we add the usage of {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} to the error message? > DECIMAL value with more precision than what is defined in the schema raises > exception in SparkSQL but evaluates to NULL for DataFrame > - > > Key: SPARK-40439 > URL: https://issues.apache.org/jira/browse/SPARK-40439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > We are trying to store a DECIMAL value {{333.22}} with more > precision than what is defined in the schema: {{{}DECIMAL(20,10){}}}. This > leads to a {{NULL}} value being stored if the table is created using > DataFrames via {{{}spark-shell{}}}. However, it leads to the following > exception if the table is created via {{{}spark-sql{}}}: > {code:java} > Failed in [insert into decimal_extra_precision select 333.22] > java.lang.ArithmeticException: > Decimal(expanded,333.22,21,10}) cannot be represented as > Decimal(20, 10){code} > h3. 
Steps to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}, execute the following: > {code:java} > create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC; > insert into decimal_extra_precision select 333.22;{code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}DECIMAL(20,10){}}} and {{{}333.22{}}}). > Here is a simplified example in {{{}spark-shell{}}}, where insertion of the > aforementioned decimal value evaluates to a {{{}NULL{}}}: > {code:java} > scala> import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.{Row, SparkSession} > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> val rdd = > sc.parallelize(Seq(Row(BigDecimal("333.22")))) > rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > ParallelCollectionRDD[0] at parallelize at <console>:27 > scala> val schema = new StructType().add(StructField("c1", DecimalType(20, > 10), true)) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(c1,DecimalType(20,10),true)) > scala> val df = spark.createDataFrame(rdd, schema) > df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > scala> df.show() > +----+ > |  c1| > +----+ > |null| > +----+ > scala> > df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision") > 22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > scala> spark.sql("select * from decimal_extra_precision;") > res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)] > {code} > h3. 
Root Cause > The exception is being raised from > [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] > ({{{}nullOnOverflow{}}} is controlled by {{spark.sql.ansi.enabled}} in > [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551].): > {code:java} > private[sql] def toPrecision( > precision: Int, > scale: Int, > roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP, > nullOnOverflow: Boolean = true, > context: SQLQueryContext = null): Decimal = { > val copy = clone() > if (copy.changePrecision(precision, scale, roundMode)) { > copy > } else { > if (nullOnOverflow) { > null > } else { > throw QueryExecutionErrors.cannotChangeDecimalPrecisionError( > this, precision, scale, context) > } > } > }{code} > The above function is invoked from > [toPrecision|https://github.com/apache/spark/blob/
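The three-way outcome of {{toPrecision}} quoted above (return the rounded value, return null, or raise) can be sketched with Python's stdlib {{decimal}} module. This is a hedged, illustrative stand-in, not Spark's actual Scala code: the function name, the digit-counting overflow check, and the overflowing sample value are assumptions for the sketch.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_precision(value, precision, scale, null_on_overflow=True):
    """Illustrative analog of Spark's Decimal.toPrecision (not Spark code):
    round to `scale` decimal places, then check the result still fits in
    `precision` total significant digits."""
    quantized = value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    if len(quantized.as_tuple().digits) <= precision:
        return quantized
    if null_on_overflow:
        return None  # the DataFrame write path takes this branch and stores NULL
    # the spark-sql insert path (nullOnOverflow = false) raises instead
    raise ArithmeticError(
        f"{value} cannot be represented as Decimal({precision}, {scale})")

# A value whose integer part needs more than 10 digits cannot fit DECIMAL(20,10):
print(to_precision(Decimal("333.22"), 20, 10))           # fits after rounding
print(to_precision(Decimal("12345678901.22"), 20, 10))   # overflows -> None
```

As the comment above notes, setting {{spark.sql.storeAssignmentPolicy}} to {{LEGACY}} makes the SQL insert path behave like the null-on-overflow branch as well.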
[jira] [Assigned] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39494: Assignee: Apache Spark > Support `createDataFrame` from a list of scalars when schema is not provided > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Currently, DataFrame creation from a list of native Python scalars is > unsupported in PySpark, for example, > {{>>> spark.createDataFrame([1, 2]).collect()}} > {{Traceback (most recent call last):}} > {{...}} > {{TypeError: Can not infer schema for type: }} > {{However, Spark DataFrame Scala API supports that:}} > {{scala> Seq(1, 2).toDF().collect()}} > {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. > See more > [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39494: Assignee: (was: Apache Spark) > Support `createDataFrame` from a list of scalars when schema is not provided > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, DataFrame creation from a list of native Python scalars is > unsupported in PySpark, for example, > {{>>> spark.createDataFrame([1, 2]).collect()}} > {{Traceback (most recent call last):}} > {{...}} > {{TypeError: Can not infer schema for type: }} > {{However, Spark DataFrame Scala API supports that:}} > {{scala> Seq(1, 2).toDF().collect()}} > {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. > See more > [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
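Until this lands, a common stop-gap is to wrap each scalar in a one-field record so PySpark's existing schema inference can handle it. A hedged sketch (the column name "value" is an assumption, and no SparkSession is constructed here):

```python
# PySpark can infer a schema from one-field records even though it cannot
# from bare scalars, so wrap each scalar in a single-element tuple.
scalars = [1, 2]
rows = [(v,) for v in scalars]  # [(1,), (2,)]

# With a live SparkSession `spark` (not created in this sketch), this would be:
#   spark.createDataFrame(rows, ["value"]).collect()
```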
[jira] [Commented] (SPARK-40357) Migrate window type check failures onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607277#comment-17607277 ] Max Gekk commented on SPARK-40357: -- [~lvshaokang] Sure, go ahead. > Migrate window type check failures onto error classes > - > > Key: SPARK-40357 > URL: https://issues.apache.org/jira/browse/SPARK-40357 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in window > expressions: > 1. WindowSpecDefinition (4): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85 > 2. SpecifiedWindowFrame (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231 > 3. checkBoundary (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269 > 4. FrameLessOffsetWindowFunction (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40357) Migrate window type check failures onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607275#comment-17607275 ] Shaokang Lv commented on SPARK-40357: - Hi [~maxgekk], I would like to do some work and pick up this issue if possible. > Migrate window type check failures onto error classes > - > > Key: SPARK-40357 > URL: https://issues.apache.org/jira/browse/SPARK-40357 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in window > expressions: > 1. WindowSpecDefinition (4): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L68-L85 > 2. SpecifiedWindowFrame (3): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L216-L231 > 3. checkBoundary (2): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L264-L269 > 4. FrameLessOffsetWindowFunction (1): > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L424 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns
[ https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607270#comment-17607270 ] Joost Farla commented on SPARK-34805: - [~cloud_fan] I was running into the exact same issue using Spark v3.3.0. It looks like the fix was merged into the 3.3 branch (on March 21st), but was not yet released as part of v3.3. It is also not mentioned in the release notes. Is that possible? Thanks in advance! > PySpark loses metadata in DataFrame fields when selecting nested columns > > > Key: SPARK-34805 > URL: https://issues.apache.org/jira/browse/SPARK-34805 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1, 3.1.1 >Reporter: Mark Ressler >Priority: Major > Fix For: 3.3.0 > > Attachments: jsonMetadataTest.py, nested_columns_metadata.scala > > > For a DataFrame schema with nested StructTypes, where metadata is set for > fields in the schema, that metadata is lost when a DataFrame selects nested > fields. For example, suppose > {code:java} > df.schema.fields[0].dataType.fields[0].metadata > {code} > returns a non-empty dictionary, then > {code:java} > df.select('Field0.SubField0').schema.fields[0].metadata{code} > returns an empty dictionary, where "Field0" is the name of the first field in > the DataFrame and "SubField0" is the name of the first nested field under > "Field0". > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40479) Migrate unexpected input type error to an error class
[ https://issues.apache.org/jira/browse/SPARK-40479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40479. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37921 [https://github.com/apache/spark/pull/37921] > Migrate unexpected input type error to an error class > - > > Key: SPARK-40479 > URL: https://issues.apache.org/jira/browse/SPARK-40479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Migrate the function ExpectsInputTypes.checkInputDataTypes onto > DataTypeMismatch and introduce new error class. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40479) Migrate unexpected input type error to an error class
[ https://issues.apache.org/jira/browse/SPARK-40479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40479: Assignee: Max Gekk > Migrate unexpected input type error to an error class > - > > Key: SPARK-40479 > URL: https://issues.apache.org/jira/browse/SPARK-40479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Migrate the function ExpectsInputTypes.checkInputDataTypes onto > DataTypeMismatch and introduce new error class. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40491: - Issue Type: Task (was: New Feature) Priority: Trivial (was: Major) This didn't need a JIRA - it was not Major. Please set the fields appropriately > Remove too old TODO for JdbcRDD > --- > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Trivial > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit if this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create jdbc RDD. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40489) Spark 3.3.0 breaks with SLF4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607188#comment-17607188 ] Garret Wilson edited comment on SPARK-40489 at 9/20/22 1:19 PM: {quote}It sounds like the new major version upgrade is done in a month ago and we don't quite know about stability.{quote} You're missing the point. This ticket is not a request for Spark to move to SLF4J 2.x. The ticket is a bug report for Spark to stop access {{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, so it won't break for people who do move to SLF4J 2.x Spark can keep using SLF4J 1.x as long as it wants. {quote}The comment about log4j1 is moot as recent version of Spark uses log4j2.{quote} This too is missing the point. You shouldn't be using Log4J2 directly. You should be coding to the SLF4J public API. Directly accessing one particular SLF4J implementation is just making it cumbersome for everybody else because you're not playing by the rules. was (Author: garretwilson): {quote}It sounds like the new major version upgrade is done in a month ago and we don't quite know about stability.{quote} You're missing the point. This ticket is not a request for Spark to move to SLF4J 2.x. The ticket is a bug report for Spark to stop access {{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, so it won't break for people who do move to SLF4J 2.x Spark can keep using SLF4J 1.x as long as it wants. {quote}The comment about log4j1 is moot as recent version of Spark uses log4j2.{quote} This too is missing the point. You shouldn't be using Log4J2 directly. You should be coding to the SLF4J public API. Directly accessing one particular SLF4J is just making it cumbersome for everybody else because you're not playing by the rules. > Spark 3.3.0 breaks with SFL4J 2. 
> > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Major > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. > *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). 
Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.(SparkContext
[jira] [Comment Edited] (SPARK-40489) Spark 3.3.0 breaks with SLF4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607188#comment-17607188 ] Garret Wilson edited comment on SPARK-40489 at 9/20/22 1:18 PM: {quote}It sounds like the new major version upgrade is done in a month ago and we don't quite know about stability.{quote} You're missing the point. This ticket is not a request for Spark to move to SLF4J 2.x. The ticket is a bug report for Spark to stop access {{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, so it won't break for people who do move to SLF4J 2.x Spark can keep using SLF4J 1.x as long as it wants. {quote}The comment about log4j1 is moot as recent version of Spark uses log4j2.{quote} This too is missing the point. You shouldn't be using Log4J2 directly. You should be coding to the SLF4J public API. Directly accessing one particular SLF4J is just making it cumbersome for everybody else because you're not playing by the rules. was (Author: garretwilson): {quote}It sounds like the new major version upgrade is done in a month ago and we don't quite know about stability.{quote} You're missing the point. This ticket is not a request for Spark to move to SLF4J 2.x. The ticket is a bug report for Spark to stop access {{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, so it won't break for people who do move to SLF4J 2.x Spark can keep using SLF4J 1.x as long as it wants. {quote}The comment about log4j1 is moot as recent version of Spark uses log4j2.{quote} This too is missing the point. You shouldn't be using Log4J2 directly. You should be coding to the SLF4J public API. Directly accessing one particular SLF4J is just making it cumbersome for everybody else because you're playing by the rules. > Spark 3.3.0 breaks with SFL4J 2. 
> > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Major > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. > *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). 
Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.(SparkContext.scala:195) >
[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SLF4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607188#comment-17607188 ] Garret Wilson commented on SPARK-40489: --- {quote}It sounds like the new major version upgrade is done in a month ago and we don't quite know about stability.{quote} You're missing the point. This ticket is not a request for Spark to move to SLF4J 2.x. The ticket is a bug report for Spark to stop access {{StaticLoggerBinder}}, which isn't part of the SLF4J API in the first place, so it won't break for people who do move to SLF4J 2.x Spark can keep using SLF4J 1.x as long as it wants. {quote}The comment about log4j1 is moot as recent version of Spark uses log4j2.{quote} This too is missing the point. You shouldn't be using Log4J2 directly. You should be coding to the SLF4J public API. Directly accessing one particular SLF4J is just making it cumbersome for everybody else because you're playing by the rules. > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Major > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. 
> *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.(SparkContext.scala:195) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) > at 
scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) > {noformat} > This is because Spark is playing low-level tricks to find out if the logging > platform is Log4J, and relying on {{StaticLoggerBinder}} to do it. > {code} > private def isLog4j2(): Boolean = { > // This distinguishes the log4j 1.2 binding, currently > // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, > currently > // org.apache.logging.slf4j.Log4jLoggerFactory > val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr > "org.apache.logging.slf4j.Log4jLoggerFactory".
[jira] [Comment Edited] (SPARK-40489) Spark 3.3.0 breaks with SLF4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606852#comment-17606852 ] Garret Wilson edited comment on SPARK-40489 at 9/20/22 1:13 PM: # Dropping explicit Log4J 1.x support is certainly one of the things that needs to be done immediately. Not only is it full of vulnerabilities, it [reached end of life|https://logging.apache.org/log4j/1.2/] over five years ago! # The Log4J implementation dependencies should be removed from Spark as well. See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people seem to have given any thought or care about, given the zero responses I have received so far). # And of course {{StaticLoggerBinder}} references should be abandoned. All this should have been done years ago. I mention this to give it some sense of urgency, in light of what I will say next. I hesitate to even mention the following, because it might lower the priority of the ticket, but for those who might be in a pickle, I just released {{io.clogr:slf4j1-shim:0.8.3}} to Maven Central, which is a [shim|https://github.com/globalmentor/clogr/tree/master/slf4j1-shim] that will keep Spark from breaking in the face of SLF4J 2.x. Just include it as a dependency and Spark will stop breaking. *But this is a stop-gap measure! Please fix this bug!* :) was (Author: garretwilson): # Dropping explicit Log4J 1.x support is certainly one of the things that needs to be done immediately. Not only is it full of vulnerabilities, it [reached end of life|https://logging.apache.org/log4j/1.2/] over five years ago! # The Log4J implementation dependencies should be removed from Spark as well. 
See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people seem to have given any thought or care about, given the zero responses I have received so far). # And of course {{StaticLoggerBinder}} references should be abandoned. All this should have been done years ago. I mention this to give it some sense of urgency, in light of what I will say next. I hesitate to even mention the following, because it might lower the priority of the ticket, but for those who might be in a pickle, I just released {{io.clogr:clogr-slf4j1-adapter:0.8.2}} to Maven Central, which is an [adapter|https://github.com/globalmentor/clogr/tree/master/clogr-slf4j1-adapter] (a shim, really) that will keep Spark from breaking in the face of SLF4J 2.x. Just include it as a dependency and Spark will stop breaking. *But this is a stop-gap measure! Please fix this bug!* :) > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Major > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. 
> *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initialize
[jira] [Assigned] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40506: Assignee: (was: Apache Spark) > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > The Spark StreamingSource metrics sourceName is inappropriate. The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`; > the Spark app name is not needed. > This makes it hard to use metrics for different Spark applications over time, > and it makes the sourceName convention inconsistent: > {code:java} > // code placeholder > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > For example, other metrics sourceNames don't include the appName: > {code:java} > // code placeholder > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
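The effect of the proposed change is easiest to see by assembling the full metric name both ways. This is a toy sketch, not Spark's MetricsSystem code: the `appId.instance.sourceName.metric` layout is inferred from the labels quoted in the issue (dots here; the reported labels render the separators as underscores).

```java
public class MetricNameSketch {
    // Hypothetical helper: join the metric name components in the order the
    // reported labels show them (appId, instance, sourceName, metric).
    static String metricName(String appId, String instance, String sourceName, String metric) {
        return String.join(".", appId, instance, sourceName, metric);
    }

    public static void main(String[] args) {
        String metric = "streaming.lastCompletedBatch.processingEndTime";
        // Today: sourceName embeds the app name, so every app gets a distinct prefix.
        String current = metricName("application_x", "driver",
                "NetworkWordCount.StreamingMetrics", metric);
        // Proposed: a constant sourceName, matching LiveListenerBus and the
        // other built-in sources, so dashboards work across applications.
        String proposed = metricName("application_x", "driver",
                "StreamingMetrics", metric);
        System.out.println(current);
        System.out.println(proposed);
    }
}
```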
[jira] [Assigned] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40506: Assignee: Apache Spark > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Assignee: Apache Spark >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607167#comment-17607167 ] Apache Spark commented on SPARK-40506: -- User 'Kwafoor' has created a pull request for this issue: https://github.com/apache/spark/pull/37951 > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40506) Spark Streaming metrics name don't need application name
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 王俊博 updated SPARK-40506: Summary: Spark Streaming metrics name don't need application name (was: Spark Streaming Metrics SourceName is unsuitable) > Spark Streaming metrics name don't need application name > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40506) Spark Streaming Metrics SourceName is unsuitable
[ https://issues.apache.org/jira/browse/SPARK-40506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 王俊博 updated SPARK-40506: Description: Spark StreamingSource Metrics sourceName is inappropriate.The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime `, instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, the Spark app name is not need. This makes it hard to use metrics for different Spark applications over time. And this makes the metrics sourceName standard inconsistent {code:java} //代码占位符 private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And for example, other metrics sourceName don't have appName. {code:java} //代码占位符 private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} was: Spark StreamingSource Metrics sourceName is inappropriate.The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime `, instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, the Spark app name is not need. {code:java} //代码占位符 private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And for example, other metrics sourceName don't have appName. 
{code:java} //代码占位符 private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} > Spark Streaming Metrics SourceName is unsuitable > > > Key: SPARK-40506 > URL: https://issues.apache.org/jira/browse/SPARK-40506 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.2.2 >Reporter: 王俊博 >Priority: Trivial > > Spark StreamingSource Metrics sourceName is inappropriate.The label now > looks like > `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime > `, instead of > `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, > the Spark app name is not need. > This makes it hard to use metrics for different Spark applications over time. > And this makes the metrics sourceName standard inconsistent > {code:java} > //代码占位符 > private[streaming] class StreamingSource(ssc: StreamingContext) extends > Source { > override val metricRegistry = new MetricRegistry > override val sourceName = > "%s.StreamingMetrics".format(ssc.sparkContext.appName) > > }{code} > And for example, other metrics sourceName don't have appName. > {code:java} > //代码占位符 > private[spark] class LiveListenerBusMetrics(conf: SparkConf) > extends Source with Logging { > override val sourceName: String = "LiveListenerBus" > override val metricRegistry: MetricRegistry = new MetricRegistry > ... > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40506) Spark Streaming Metrics SourceName is unsuitable
王俊博 created SPARK-40506: --- Summary: Spark Streaming Metrics SourceName is unsuitable Key: SPARK-40506 URL: https://issues.apache.org/jira/browse/SPARK-40506 Project: Spark Issue Type: Improvement Components: DStreams Affects Versions: 3.2.2 Reporter: 王俊博 Spark StreamingSource Metrics sourceName is inappropriate.The label now looks like `application_x__driver_NetworkWordCount_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime `, instead of `application_x__driver_StreamingMetrics_streaming_lastCompletedBatch_processingEndTime`, the Spark app name is not need. {code:java} //代码占位符 private[streaming] class StreamingSource(ssc: StreamingContext) extends Source { override val metricRegistry = new MetricRegistry override val sourceName = "%s.StreamingMetrics".format(ssc.sparkContext.appName) }{code} And for example, other metrics sourceName don't have appName. {code:java} //代码占位符 private[spark] class LiveListenerBusMetrics(conf: SparkConf) extends Source with Logging { override val sourceName: String = "LiveListenerBus" override val metricRegistry: MetricRegistry = new MetricRegistry ... } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
[ https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607151#comment-17607151 ] Apache Spark commented on SPARK-40505: -- User 'bryanck' has created a pull request for this issue: https://github.com/apache/spark/pull/37950 > Remove min heap setting in Kubernetes Dockerfile entrypoint > --- > > Key: SPARK-40505 > URL: https://issues.apache.org/jira/browse/SPARK-40505 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bryan Keller >Priority: Major > > The entrypoint script for the Kubernetes Dockerfile sets the Java min heap > setting (-Xms) to be the same as the max setting (-Xmx) for the executor > process. This prevents the JVM from shrinking the heap and can lead to > excessive memory usage in some scenarios. Removing the min heap setting is > consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
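As a rough sketch of the proposed change: the real entrypoint is a shell script, but the filter below illustrates the intent, dropping the `-Xms` flag while keeping `-Xmx`, so the JVM starts from its default initial heap and the collector can shrink the heap at runtime.

```java
import java.util.ArrayList;
import java.util.List;

public class ExecutorJvmArgs {
    // Drop any -Xms flag from the executor launch arguments. With no explicit
    // minimum, the JVM may return unused heap to the OS, which is the
    // behavior the issue asks for (and matches YARN executor startup).
    static List<String> withoutMinHeap(List<String> jvmArgs) {
        List<String> out = new ArrayList<>();
        for (String arg : jvmArgs) {
            if (!arg.startsWith("-Xms")) out.add(arg);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical argument list resembling what the entrypoint builds today.
        List<String> before = List.of("-Xms2g", "-Xmx2g", "-cp", "spark.jar");
        System.out.println(withoutMinHeap(before));
    }
}
```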
[jira] [Assigned] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
[ https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40505: Assignee: Apache Spark > Remove min heap setting in Kubernetes Dockerfile entrypoint > --- > > Key: SPARK-40505 > URL: https://issues.apache.org/jira/browse/SPARK-40505 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bryan Keller >Assignee: Apache Spark >Priority: Major > > The entrypoint script for the Kubernetes Dockerfile sets the Java min heap > setting (-Xms) to be the same as the max setting (-Xmx) for the executor > process. This prevents the JVM from shrinking the heap and can lead to > excessive memory usage in some scenarios. Removing the min heap setting is > consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
[ https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40505: Assignee: (was: Apache Spark) > Remove min heap setting in Kubernetes Dockerfile entrypoint > --- > > Key: SPARK-40505 > URL: https://issues.apache.org/jira/browse/SPARK-40505 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bryan Keller >Priority: Major > > The entrypoint script for the Kubernetes Dockerfile sets the Java min heap > setting (-Xms) to be the same as the max setting (-Xmx) for the executor > process. This prevents the JVM from shrinking the heap and can lead to > excessive memory usage in some scenarios. Removing the min heap setting is > consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
[ https://issues.apache.org/jira/browse/SPARK-40505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607149#comment-17607149 ] Apache Spark commented on SPARK-40505: -- User 'bryanck' has created a pull request for this issue: https://github.com/apache/spark/pull/37950 > Remove min heap setting in Kubernetes Dockerfile entrypoint > --- > > Key: SPARK-40505 > URL: https://issues.apache.org/jira/browse/SPARK-40505 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0, 3.2.2 >Reporter: Bryan Keller >Priority: Major > > The entrypoint script for the Kubernetes Dockerfile sets the Java min heap > setting (-Xms) to be the same as the max setting (-Xmx) for the executor > process. This prevents the JVM from shrinking the heap and can lead to > excessive memory usage in some scenarios. Removing the min heap setting is > consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40505) Remove min heap setting in Kubernetes Dockerfile entrypoint
Bryan Keller created SPARK-40505: Summary: Remove min heap setting in Kubernetes Dockerfile entrypoint Key: SPARK-40505 URL: https://issues.apache.org/jira/browse/SPARK-40505 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.2.2, 3.3.0 Reporter: Bryan Keller The entrypoint script for the Kubernetes Dockerfile sets the Java min heap setting (-Xms) to be the same as the max setting (-Xmx) for the executor process. This prevents the JVM from shrinking the heap and can lead to excessive memory usage in some scenarios. Removing the min heap setting is consistent with YARN executor startup. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40501) Add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40501: Description: h4. The query takes a very long time to return, still running after 20 minutes... when running the following code in spark-shell: spark.sql("select * from xxx where event_day = '20220919' limit 1").show() [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] was: h4. It took a long time to fetch out when run as follow code in spark-shell: spark.sql("select * from xxx where event_day = '20220919' limit 1").show() [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] > Add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > h4. The query takes a very long time to return, still running after 20 minutes... > when running the following code in spark-shell: > spark.sql("select * from xxx where event_day = '20220919' limit 1").show() > [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] > [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png]
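The intuition behind pairing the projection with the limit can be shown with a toy eager evaluator. This is not Spark's optimizer rule, only an illustration of why evaluating a projection over just the limited rows is cheaper than projecting every row first; `project` is a hypothetical per-row expression evaluation.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitBeforeProjection {
    static int projectionCalls = 0;

    // Stand-in for evaluating projection expressions on one row;
    // the counter records how much per-row work was done.
    static String project(String row) {
        projectionCalls++;
        return row.toUpperCase();
    }

    static List<String> projectAll(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String r : rows) out.add(project(r));
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a", "b", "c", "d");

        // Project every row, then keep one: per-row work scales with the input.
        projectionCalls = 0;
        List<String> projectFirst = projectAll(rows).subList(0, 1);
        int costProjectFirst = projectionCalls;

        // Keep one row first, then project: per-row work scales with the limit.
        projectionCalls = 0;
        List<String> limitFirst = projectAll(rows.subList(0, 1));
        int costLimitFirst = projectionCalls;

        System.out.println(costProjectFirst + " projections vs " + costLimitFirst);
    }
}
```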
[jira] [Updated] (SPARK-40501) Add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40501: Summary: Add PushProjectionThroughLimit for Optimizer (was: add PushProjectionThroughLimit for Optimizer) > Add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > h4. It took a long time to fetch out > when run as follow code in spark-shell: > spark.sql("select * from xxx where event_day = '20220919' limit 1").show() > [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] > [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40504) Make yarn appmaster load config from client
[ https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607092#comment-17607092 ] Apache Spark commented on SPARK-40504: -- User 'zhengchenyu' has created a pull request for this issue: https://github.com/apache/spark/pull/37949 > Make yarn appmaster load config from client > --- > > Key: SPARK-40504 > URL: https://issues.apache.org/jira/browse/SPARK-40504 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: zhengchenyu >Priority: Major > > In YARN federation mode, the config on the client side and the NM side may differ. > The AppMaster should override its config with the client-side values. > For example: > on the client side, yarn.resourcemanager.ha.rm-ids points to the YARN routers; > on the NM side, yarn.resourcemanager.ha.rm-ids points to the RMs of the subclusters.
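A minimal sketch of the override semantics being asked for, assuming a simple last-writer-wins merge. The real change would act on Hadoop `Configuration` objects inside the AM, not on a plain `Map`; this only shows which side should win for a key like `yarn.resourcemanager.ha.rm-ids`.

```java
import java.util.HashMap;
import java.util.Map;

public class AmConfigMerge {
    // Start from the node-manager-side config and let entries shipped from
    // the submitting client take precedence.
    static Map<String, String> overrideWithClient(Map<String, String> nmSide,
                                                  Map<String, String> clientSide) {
        Map<String, String> merged = new HashMap<>(nmSide);
        merged.putAll(clientSide); // client values win on key collisions
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> nm = Map.of("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
        Map<String, String> client = Map.of("yarn.resourcemanager.ha.rm-ids", "router1");
        // The AM should end up talking to the router the client submitted through.
        System.out.println(overrideWithClient(nm, client).get("yarn.resourcemanager.ha.rm-ids"));
    }
}
```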
[jira] [Assigned] (SPARK-40504) Make yarn appmaster load config from client
[ https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40504: Assignee: Apache Spark > Make yarn appmaster load config from client > --- > > Key: SPARK-40504 > URL: https://issues.apache.org/jira/browse/SPARK-40504 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: zhengchenyu >Assignee: Apache Spark >Priority: Major > > In yarn federation mode, config in client side and nm side may be different. > AppMaster should override config from client side. > For example: > In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. > In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40504) Make yarn appmaster load config from client
[ https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40504: Assignee: (was: Apache Spark) > Make yarn appmaster load config from client > --- > > Key: SPARK-40504 > URL: https://issues.apache.org/jira/browse/SPARK-40504 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: zhengchenyu >Priority: Major > > In yarn federation mode, config in client side and nm side may be different. > AppMaster should override config from client side. > For example: > In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. > In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40457) upgrade jackson data mapper to latest
[ https://issues.apache.org/jira/browse/SPARK-40457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607082#comment-17607082 ] Bilna commented on SPARK-40457: --- [~hyukjin.kwon] it is org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13 > upgrade jackson data mapper to latest > -- > > Key: SPARK-40457 > URL: https://issues.apache.org/jira/browse/SPARK-40457 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bilna >Priority: Major > > Upgrade jackson-mapper-asl to the latest to resolve CVE-2019-10172 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40504) Make yarn appmaster load config from client
[ https://issues.apache.org/jira/browse/SPARK-40504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengchenyu updated SPARK-40504: Description: In yarn federation mode, config in client side and nm side may be different. AppMaster should override config from client side. For example: In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. was:In yarn federation mode, config in client side and nm side may be different. AppMaster should override config from client side. > Make yarn appmaster load config from client > --- > > Key: SPARK-40504 > URL: https://issues.apache.org/jira/browse/SPARK-40504 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.1 >Reporter: zhengchenyu >Priority: Major > > In yarn federation mode, config in client side and nm side may be different. > AppMaster should override config from client side. > For example: > In client side, yarn.resourcemanager.ha.rm-ids are yarn routers. > In nm side, yarn.resourcemanager.ha.rm-ids are the rms of subcluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607079#comment-17607079 ] Apache Spark commented on SPARK-40327: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37948 > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40504) Make yarn appmaster load config from client
zhengchenyu created SPARK-40504: --- Summary: Make yarn appmaster load config from client Key: SPARK-40504 URL: https://issues.apache.org/jira/browse/SPARK-40504 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.0.1 Reporter: zhengchenyu In yarn federation mode, config in client side and nm side may be different. AppMaster should override config from client side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40327: Assignee: (was: Apache Spark) > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40327: Assignee: Apache Spark > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607077#comment-17607077 ] Apache Spark commented on SPARK-40327: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37948 > Increase pandas API coverage for pandas API on Spark > > > Key: SPARK-40327 > URL: https://issues.apache.org/jira/browse/SPARK-40327 > Project: Spark > Issue Type: Umbrella > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > Increasing the pandas API coverage for Apache Spark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40503) Add resampling to API references
Ruifeng Zheng created SPARK-40503: - Summary: Add resampling to API references Key: SPARK-40503 URL: https://issues.apache.org/jira/browse/SPARK-40503 Project: Spark Issue Type: Sub-task Components: Documentation, ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40491) Remove too old TODO for JdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607045#comment-17607045 ] CaoYu commented on SPARK-40491: --- Maybe we can just not remove these. I have already created https://issues.apache.org/jira/browse/SPARK-40502; please take a look. I want to try to implement a JDBC data source in PySpark. I'm also interested in this task for Scala. If possible, please assign this task to me; I want to try to get it done. > Remove too old TODO for JdbcRDD > --- > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. > In fact, this is a very old TODO and we need to revisit whether this is still > necessary. Since Spark SQL is the new core, I'm not sure if anyone is > interested in a new API to create a JDBC RDD.
[jira] [Created] (SPARK-40502) Support dataframe API use jdbc data source in PySpark
CaoYu created SPARK-40502: - Summary: Support dataframe API use jdbc data source in PySpark Key: SPARK-40502 URL: https://issues.apache.org/jira/browse/SPARK-40502 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.3.0 Reporter: CaoYu When using PySpark, I want to get data from a MySQL database, so I would like to use JdbcRDD as in Java/Scala, but that is not supported in PySpark. For some reasons I cannot use the DataFrame API and can only use the RDD API, even though I know the DataFrame API can read from a JDBC source fairly well. So I want to implement functionality that lets the RDD API read data from a JDBC source in PySpark. *But I don't know whether that is necessary for PySpark, so we can discuss it.* *If it is necessary for PySpark, I want to contribute it to Spark.* *I hope this Jira task can be assigned to me so I can start working on it.* *If not, please close this Jira task.* *Thanks a lot.* -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
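For context on what an RDD-style JDBC source involves: Scala's JdbcRDD splits an inclusive numeric key range into per-partition bounds and runs one bounded query per partition. A minimal pure-Python sketch of that range-splitting step (the function name is hypothetical; this mirrors the spirit of JdbcRDD.getPartitions, not Spark's actual code):

```python
def jdbc_partition_bounds(lower_bound, upper_bound, num_partitions):
    """Split the inclusive key range [lower_bound, upper_bound] into
    contiguous (start, end) bounds, one pair per partition, using the
    stride logic JdbcRDD-style readers typically apply."""
    length = 1 + upper_bound - lower_bound
    bounds = []
    for i in range(num_partitions):
        start = lower_bound + (i * length) // num_partitions
        end = lower_bound + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds

# Each (start, end) pair would parameterize one partition's query, e.g.
#   SELECT * FROM t WHERE id >= start AND id <= end
print(jdbc_partition_bounds(1, 10, 3))  # [(1, 3), (4, 6), (7, 10)]
```

Note that the DataFrame path already exposes the same idea through `spark.read.jdbc` with `column`, `lowerBound`, `upperBound`, and `numPartitions`.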
[jira] [Commented] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607034#comment-17607034 ] Apache Spark commented on SPARK-40500: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37947 > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
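As background on the change above: `iteritems` was deprecated in pandas 1.5.0 (and removed in 2.0) in favor of `items`, which yields the same (index, value) pairs. A quick sketch of the replacement:

```python
import pandas as pd

s = pd.Series({"a": 1, "b": 2})

# Old (deprecated in pandas 1.5, removed in 2.0): s.iteritems()
# New: s.items() yields the same (index, value) pairs
pairs = list(s.items())
print(pairs)  # [('a', 1), ('b', 2)]
```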
[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40500: Assignee: Apache Spark > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40500) Use `pd.items` instead of `pd.iteritems`
[ https://issues.apache.org/jira/browse/SPARK-40500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40500: Assignee: (was: Apache Spark) > Use `pd.items` instead of `pd.iteritems` > > > Key: SPARK-40500 > URL: https://issues.apache.org/jira/browse/SPARK-40500 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-40501: Description: h4. The following query takes a long time to return when run in spark-shell: spark.sql("select * from xxx where event_day = '20220919' limit 1").show() [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] > add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > h4. The following query takes a long time to return > when run in spark-shell: > spark.sql("select * from xxx where event_day = '20220919' limit 1").show() > [!https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png|width=557!|https://user-images.githubusercontent.com/15246973/191211875-c29c3bae-1339-414b-84bc-2195545b8c35.png] > [!https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png|width=1419!|https://user-images.githubusercontent.com/15246973/191212244-22108810-dd66-46bd-bea7-a7dab70a1a06.png] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607033#comment-17607033 ] Apache Spark commented on SPARK-40501: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37941 > add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40501: Assignee: (was: Apache Spark) > add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
[ https://issues.apache.org/jira/browse/SPARK-40501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40501: Assignee: Apache Spark > add PushProjectionThroughLimit for Optimizer > > > Key: SPARK-40501 > URL: https://issues.apache.org/jira/browse/SPARK-40501 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Priority: Blocker (was: Major) > Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop: 3.0.0 > spark: 2.4.0 / 3.2.1 > shuffle:spark 2.4.0 >Reporter: xuanzhiang >Priority: Blocker > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used Spark 2.4.0, the aggregation used ObjectHashAggregate and stage 2 > pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old > shuffle, pulling 140 MB of shuffle data takes 3 hours. > * If we upgrade the shuffle, will we see a performance regression? > * -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
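As background on the query above: PERCENTILE_APPROX computes approximate quantiles with bounded memory rather than sorting all values. The exact nearest-rank percentile it approximates can be sketched in plain Python as a reference point (the helper name is hypothetical, not a Spark API):

```python
import math

def nearest_rank_percentile(values, p):
    """Exact nearest-rank percentile: the smallest element such that at
    least p * 100% of the data is <= it. PERCENTILE_APPROX trades this
    exactness for bounded memory on large datasets."""
    xs = sorted(values)
    k = max(1, math.ceil(p * len(xs)))  # 1-based nearest rank
    return xs[k - 1]

costs = list(range(1, 101))
print(nearest_rank_percentile(costs, 0.5))   # 50
print(nearest_rank_percentile(costs, 0.99))  # 99
```

The slowdown reported here is about shuffle fetch time in 3.2.1, not the percentile math itself, which is why the attached shuffle-data screenshots are the key evidence.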
[jira] [Created] (SPARK-40501) add PushProjectionThroughLimit for Optimizer
BingKun Pan created SPARK-40501: --- Summary: add PushProjectionThroughLimit for Optimizer Key: SPARK-40501 URL: https://issues.apache.org/jira/browse/SPARK-40501 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org