[jira] [Assigned] (SPARK-33092) Support subexpression elimination in ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-33092: --- Assignee: L. C. Hsieh (was: Apache Spark) > Support subexpression elimination in ProjectExec > > > Key: SPARK-33092 > URL: https://issues.apache.org/jira/browse/SPARK-33092 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Users frequently write the same expression repeatedly in a projection. Currently in > ProjectExec, we don't support subexpression elimination in Whole-stage > codegen. We can support it to reduce redundant evaluation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33092) Support subexpression elimination in ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33092: Assignee: Apache Spark (was: L. C. Hsieh) > Support subexpression elimination in ProjectExec > > > Key: SPARK-33092 > URL: https://issues.apache.org/jira/browse/SPARK-33092 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Users frequently write the same expression repeatedly in a projection. Currently in > ProjectExec, we don't support subexpression elimination in Whole-stage > codegen. We can support it to reduce redundant evaluation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33092) Support subexpression elimination in ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33092: Assignee: L. C. Hsieh (was: Apache Spark) > Support subexpression elimination in ProjectExec > > > Key: SPARK-33092 > URL: https://issues.apache.org/jira/browse/SPARK-33092 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Users frequently write the same expression repeatedly in a projection. Currently in > ProjectExec, we don't support subexpression elimination in Whole-stage > codegen. We can support it to reduce redundant evaluation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33092) Support subexpression elimination in ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210024#comment-17210024 ] Apache Spark commented on SPARK-33092: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29975 > Support subexpression elimination in ProjectExec > > > Key: SPARK-33092 > URL: https://issues.apache.org/jira/browse/SPARK-33092 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Users frequently write the same expression repeatedly in a projection. Currently in > ProjectExec, we don't support subexpression elimination in Whole-stage > codegen. We can support it to reduce redundant evaluation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33092) Support subexpression elimination in ProjectExec
L. C. Hsieh created SPARK-33092: --- Summary: Support subexpression elimination in ProjectExec Key: SPARK-33092 URL: https://issues.apache.org/jira/browse/SPARK-33092 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Users frequently write the same expression repeatedly in a projection. Currently in ProjectExec, we don't support subexpression elimination in Whole-stage codegen. We can support it to reduce redundant evaluation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
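As a hedged illustration of the pattern described above (the column names and the str_to_map payload are invented for the example, not taken from the PR itself): each output column below repeats the same str_to_map subexpression, so without subexpression elimination in whole-stage codegen the map is rebuilt once per output column rather than once per row.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").appName("subexpr-demo").getOrCreate()
import spark.implicits._

// One input column whose parsed form is needed by several output columns.
val df = Seq("a=1&b=2&c=3").toDF("payload")

// The same str_to_map(payload, '&', '=') subexpression appears three times;
// without elimination, the generated code evaluates it three times per row.
val projected = df.select(
  expr("str_to_map(payload, '&', '=')").getItem("a").as("a"),
  expr("str_to_map(payload, '&', '=')").getItem("b").as("b"),
  expr("str_to_map(payload, '&', '=')").getItem("c").as("c"))

projected.explain(true)
{code}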
[jira] [Resolved] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33074. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29952 [https://github.com/apache/spark/pull/29952] > Classify dialect exceptions in JDBC v2 Table Catalog > > > Key: SPARK-33074 > URL: https://issues.apache.org/jira/browse/SPARK-33074 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > The current implementation of v2.jdbc.JDBCTableCatalog doesn't handle the > exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog at > all, such as > * NoSuchNamespaceException > * NoSuchTableException > * TableAlreadyExistsException > It either throws the dialect's own exception or a generic AnalysisException. > Since we split the forming of dialect-specific statements from their execution, we > should extend the dialect APIs and ask them how to convert their exceptions to > TableCatalog exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33074: --- Assignee: Maxim Gekk > Classify dialect exceptions in JDBC v2 Table Catalog > > > Key: SPARK-33074 > URL: https://issues.apache.org/jira/browse/SPARK-33074 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > The current implementation of v2.jdbc.JDBCTableCatalog doesn't handle the > exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog at > all, such as > * NoSuchNamespaceException > * NoSuchTableException > * TableAlreadyExistsException > It either throws the dialect's own exception or a generic AnalysisException. > Since we split the forming of dialect-specific statements from their execution, we > should extend the dialect APIs and ask them how to convert their exceptions to > TableCatalog exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
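A minimal sketch of the direction the description asks for, assuming a hypothetical per-dialect hook (the trait name, method signature, and SQLSTATE codes are illustrative, not the API added by the PR): JDBCTableCatalog would route driver exceptions through the dialect instead of rethrowing them or collapsing everything into a generic AnalysisException.
{code:scala}
import java.sql.SQLException

// Hypothetical hook: each dialect maps its driver-specific exceptions onto
// catalog-level failures for the operation that was being executed.
trait DialectExceptionClassifier {
  def classify(operation: String, e: SQLException): Throwable
}

// Illustrative PostgreSQL-flavoured classifier keyed on SQLSTATE codes; the
// RuntimeExceptions stand in for TableAlreadyExistsException, NoSuchTableException, etc.
object PostgresLikeClassifier extends DialectExceptionClassifier {
  override def classify(operation: String, e: SQLException): Throwable = e.getSQLState match {
    case "42P07" => new RuntimeException(s"$operation: table already exists", e)
    case "42P01" => new RuntimeException(s"$operation: no such table", e)
    case "3F000" => new RuntimeException(s"$operation: no such namespace", e)
    case _       => e
  }
}
{code}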
[jira] [Commented] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema
[ https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210015#comment-17210015 ] Apache Spark commented on SPARK-33091: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29974 > Avoid using map instead of foreach to avoid potential side effect at callers > of OrcUtils.readCatalystSchema > --- > > Key: SPARK-33091 > URL: https://issues.apache.org/jira/browse/SPARK-33091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This is a followup of SPARK-32646. A new JIRA was filed to control > the fix versions properly. > When you use {{map}}, it might be lazily evaluated and not executed. To avoid > this, we should use {{foreach}} instead. > See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema
[ https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33091: Assignee: Apache Spark > Avoid using map instead of foreach to avoid potential side effect at callers > of OrcUtils.readCatalystSchema > --- > > Key: SPARK-33091 > URL: https://issues.apache.org/jira/browse/SPARK-33091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > This is a kind of a followup of SPARK-32646. New JIRA was filed to control > the fixed versions properly. > When you use {{map}}, it might be lazily evaluated and not executed. To avoid > this, we should better use {{foreach}}. > See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema
[ https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33091: Assignee: (was: Apache Spark) > Avoid using map instead of foreach to avoid potential side effect at callers > of OrcUtils.readCatalystSchema > --- > > Key: SPARK-33091 > URL: https://issues.apache.org/jira/browse/SPARK-33091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This is a kind of a followup of SPARK-32646. New JIRA was filed to control > the fixed versions properly. > When you use {{map}}, it might be lazily evaluated and not executed. To avoid > this, we should better use {{foreach}}. > See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema
[ https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210014#comment-17210014 ] Apache Spark commented on SPARK-33091: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29974 > Avoid using map instead of foreach to avoid potential side effect at callers > of OrcUtils.readCatalystSchema > --- > > Key: SPARK-33091 > URL: https://issues.apache.org/jira/browse/SPARK-33091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This is a kind of a followup of SPARK-32646. New JIRA was filed to control > the fixed versions properly. > When you use {{map}}, it might be lazily evaluated and not executed. To avoid > this, we should better use {{foreach}}. > See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema
[ https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33091: - Summary: Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema (was: Avoid using map instead of foreach to avoid potential side effect at callee of OrcUtils.readCatalystSchema) > Avoid using map instead of foreach to avoid potential side effect at callers > of OrcUtils.readCatalystSchema > --- > > Key: SPARK-33091 > URL: https://issues.apache.org/jira/browse/SPARK-33091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This is a kind of a followup of SPARK-32646. New JIRA was filed to control > the fixed versions properly. > When you use {{map}}, it might be lazily evaluated and not executed. To avoid > this, we should better use {{foreach}}. > See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at callee of OrcUtils.readCatalystSchema
[ https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33091: - Summary: Avoid using map instead of foreach to avoid potential side effect at callee of OrcUtils.readCatalystSchema (was: Avoid using map instead of foreach to avoid potential side effect at OrcUtils.readCatalystSchema) > Avoid using map instead of foreach to avoid potential side effect at callee > of OrcUtils.readCatalystSchema > -- > > Key: SPARK-33091 > URL: https://issues.apache.org/jira/browse/SPARK-33091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This is a kind of a followup of SPARK-32646. New JIRA was filed to control > the fixed versions properly. > When you use {{map}}, it might be lazily evaluated and not executed. To avoid > this, we should better use {{foreach}}. > See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at OrcUtils.readCatalystSchema
[ https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33091: - Priority: Minor (was: Major) > Avoid using map instead of foreach to avoid potential side effect at > OrcUtils.readCatalystSchema > > > Key: SPARK-33091 > URL: https://issues.apache.org/jira/browse/SPARK-33091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > This is a kind of a followup of SPARK-32646. New JIRA was filed to control > the fixed versions properly. > When you use {{map}}, it might be lazily evaluated and not executed. To avoid > this, we should better use {{foreach}}. > See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at OrcUtils.readCatalystSchema
Hyukjin Kwon created SPARK-33091: Summary: Avoid using map instead of foreach to avoid potential side effect at OrcUtils.readCatalystSchema Key: SPARK-33091 URL: https://issues.apache.org/jira/browse/SPARK-33091 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.1.0 Reporter: Hyukjin Kwon This is a followup of SPARK-32646. A new JIRA was filed to control the fix versions properly. When you use {{map}}, it might be lazily evaluated and not executed. To avoid this, we should use {{foreach}} instead. See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33091) Avoid using map instead of foreach to avoid potential side effect at OrcUtils.readCatalystSchema
[ https://issues.apache.org/jira/browse/SPARK-33091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33091: - Issue Type: Improvement (was: Bug) > Avoid using map instead of foreach to avoid potential side effect at > OrcUtils.readCatalystSchema > > > Key: SPARK-33091 > URL: https://issues.apache.org/jira/browse/SPARK-33091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This is a kind of a followup of SPARK-32646. New JIRA was filed to control > the fixed versions properly. > When you use {{map}}, it might be lazily evaluated and not executed. To avoid > this, we should better use {{foreach}}. > See also SPARK-16694 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
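The {{map}} versus {{foreach}} point repeated in the notifications above is easiest to see with a lazy collection; a minimal sketch (the file names are made up):
{code:scala}
// With a lazy collection, map is only evaluated when its result is consumed,
// so a side effect written inside map can silently never run.
val files: Iterator[String] = Iterator("a.orc", "b.orc")
files.map(f => println(s"reading $f"))                           // prints nothing: the mapped iterator is never consumed

// foreach runs eagerly and states the side-effecting intent directly.
Iterator("a.orc", "b.orc").foreach(f => println(s"reading $f"))  // prints both lines
{code}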
[jira] [Resolved] (SPARK-32282) Improve EnsureRequirements.reorderJoinKeys to handle more scenarios such as PartitioningCollection
[ https://issues.apache.org/jira/browse/SPARK-32282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32282. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29074 [https://github.com/apache/spark/pull/29074] > Improve EnsureRequirements.reorderJoinKeys to handle more scenarios such as > PartitioningCollection > > > Key: SPARK-32282 > URL: https://issues.apache.org/jira/browse/SPARK-32282 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0 > > > EnsureRequirements.reorderJoinKeys can be improved to handle the following > scenarios: > # If the keys cannot be reordered to match the left-side HashPartitioning, > consider the right-side HashPartitioning. > # Handle PartitioningCollection, which may contain HashPartitioning -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
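A hedged sketch of the first scenario in the description (the table names, bucket count, and layout are invented, and whether the exchange disappears depends on the reordering logic): both sides are bucketed on the same columns but in opposite order, and the join lists the keys in yet another order.
{code:scala}
// Two bucketed tables whose bucket columns are the same but declared in opposite order.
spark.range(100).selectExpr("id as a", "id as b")
  .write.bucketBy(8, "a", "b").saveAsTable("t1")
spark.range(100).selectExpr("id as a", "id as b")
  .write.bucketBy(8, "b", "a").saveAsTable("t2")

// Ideally reorderJoinKeys lines the join keys up with either side's
// HashPartitioning so no extra Exchange is inserted.
val joined = spark.table("t1").join(spark.table("t2"), Seq("b", "a"))
joined.explain()
{code}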
[jira] [Comment Edited] (SPARK-32989) Performance regression when selecting from str_to_map
[ https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209990#comment-17209990 ] Yang Jie edited comment on SPARK-32989 at 10/8/20, 3:59 AM: I found that if stringToMap use codegen, the optimization of `spark.sql.subexpressionElimination.enabled` will be ignored. was (Author: luciferyang): I found that if stringToMap use codegen, the optimization of `spark.sql.subexpressionElimination.enabled` will be ignored. > Performance regression when selecting from str_to_map > - > > Key: SPARK-32989 > URL: https://issues.apache.org/jira/browse/SPARK-32989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ondrej Kokes >Priority: Minor > > When I create a map using str_to_map and select more than a single value, I > notice a notable performance regression in 3.0.1 compared to 2.4.7. When > selecting a single value, the performance is the same. Plans are identical > between versions. > It seems like in 2.x the map from str_to_map is preserved for a given row, > but in 3.x it's recalculated for each column. One hint that it might be the > case is that when I tried forcing materialisation of said map in 3.x (by a > coalesce, don't know if there's a better way), I got the performance roughly > to 2.x levels. > Here's a reproducer (the csv in question gets autogenerated by the python > code): > {code:java} > $ head regression.csv > foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > ... (10M more rows) > {code} > {code:python} > import time > import os > import pyspark > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > if __name__ == '__main__': > print(pyspark.__version__) > spark = SparkSession.builder.getOrCreate() > filename = 'regression.csv' > if not os.path.isfile(filename): > with open(filename, 'wt') as fw: > fw.write('foo\n') > for _ in range(10_000_000): > fw.write('foo=bar&baz=bak&bar=foo\n') > df = spark.read.option('header', True).csv(filename) > t = time.time() > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > .select( > f.col('my_map')['foo'], > ) > ) > dd.write.mode('overwrite').csv('tmp') > t2 = time.time() > print('selected one', t2 - t) > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > # .coalesce(100) # forcing evaluation before selection speeds it > up in 3.0.1 > .select( > f.col('my_map')['foo'], > f.col('my_map')['bar'], > f.col('my_map')['baz'], > ) > ) > dd.explain(True) > dd.write.mode('overwrite').csv('tmp') > t3 = time.time() > print('selected three', t3 - t2) > {code} > Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS > (times are in seconds) > {code:java} > # 3.0.1 > # selected one 6.375471830368042 > > # selected three 14.847578048706055 > # 2.4.7 > # selected one 6.679579019546509 > > # selected three 6.5622029304504395 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32989) Performance regression when selecting from str_to_map
[ https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209990#comment-17209990 ] Yang Jie commented on SPARK-32989: -- I found that if stringToMap use codegen, the optimization of `spark.sql.subexpressionElimination.enabled` will be ignored. > Performance regression when selecting from str_to_map > - > > Key: SPARK-32989 > URL: https://issues.apache.org/jira/browse/SPARK-32989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ondrej Kokes >Priority: Minor > > When I create a map using str_to_map and select more than a single value, I > notice a notable performance regression in 3.0.1 compared to 2.4.7. When > selecting a single value, the performance is the same. Plans are identical > between versions. > It seems like in 2.x the map from str_to_map is preserved for a given row, > but in 3.x it's recalculated for each column. One hint that it might be the > case is that when I tried forcing materialisation of said map in 3.x (by a > coalesce, don't know if there's a better way), I got the performance roughly > to 2.x levels. > Here's a reproducer (the csv in question gets autogenerated by the python > code): > {code:java} > $ head regression.csv > foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > ... (10M more rows) > {code} > {code:python} > import time > import os > import pyspark > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > if __name__ == '__main__': > print(pyspark.__version__) > spark = SparkSession.builder.getOrCreate() > filename = 'regression.csv' > if not os.path.isfile(filename): > with open(filename, 'wt') as fw: > fw.write('foo\n') > for _ in range(10_000_000): > fw.write('foo=bar&baz=bak&bar=foo\n') > df = spark.read.option('header', True).csv(filename) > t = time.time() > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > .select( > f.col('my_map')['foo'], > ) > ) > dd.write.mode('overwrite').csv('tmp') > t2 = time.time() > print('selected one', t2 - t) > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > # .coalesce(100) # forcing evaluation before selection speeds it > up in 3.0.1 > .select( > f.col('my_map')['foo'], > f.col('my_map')['bar'], > f.col('my_map')['baz'], > ) > ) > dd.explain(True) > dd.write.mode('overwrite').csv('tmp') > t3 = time.time() > print('selected three', t3 - t2) > {code} > Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS > (times are in seconds) > {code:java} > # 3.0.1 > # selected one 6.375471830368042 > > # selected three 14.847578048706055 > # 2.4.7 > # selected one 6.679579019546509 > > # selected three 6.5622029304504395 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32989) Performance regression when selecting from str_to_map
[ https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209979#comment-17209979 ] Yang Jie edited comment on SPARK-32989 at 10/8/20, 3:24 AM: [~ondrej] You're right, It will execute N times with codegen(SPARK-30356.) when selecting N columns use stringToMap expression compared to selecting One column, cc [~Qin Yao] [~cloud_fan] was (Author: luciferyang): [~ondrej] You're right, It will execute n times with codegen(SPARK-30356.) when select n columns use stringToMap expression, cc [~Qin Yao] [~cloud_fan] > Performance regression when selecting from str_to_map > - > > Key: SPARK-32989 > URL: https://issues.apache.org/jira/browse/SPARK-32989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ondrej Kokes >Priority: Minor > > When I create a map using str_to_map and select more than a single value, I > notice a notable performance regression in 3.0.1 compared to 2.4.7. When > selecting a single value, the performance is the same. Plans are identical > between versions. > It seems like in 2.x the map from str_to_map is preserved for a given row, > but in 3.x it's recalculated for each column. One hint that it might be the > case is that when I tried forcing materialisation of said map in 3.x (by a > coalesce, don't know if there's a better way), I got the performance roughly > to 2.x levels. > Here's a reproducer (the csv in question gets autogenerated by the python > code): > {code:java} > $ head regression.csv > foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > ... (10M more rows) > {code} > {code:python} > import time > import os > import pyspark > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > if __name__ == '__main__': > print(pyspark.__version__) > spark = SparkSession.builder.getOrCreate() > filename = 'regression.csv' > if not os.path.isfile(filename): > with open(filename, 'wt') as fw: > fw.write('foo\n') > for _ in range(10_000_000): > fw.write('foo=bar&baz=bak&bar=foo\n') > df = spark.read.option('header', True).csv(filename) > t = time.time() > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > .select( > f.col('my_map')['foo'], > ) > ) > dd.write.mode('overwrite').csv('tmp') > t2 = time.time() > print('selected one', t2 - t) > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > # .coalesce(100) # forcing evaluation before selection speeds it > up in 3.0.1 > .select( > f.col('my_map')['foo'], > f.col('my_map')['bar'], > f.col('my_map')['baz'], > ) > ) > dd.explain(True) > dd.write.mode('overwrite').csv('tmp') > t3 = time.time() > print('selected three', t3 - t2) > {code} > Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS > (times are in seconds) > {code:java} > # 3.0.1 > # selected one 6.375471830368042 > > # selected three 14.847578048706055 > # 2.4.7 > # selected one 6.679579019546509 > > # selected three 6.5622029304504395 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32989) Performance regression when selecting from str_to_map
[ https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209979#comment-17209979 ] Yang Jie edited comment on SPARK-32989 at 10/8/20, 3:23 AM: [~ondrej] You're right, It will execute n times with codegen(SPARK-30356.) when select n columns use stringToMap expression, cc [~Qin Yao] [~cloud_fan] was (Author: luciferyang): [~ondrej] You're right, It will execute n times with codegen(SPARK-30356.) when select n columns use stringToMap expression, cc [~Qin Yao] > Performance regression when selecting from str_to_map > - > > Key: SPARK-32989 > URL: https://issues.apache.org/jira/browse/SPARK-32989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ondrej Kokes >Priority: Minor > > When I create a map using str_to_map and select more than a single value, I > notice a notable performance regression in 3.0.1 compared to 2.4.7. When > selecting a single value, the performance is the same. Plans are identical > between versions. > It seems like in 2.x the map from str_to_map is preserved for a given row, > but in 3.x it's recalculated for each column. One hint that it might be the > case is that when I tried forcing materialisation of said map in 3.x (by a > coalesce, don't know if there's a better way), I got the performance roughly > to 2.x levels. > Here's a reproducer (the csv in question gets autogenerated by the python > code): > {code:java} > $ head regression.csv > foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > ... (10M more rows) > {code} > {code:python} > import time > import os > import pyspark > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > if __name__ == '__main__': > print(pyspark.__version__) > spark = SparkSession.builder.getOrCreate() > filename = 'regression.csv' > if not os.path.isfile(filename): > with open(filename, 'wt') as fw: > fw.write('foo\n') > for _ in range(10_000_000): > fw.write('foo=bar&baz=bak&bar=foo\n') > df = spark.read.option('header', True).csv(filename) > t = time.time() > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > .select( > f.col('my_map')['foo'], > ) > ) > dd.write.mode('overwrite').csv('tmp') > t2 = time.time() > print('selected one', t2 - t) > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > # .coalesce(100) # forcing evaluation before selection speeds it > up in 3.0.1 > .select( > f.col('my_map')['foo'], > f.col('my_map')['bar'], > f.col('my_map')['baz'], > ) > ) > dd.explain(True) > dd.write.mode('overwrite').csv('tmp') > t3 = time.time() > print('selected three', t3 - t2) > {code} > Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS > (times are in seconds) > {code:java} > # 3.0.1 > # selected one 6.375471830368042 > > # selected three 14.847578048706055 > # 2.4.7 > # selected one 6.679579019546509 > > # selected three 6.5622029304504395 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32989) Performance regression when selecting from str_to_map
[ https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209979#comment-17209979 ] Yang Jie edited comment on SPARK-32989 at 10/8/20, 3:22 AM: [~ondrej] You're right, It will execute n times with codegen(SPARK-30356.) when select n columns use stringToMap expression, cc [~Qin Yao] was (Author: luciferyang): [~ondrej] You're right, It will execute n times with codegen(SPARK-30356.) when select n columns use stringToMap expression. > Performance regression when selecting from str_to_map > - > > Key: SPARK-32989 > URL: https://issues.apache.org/jira/browse/SPARK-32989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ondrej Kokes >Priority: Minor > > When I create a map using str_to_map and select more than a single value, I > notice a notable performance regression in 3.0.1 compared to 2.4.7. When > selecting a single value, the performance is the same. Plans are identical > between versions. > It seems like in 2.x the map from str_to_map is preserved for a given row, > but in 3.x it's recalculated for each column. One hint that it might be the > case is that when I tried forcing materialisation of said map in 3.x (by a > coalesce, don't know if there's a better way), I got the performance roughly > to 2.x levels. > Here's a reproducer (the csv in question gets autogenerated by the python > code): > {code:java} > $ head regression.csv > foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > ... (10M more rows) > {code} > {code:python} > import time > import os > import pyspark > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > if __name__ == '__main__': > print(pyspark.__version__) > spark = SparkSession.builder.getOrCreate() > filename = 'regression.csv' > if not os.path.isfile(filename): > with open(filename, 'wt') as fw: > fw.write('foo\n') > for _ in range(10_000_000): > fw.write('foo=bar&baz=bak&bar=foo\n') > df = spark.read.option('header', True).csv(filename) > t = time.time() > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > .select( > f.col('my_map')['foo'], > ) > ) > dd.write.mode('overwrite').csv('tmp') > t2 = time.time() > print('selected one', t2 - t) > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > # .coalesce(100) # forcing evaluation before selection speeds it > up in 3.0.1 > .select( > f.col('my_map')['foo'], > f.col('my_map')['bar'], > f.col('my_map')['baz'], > ) > ) > dd.explain(True) > dd.write.mode('overwrite').csv('tmp') > t3 = time.time() > print('selected three', t3 - t2) > {code} > Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS > (times are in seconds) > {code:java} > # 3.0.1 > # selected one 6.375471830368042 > > # selected three 14.847578048706055 > # 2.4.7 > # selected one 6.679579019546509 > > # selected three 6.5622029304504395 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32989) Performance regression when selecting from str_to_map
[ https://issues.apache.org/jira/browse/SPARK-32989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209979#comment-17209979 ] Yang Jie commented on SPARK-32989: -- [~ondrej] You're right, It will execute n times with codegen(SPARK-30356.) when select n columns use stringToMap expression. > Performance regression when selecting from str_to_map > - > > Key: SPARK-32989 > URL: https://issues.apache.org/jira/browse/SPARK-32989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Ondrej Kokes >Priority: Minor > > When I create a map using str_to_map and select more than a single value, I > notice a notable performance regression in 3.0.1 compared to 2.4.7. When > selecting a single value, the performance is the same. Plans are identical > between versions. > It seems like in 2.x the map from str_to_map is preserved for a given row, > but in 3.x it's recalculated for each column. One hint that it might be the > case is that when I tried forcing materialisation of said map in 3.x (by a > coalesce, don't know if there's a better way), I got the performance roughly > to 2.x levels. > Here's a reproducer (the csv in question gets autogenerated by the python > code): > {code:java} > $ head regression.csv > foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > foo=bar&baz=bak&bar=foo > ... (10M more rows) > {code} > {code:python} > import time > import os > import pyspark > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > if __name__ == '__main__': > print(pyspark.__version__) > spark = SparkSession.builder.getOrCreate() > filename = 'regression.csv' > if not os.path.isfile(filename): > with open(filename, 'wt') as fw: > fw.write('foo\n') > for _ in range(10_000_000): > fw.write('foo=bar&baz=bak&bar=foo\n') > df = spark.read.option('header', True).csv(filename) > t = time.time() > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > .select( > f.col('my_map')['foo'], > ) > ) > dd.write.mode('overwrite').csv('tmp') > t2 = time.time() > print('selected one', t2 - t) > dd = (df > .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")')) > # .coalesce(100) # forcing evaluation before selection speeds it > up in 3.0.1 > .select( > f.col('my_map')['foo'], > f.col('my_map')['bar'], > f.col('my_map')['baz'], > ) > ) > dd.explain(True) > dd.write.mode('overwrite').csv('tmp') > t3 = time.time() > print('selected three', t3 - t2) > {code} > Results for 2.4.7 and 3.0.1, both installed from PyPI, Python 3.7, macOS > (times are in seconds) > {code:java} > # 3.0.1 > # selected one 6.375471830368042 > > # selected three 14.847578048706055 > # 2.4.7 > # selected one 6.679579019546509 > > # selected three 6.5622029304504395 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33089. -- Fix Version/s: 3.1.0 3.0.2 Assignee: Yuning Zhang Resolution: Fixed Fixed in https://github.com/apache/spark/pull/29971 > avro format does not propagate Hadoop config from DS options to underlying > HDFS file system > --- > > Key: SPARK-33089 > URL: https://issues.apache.org/jira/browse/SPARK-33089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuning Zhang >Assignee: Yuning Zhang >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > When running: > {code:java} > spark.read.format("avro").options(conf).load(path) > {code} > The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
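A short sketch of what the fix enables (the configuration key and path are placeholders): Hadoop settings supplied as data source options should reach the file system used to resolve the load path, as they already do for the other file-based sources.
{code:scala}
// Hadoop configuration passed as data source options; key and value are illustrative.
val hadoopConf = Map("fs.defaultFS" -> "hdfs://namenode:8020")

val events = spark.read
  .format("avro")
  .options(hadoopConf)
  .load("/data/events.avro")
{code}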
[jira] [Resolved] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter
[ https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32793. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29947 [https://github.com/apache/spark/pull/29947] > Expose assert_true in Python/Scala APIs and add error message parameter > --- > > Key: SPARK-32793 > URL: https://issues.apache.org/jira/browse/SPARK-32793 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karen Feng >Assignee: Karen Feng >Priority: Minor > Fix For: 3.1.0 > > > # Add RAISEERROR() (or RAISE_ERROR()) to the API > # Add Scala/Python/R version of API for ASSERT_TRUE() > # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which > the `message` parameter is only lazily evaluated when the condition is not > true > # Change the implementation of ASSERT_TRUE() to be rewritten during > optimization to IF() instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter
[ https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32793: Assignee: Karen Feng > Expose assert_true in Python/Scala APIs and add error message parameter > --- > > Key: SPARK-32793 > URL: https://issues.apache.org/jira/browse/SPARK-32793 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karen Feng >Assignee: Karen Feng >Priority: Minor > > # Add RAISEERROR() (or RAISE_ERROR()) to the API > # Add Scala/Python/R version of API for ASSERT_TRUE() > # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which > the `message` parameter is only lazily evaluated when the condition is not > true > # Change the implementation of ASSERT_TRUE() to be rewritten during > optimization to IF() instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
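A hedged sketch of the resulting Scala API, assuming the functions land in org.apache.spark.sql.functions with these shapes in 3.1.0:
{code:scala}
import org.apache.spark.sql.functions.{assert_true, raise_error, col, lit}

val df = spark.range(5).toDF("id")

// assert_true(cond, message): the message column is only evaluated when the
// condition is false, and the whole expression is rewritten to an IF during optimization.
df.select(assert_true(col("id") >= 0, lit("id must be non-negative"))).collect()

// raise_error(message): fails the query unconditionally with the given message.
// df.select(raise_error(lit("unexpected code path"))).collect()
{code}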
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209970#comment-17209970 ] Apache Spark commented on SPARK-20202: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29973 > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0, 3.1.0 >Reporter: Owen O'Malley >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209967#comment-17209967 ] Apache Spark commented on SPARK-20202: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29973 > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0, 3.1.0 >Reporter: Owen O'Malley >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209969#comment-17209969 ] Apache Spark commented on SPARK-20202: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29973 > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0, 3.1.0 >Reporter: Owen O'Malley >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33090) Upgrade Google Guava
[ https://issues.apache.org/jira/browse/SPARK-33090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209968#comment-17209968 ] Stephen Coy commented on SPARK-33090: - I can create a PR for this if you like... > Upgrade Google Guava > > > Key: SPARK-33090 > URL: https://issues.apache.org/jira/browse/SPARK-33090 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1 >Reporter: Stephen Coy >Priority: Major > > Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using > features from newer versions of Google Guava. > This leads to MethodNotFound exceptions, etc. in Spark builds that specify > newer versions of Hadoop. I believe this is due to the use of new methods in > com.google.common.base.Preconditions. > The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently > glued to guava-14.0.1. > I have been running a Spark cluster with the version bumped to guava-29.0-jre > without issue. > Partly due to the way Spark is built, this change is a little more > complicated than just changing the version, because newer versions of guava > have a new dependency on com.google.guava:failureaccess:1.0. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33082) Remove hive-1.2 workaround code
[ https://issues.apache.org/jira/browse/SPARK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209966#comment-17209966 ] Apache Spark commented on SPARK-33082: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29973 > Remove hive-1.2 workaround code > --- > > Key: SPARK-33082 > URL: https://issues.apache.org/jira/browse/SPARK-33082 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33082) Remove hive-1.2 workaround code
[ https://issues.apache.org/jira/browse/SPARK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209965#comment-17209965 ] Apache Spark commented on SPARK-33082: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29973 > Remove hive-1.2 workaround code > --- > > Key: SPARK-33082 > URL: https://issues.apache.org/jira/browse/SPARK-33082 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33090) Upgrade Google Guava
Stephen Coy created SPARK-33090: --- Summary: Upgrade Google Guava Key: SPARK-33090 URL: https://issues.apache.org/jira/browse/SPARK-33090 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.1 Reporter: Stephen Coy Hadoop versions newer than 3.2.0 (such as 3.2.1 and 3.3.0) have started using features from newer versions of Google Guava. This leads to MethodNotFound exceptions, etc. in Spark builds that specify newer versions of Hadoop. I believe this is due to the use of new methods in com.google.common.base.Preconditions. The above versions of Hadoop use guava-27.0-jre, whereas Spark is currently glued to guava-14.0.1. I have been running a Spark cluster with the version bumped to guava-29.0-jre without issue. Partly due to the way Spark is built, this change is a little more complicated than just changing the version, because newer versions of guava have a new dependency on com.google.guava:failureaccess:1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32960) Provide better exception on temporary view against DataFrameWriterV2
[ https://issues.apache.org/jira/browse/SPARK-32960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-32960. -- Resolution: Won't Fix Superseded by SPARK-33087 > Provide better exception on temporary view against DataFrameWriterV2 > > > Key: SPARK-32960 > URL: https://issues.apache.org/jira/browse/SPARK-32960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Minor > > DataFrameWriterV2 doesn't handle fall-back if catalog.loadTable doesn't > provide any Table instance. This ends up leading temp views to > NoSuchTableException. > It's OK to fail in such a case unless we want to resolve it later like > DataFrameWriter.insertInto, but throwing NoSuchTableException is probably > confusing, as the view is loaded via catalog.loadTable and fails the capability > check, not with NoSuchTableException. > We could check beforehand whether the table identifier refers to a temp view, and > provide a better exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
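For context, a small sketch of the confusing behaviour the description refers to (the view name is made up):
{code:scala}
// The identifier clearly resolves as a temporary view for reads...
spark.range(10).createOrReplaceTempView("events_view")
spark.table("events_view").show()

// ...but the v2 writer path cannot load it as a Table, so the call below
// surfaces NoSuchTableException rather than a clearer "cannot write to a temp view".
// spark.range(10).writeTo("events_view").append()
{code}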
[jira] [Resolved] (SPARK-33086) Provide static annotations for pyspark.resource.* modules
[ https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33086. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29969 [https://github.com/apache/spark/pull/29969] > Provide static annotations for pyspark.resource.* modules > -- > > Key: SPARK-33086 > URL: https://issues.apache.org/jira/browse/SPARK-33086 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.1.0 > > > At the point of porting, {{pyspark.resource}} had only dynamic annotations > generated using {{stubgen}}. > Since they are a part of a public API, we should provide static annotations > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33086) Provide static annotations for pyspark.resource.* modules
[ https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33086: Assignee: Maciej Szymkiewicz > Provide static annotations for pyspark.resource.* modules > -- > > Key: SPARK-33086 > URL: https://issues.apache.org/jira/browse/SPARK-33086 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > > At the point of porting, {{pyspark.resource}} had only dynamic annotations > generated using {{stubgen}}. > Since they are a part of a public API, we should provide static annotations > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31913) StackOverflowError in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-31913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31913. -- Resolution: Cannot Reproduce > StackOverflowError in FileScanRDD > - > > Key: SPARK-31913 > URL: https://issues.apache.org/jira/browse/SPARK-31913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Genmao Yu >Priority: Minor > > Reading from FileScanRDD may fail with a StackOverflowError in my > environment: > - There are a large number of empty files in the table partition. > - `spark.sql.files.maxPartitionBytes` is set to a large value: 1024MB > A quick workaround is to set `spark.sql.files.maxPartitionBytes` to a small > value, like the default 128MB. > A better way is to resolve the recursive calls in FileScanRDD. > {code} > java.lang.StackOverflowError > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.getSubject(Subject.java:297) > at > org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:648) > at > org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828) > at > org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at > org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:38) > at > org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:640) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:148) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:143) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:326) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31913) StackOverflowError in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-31913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209946#comment-17209946 ] Takeshi Yamamuro commented on SPARK-31913: -- Since this issue looks env-dependent and the PR was automatically closed, I will close this. > StackOverflowError in FileScanRDD > - > > Key: SPARK-31913 > URL: https://issues.apache.org/jira/browse/SPARK-31913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Genmao Yu >Priority: Minor > > Reading from FileScanRDD may failed with a StackOverflowError in my > environment: > - There are a mass of empty files in table partition。 > - Set `spark.sql.files.maxPartitionBytes` with a large value: 1024MB > A quick workaround is set `spark.sql.files.maxPartitionBytes` with a small > value, like default 128MB. > A better way is resolve the recursive calls in FileScanRDD. > {code} > java.lang.StackOverflowError > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.getSubject(Subject.java:297) > at > org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:648) > at > org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2828) > at > org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2818) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at > org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:38) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:640) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:148) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:143) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:326) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
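The workaround mentioned in the description, spelled out as a runtime setting (the value is simply the default of 128 MB; any sufficiently small value works):
{code:scala}
// Keep FileScanRDD partitions small so a single partition does not chain
// through a huge number of (mostly empty) files.
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L) // 128 MB
{code}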
[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209936#comment-17209936 ] Dongjoon Hyun commented on SPARK-28067: --- [~anuragmantri] For this one, this is not backported to 3.0.0, too. > Incorrect results in decimal aggregation with whole-stage code gen enabled > -- > > Key: SPARK-28067 > URL: https://issues.apache.org/jira/browse/SPARK-28067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Mark Sirek >Assignee: Sunitha Kambhampati >Priority: Critical > Labels: correctness > Fix For: 3.1.0 > > > The following test case involving a join followed by a sum aggregation > returns the wrong answer for the sum: > > {code:java} > val df = Seq( > (BigDecimal("1000"), 1), > (BigDecimal("1000"), 1), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2)).toDF("decNum", "intNum") > val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, > "intNum").agg(sum("decNum")) > scala> df2.show(40,false) > --- > sum(decNum) > --- > 4000.00 > --- > > {code} > > The result should be 104000.. > It appears a partial sum is computed for each join key, as the result > returned would be the answer for all rows matching intNum === 1. > If only the rows with intNum === 2 are included, the answer given is null: > > {code:java} > scala> val df3 = df.filter($"intNum" === lit(2)) > df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: > decimal(38,18), intNum: int] > scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, > "intNum").agg(sum("decNum")) > df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)] > scala> df4.show(40,false) > --- > sum(decNum) > --- > null > --- > > {code} > > The correct answer, 10., doesn't fit in > the DataType picked for the result, decimal(38,18), so an overflow occurs, > which Spark then converts to null. > The first example, which doesn't filter out the intNum === 1 values should > also return null, indicating overflow, but it doesn't. This may mislead the > user to think a valid sum was computed. > If whole-stage code gen is turned off: > spark.conf.set("spark.sql.codegen.wholeStage", false) > ... incorrect results are not returned because the overflow is caught as an > exception: > java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 > exceeds max precision 38 > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32978) Incorrect number of dynamic part metric
[ https://issues.apache.org/jira/browse/SPARK-32978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aoyuan Liao updated SPARK-32978: Description: How to reproduce this issue: {code:sql} create table dynamic_partition(i bigint, part bigint) using parquet partitioned by (part); insert overwrite table dynamic_partition partition(part) select id, id % 50 as part from range(1); {code} The number of dynamic part should be 50, but it is 800 on web UI. was: How to reproduce this issue: {code:sql} create table dynamic_partition(i bigint, part bigint) using parquet partitioned by (part); insert overwrite table dynamic_partition partition(part) select id, id % 50 as part from range(1); {code} The number of dynamic part should be 50, but it is 800. > Incorrect number of dynamic part metric > --- > > Key: SPARK-32978 > URL: https://issues.apache.org/jira/browse/SPARK-32978 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > How to reproduce this issue: > {code:sql} > create table dynamic_partition(i bigint, part bigint) using parquet > partitioned by (part); > insert overwrite table dynamic_partition partition(part) select id, id % 50 > as part from range(1); > {code} > The number of dynamic part should be 50, but it is 800 on web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209932#comment-17209932 ] Aoyuan Liao commented on SPARK-16859: - "spark.eventLog.logBlockUpdates.enabled=true" works for me on Spark 3.0.1 > History Server storage information is missing > - > > Key: SPARK-16859 > URL: https://issues.apache.org/jira/browse/SPARK-16859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Andrei Ivanov >Priority: Major > Labels: historyserver, newbie > > It looks like job history storage tab in history server is broken for > completed jobs since *1.6.2*. > More specifically it's broken since > [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845]. > I've fixed for my installation by effectively reverting the above patch > ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]). > IMHO, the most straightforward fix would be to implement > _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ making > sure it works from _ReplayListenerBus_. > The downside will be that it will still work incorrectly with pre patch job > histories. But then, it doesn't work since *1.6.2* anyhow. > PS: I'd really love to have this fixed eventually. But I'm pretty new to > Apache Spark and missing hands on Scala experience. So I'd prefer that it be > fixed by someone experienced with roadmap vision. If nobody volunteers I'll > try to patch myself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
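For anyone else hitting this, the flag mentioned above can be set when the application is launched; a minimal sketch, assuming the application builds its own SparkSession (the app name is a placeholder), with the caveat that logging block updates can make event logs noticeably larger:
{code:scala}
import org.apache.spark.sql.SparkSession

// Write SparkListenerBlockUpdated events to the event log so the History
// Server can rebuild the Storage tab for completed applications.
val spark = SparkSession.builder()
  .appName("storage-tab-demo") // placeholder
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.logBlockUpdates.enabled", "true")
  .getOrCreate()
{code}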
[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209923#comment-17209923 ] Anurag Mantripragada commented on SPARK-28067: -- I just checked the issue exists in branch-2.4. Since this is a `correctness` issue, should we backport it to branch-2.4? cc: [~cloud_fan], [~dongjoon] > Incorrect results in decimal aggregation with whole-stage code gen enabled > -- > > Key: SPARK-28067 > URL: https://issues.apache.org/jira/browse/SPARK-28067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Mark Sirek >Assignee: Sunitha Kambhampati >Priority: Critical > Labels: correctness > Fix For: 3.1.0 > > > The following test case involving a join followed by a sum aggregation > returns the wrong answer for the sum: > > {code:java} > val df = Seq( > (BigDecimal("1000"), 1), > (BigDecimal("1000"), 1), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2)).toDF("decNum", "intNum") > val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, > "intNum").agg(sum("decNum")) > scala> df2.show(40,false) > --- > sum(decNum) > --- > 4000.00 > --- > > {code} > > The result should be 104000.. > It appears a partial sum is computed for each join key, as the result > returned would be the answer for all rows matching intNum === 1. > If only the rows with intNum === 2 are included, the answer given is null: > > {code:java} > scala> val df3 = df.filter($"intNum" === lit(2)) > df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: > decimal(38,18), intNum: int] > scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, > "intNum").agg(sum("decNum")) > df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)] > scala> df4.show(40,false) > --- > sum(decNum) > --- > null > --- > > {code} > > The correct answer, 10., doesn't fit in > the DataType picked for the result, decimal(38,18), so an overflow occurs, > which Spark then converts to null. > The first example, which doesn't filter out the intNum === 1 values should > also return null, indicating overflow, but it doesn't. This may mislead the > user to think a valid sum was computed. > If whole-stage code gen is turned off: > spark.conf.set("spark.sql.codegen.wholeStage", false) > ... incorrect results are not returned because the overflow is caught as an > exception: > java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 > exceeds max precision 38 > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
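A quick way to double-check the behavior on a given branch is to re-run the aggregation with whole-stage codegen disabled, where the overflow surfaces as an error instead of a wrong value; a sketch, reusing the {{df}} defined in the description:
{code:scala}
import org.apache.spark.sql.functions.sum

// With whole-stage codegen off, the decimal overflow in the partial sums is
// detected instead of silently producing a wrong total for the full dataset.
spark.conf.set("spark.sql.codegen.wholeStage", false)

val checked = df.withColumnRenamed("decNum", "decNum2")
  .join(df, "intNum")
  .agg(sum("decNum"))
checked.show(40, false) // expect the decimal precision error described above, not a wrong sum
{code}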
[jira] [Commented] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)
[ https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209919#comment-17209919 ] Apache Spark commented on SPARK-33081: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/29972 > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (DB2 dialect) > -- > > Key: SPARK-33081 > URL: https://issues.apache.org/jira/browse/SPARK-33081 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Override the default SQL strings for: > * ALTER TABLE UPDATE COLUMN TYPE > * ALTER TABLE UPDATE COLUMN NULLABILITY > in the following DB2 JDBC dialect according to official documentation. > Write DB2 integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)
[ https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33081: Assignee: (was: Apache Spark) > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (DB2 dialect) > -- > > Key: SPARK-33081 > URL: https://issues.apache.org/jira/browse/SPARK-33081 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Override the default SQL strings for: > * ALTER TABLE UPDATE COLUMN TYPE > * ALTER TABLE UPDATE COLUMN NULLABILITY > in the following DB2 JDBC dialect according to official documentation. > Write DB2 integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)
[ https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33081: Assignee: Apache Spark > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (DB2 dialect) > -- > > Key: SPARK-33081 > URL: https://issues.apache.org/jira/browse/SPARK-33081 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Override the default SQL strings for: > * ALTER TABLE UPDATE COLUMN TYPE > * ALTER TABLE UPDATE COLUMN NULLABILITY > in the following DB2 JDBC dialect according to official documentation. > Write DB2 integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
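As a rough illustration of the statements the DB2 dialect needs to emit (the helper names below are made up for the sketch and are not the actual JdbcDialect API), DB2 uses {{ALTER COLUMN ... SET DATA TYPE}} and {{SET NOT NULL}} / {{DROP NOT NULL}} rather than the generic defaults:
{code:scala}
// Illustrative string builders only; the real change overrides the
// corresponding methods of Spark's DB2 dialect.
def db2UpdateColumnType(table: String, column: String, newType: String): String =
  s"ALTER TABLE $table ALTER COLUMN $column SET DATA TYPE $newType"

def db2UpdateColumnNullability(table: String, column: String, nullable: Boolean): String = {
  val clause = if (nullable) "DROP NOT NULL" else "SET NOT NULL"
  s"ALTER TABLE $table ALTER COLUMN $column $clause"
}
{code}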
[jira] [Assigned] (SPARK-21708) use sbt 1.x
[ https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-21708: - Assignee: Denis Pyshev > use sbt 1.x > --- > > Key: SPARK-21708 > URL: https://issues.apache.org/jira/browse/SPARK-21708 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: PJ Fanning >Assignee: Denis Pyshev >Priority: Minor > Fix For: 3.1.0 > > > Should improve sbt build times. > http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html > According to https://github.com/sbt/sbt/issues/3424, we will need to change > the HTTP location where we get the sbt-launch jar. > Other related issues: > SPARK-14401 > https://github.com/typesafehub/sbteclipse/issues/343 > https://github.com/jrudolph/sbt-dependency-graph/issues/134 > https://github.com/AlpineNow/junit_xml_listener/issues/6 > https://github.com/spray/sbt-revolver/issues/62 > https://github.com/ihji/sbt-antlr4/issues/14 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21708) use sbt 1.x
[ https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-21708. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29286 [https://github.com/apache/spark/pull/29286] > use sbt 1.x > --- > > Key: SPARK-21708 > URL: https://issues.apache.org/jira/browse/SPARK-21708 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: PJ Fanning >Priority: Minor > Fix For: 3.1.0 > > > Should improve sbt build times. > http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html > According to https://github.com/sbt/sbt/issues/3424, we will need to change > the HTTP location where we get the sbt-launch jar. > Other related issues: > SPARK-14401 > https://github.com/typesafehub/sbteclipse/issues/343 > https://github.com/jrudolph/sbt-dependency-graph/issues/134 > https://github.com/AlpineNow/junit_xml_listener/issues/6 > https://github.com/spray/sbt-revolver/issues/62 > https://github.com/ihji/sbt-antlr4/issues/14 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209876#comment-17209876 ] Apache Spark commented on SPARK-32001: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29968 > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > > Adding embedded provider to all the possible databases would generate high > maintenance cost on Spark side. > Instead an API can be introduced which would allow to implement further > providers independently. > One important requirement what I suggest is: JDBC connection providers must > be loaded independently just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209875#comment-17209875 ] Apache Spark commented on SPARK-32001: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29968 > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > > Adding embedded provider to all the possible databases would generate high > maintenance cost on Spark side. > Instead an API can be introduced which would allow to implement further > providers independently. > One important requirement what I suggest is: JDBC connection providers must > be loaded independently just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
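A rough sketch of the kind of provider the proposal describes, discovered via ServiceLoader in the same spirit as delegation token providers; the trait and method names here are illustrative, not the final API:
{code:scala}
import java.sql.{Connection, Driver}

// Illustrative only: a pluggable provider that knows how to open a JDBC
// connection for one particular database, performing the Kerberos login itself.
trait KerberosConnectionProviderSketch {
  // True if this provider understands the given driver and JDBC options.
  def canHandle(driver: Driver, options: Map[String, String]): Boolean

  // Open the connection, e.g. inside a Kerberos-authenticated doAs block.
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}
{code}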
[jira] [Commented] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209873#comment-17209873 ] Apache Spark commented on SPARK-33089: -- User 'yuningzh-db' has created a pull request for this issue: https://github.com/apache/spark/pull/29971 > avro format does not propagate Hadoop config from DS options to underlying > HDFS file system > --- > > Key: SPARK-33089 > URL: https://issues.apache.org/jira/browse/SPARK-33089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuning Zhang >Priority: Major > > When running: > {code:java} > spark.read.format("avro").options(conf).load(path) > {code} > The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33089: Assignee: (was: Apache Spark) > avro format does not propagate Hadoop config from DS options to underlying > HDFS file system > --- > > Key: SPARK-33089 > URL: https://issues.apache.org/jira/browse/SPARK-33089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuning Zhang >Priority: Major > > When running: > {code:java} > spark.read.format("avro").options(conf).load(path) > {code} > The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33089: Assignee: Apache Spark > avro format does not propagate Hadoop config from DS options to underlying > HDFS file system > --- > > Key: SPARK-33089 > URL: https://issues.apache.org/jira/browse/SPARK-33089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuning Zhang >Assignee: Apache Spark >Priority: Major > > When running: > {code:java} > spark.read.format("avro").options(conf).load(path) > {code} > The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuning Zhang updated SPARK-33089: - Description: When running: {code:java} spark.read.format("avro").options(conf).load(path) {code} The underlying file system will not receive the `conf` options. was: When running: {code:java} spark.read.format("avro").options(conf).load(path) {code} The > avro format does not propagate Hadoop config from DS options to underlying > HDFS file system > --- > > Key: SPARK-33089 > URL: https://issues.apache.org/jira/browse/SPARK-33089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuning Zhang >Priority: Major > > When running: > {code:java} > spark.read.format("avro").options(conf).load(path) > {code} > The underlying file system will not receive the `conf` options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-33089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuning Zhang updated SPARK-33089: - Description: When running: {code:java} spark.read.format("avro").options(conf).load(path) {code} The > avro format does not propagate Hadoop config from DS options to underlying > HDFS file system > --- > > Key: SPARK-33089 > URL: https://issues.apache.org/jira/browse/SPARK-33089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuning Zhang >Priority: Major > > When running: > {code:java} > spark.read.format("avro").options(conf).load(path) > {code} > The -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33089) avro format does not propagate Hadoop config from DS options to underlying HDFS file system
Yuning Zhang created SPARK-33089: Summary: avro format does not propagate Hadoop config from DS options to underlying HDFS file system Key: SPARK-33089 URL: https://issues.apache.org/jira/browse/SPARK-33089 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Yuning Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
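A slightly more concrete version of the reproduction, assuming an active SparkSession named {{spark}}; the option key and path are placeholders, and the expectation is that Hadoop settings passed as data source options reach the FileSystem that resolves the path:
{code:scala}
// Placeholder Hadoop setting and path, only to show the expected flow: the
// options below should be merged into the Hadoop Configuration used by the
// underlying FileSystem when the avro files are listed and read.
val conf = Map("fs.defaultFS" -> "hdfs://namenode:8020") // placeholder value
val path = "/tmp/events.avro"                            // placeholder path

val df = spark.read.format("avro").options(conf).load(path)
df.printSchema()
{code}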
[jira] [Commented] (SPARK-33019) Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
[ https://issues.apache.org/jira/browse/SPARK-33019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209864#comment-17209864 ] Dongjoon Hyun commented on SPARK-33019: --- [~ste...@apache.org]. The user will still use the v2 committer if they already set the conf explicitly. In addition, the user can still use the v2 committer if they want. > you can still use v2 committer We only prevent users from blindly expecting the same behavior during the migration from Apache Spark 3.0 to Apache Spark 3.1. > Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default > - > > Key: SPARK-33019 > URL: https://issues.apache.org/jira/browse/SPARK-33019 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Labels: correctness > Fix For: 3.0.2, 3.1.0 > > > By default, Spark should use a safe file output committer algorithm to avoid > MAPREDUCE-7282. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
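For jobs that have weighed the trade-off and still want the v2 algorithm, the setting remains a one-line opt-in; a minimal sketch (the application name is a placeholder):
{code:scala}
import org.apache.spark.sql.SparkSession

// Explicitly opt back into the v2 file output committer (the safer v1 is the
// default discussed above); only do this if the job can tolerate the
// correctness risks described in MAPREDUCE-7282.
val spark = SparkSession.builder()
  .appName("explicit-v2-committer") // placeholder
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
{code}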
[jira] [Closed] (SPARK-33042) Add a test case to ensure changes to spark.sql.optimizer.maxIterations take effect at runtime
[ https://issues.apache.org/jira/browse/SPARK-33042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuning Zhang closed SPARK-33042. > Add a test case to ensure changes to spark.sql.optimizer.maxIterations take > effect at runtime > - > > Key: SPARK-33042 > URL: https://issues.apache.org/jira/browse/SPARK-33042 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuning Zhang >Assignee: Yuning Zhang >Priority: Major > Fix For: 3.1.0 > > > **Add a test case to ensure changes to `spark.sql.optimizer.maxIterations` > take effect at runtime. > Currently, there is only one related test case: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156] > However, this test case only checks the value of the conf can be changed at > runtime. It does not check the updated value is actually used by the > Optimizer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
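A rough sketch of the shape of such a check as it would sit in Spark's own test code (it touches internal APIs such as {{sessionState}}; the exact test added by the PR may differ):
{code:scala}
import org.apache.spark.sql.internal.SQLConf

// Change the optimizer's iteration budget at runtime and verify the value the
// optimizer will actually consult, not just the raw conf entry.
spark.conf.set(SQLConf.OPTIMIZER_MAX_ITERATIONS.key, "10")
assert(spark.sessionState.conf.optimizerMaxIterations == 10)

spark.conf.set(SQLConf.OPTIMIZER_MAX_ITERATIONS.key, "100")
assert(spark.sessionState.conf.optimizerMaxIterations == 100)
{code}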
[jira] [Updated] (SPARK-33074) Classify dialect exceptions in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-33074: --- Parent: SPARK-24907 Issue Type: Sub-task (was: Improvement) > Classify dialect exceptions in JDBC v2 Table Catalog > > > Key: SPARK-33074 > URL: https://issues.apache.org/jira/browse/SPARK-33074 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The current implementation of v2.jdbc.JDBCTableCatalog don't care of > exceptions defined by org.apache.spark.sql.connector.catalog.TableCatalog at > all like > * NoSuchNamespaceException > * NoSuchTableException > * TableAlreadyExistsException > it either throw dialect exception or generic exception AnalysisException. > Since we split forming of dialect specific statements and their execution, we > should extend dialect APIs and ask them how to convert their exceptions to > TableCatalog exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33082) Remove hive-1.2 workaround code
[ https://issues.apache.org/jira/browse/SPARK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33082: - Assignee: Dongjoon Hyun > Remove hive-1.2 workaround code > --- > > Key: SPARK-33082 > URL: https://issues.apache.org/jira/browse/SPARK-33082 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33082) Remove hive-1.2 workaround code
[ https://issues.apache.org/jira/browse/SPARK-33082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33082. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29961 [https://github.com/apache/spark/pull/29961] > Remove hive-1.2 workaround code > --- > > Key: SPARK-33082 > URL: https://issues.apache.org/jira/browse/SPARK-33082 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33088) Enhance ExecutorPlugin API to include methods for task start and end events
[ https://issues.apache.org/jira/browse/SPARK-33088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Samuel Souza updated SPARK-33088: - Description: On [SPARK-24918|https://issues.apache.org/jira/browse/SPARK-24918]'s [SIPP|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/view#|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit#], it was raised to potentially add methods to ExecutorPlugin interface on task start and end: {quote}The basic interface can just be a marker trait, as that allows a plugin to monitor general characteristics of the JVM (eg. monitor memory or take thread dumps). Optionally, we could include methods for task start and end events. This would allow more control on monitoring – eg., you could start polling thread dumps only if there was a task from a particular stage that had been taking too long. But anything task related is a bit trickier to decide the right api. Should the task end event also get the failure reason? Should those events get called in the same thread as the task runner, or in another thread? {quote} The ask is to add exactly that. I've put up a draft PR [in our fork of spark|https://github.com/palantir/spark/pull/713] and I'm happy to push it upstream. Also happy to receive comments on what's the right interface to expose - not opinionated on that front, tried to expose the simplest interface for now. The main reason for this ask is to propagate tracing information from the driver to the executors ([SPARK-21962|https://issues.apache.org/jira/browse/SPARK-21962] has some context). On [HADOOP-15566|https://issues.apache.org/jira/browse/HADOOP-15566] I see we're discussing how to add tracing to the Apache ecosystem, but my problem is slightly different: I want to use this interface to propagate tracing information to my framework of choice. If the Hadoop issue gets solved we'll have a framework to communicate tracing information inside the Apache ecosystem, but it's highly unlikely that all Spark users will use the same common framework. Therefore we should still provide plugin interfaces where the tracing information can be propagated appropriately. To give more color, in our case the tracing information is [stored in a thread local|https://github.com/palantir/tracing-java/blob/4.9.0/tracing/src/main/java/com/palantir/tracing/Tracer.java#L61], therefore it needs to be set in the same thread which is executing the task. [*] While our framework is specific, I imagine such an interface could be useful in general. Happy to hear your thoughts about it. [*] Something I did not mention was how to propagate the tracing information from the driver to the executors. For that I intend to use 1. the driver's localProperties, which 2. will be eventually propagated to the executors' TaskContext, which 3. I'll be able to access from the methods above. was: On https://issues.apache.org/jira/browse/SPARK-24918's [SIPP|[https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/view#|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit#]], it was raised to potentially add methods to ExecutorPlugin interface on task start and end: {quote}The basic interface can just be a marker trait, as that allows a plugin to monitor general characteristics of the JVM (eg. monitor memory or take thread dumps). Optionally, we could include methods for task start and end events. 
This would allow more control on monitoring -- eg., you could start polling thread dumps only if there was a task from a particular stage that had been taking too long. But anything task related is a bit trickier to decide the right api. Should the task end event also get the failure reason? Should those events get called in the same thread as the task runner, or in another thread? {quote} The ask is to add exactly that. I've put up a draft PR in our fork of spark [here| [https://github.com/palantir/spark/pull/713]] and I'm happy to push it upstream. Also happy to receive comments on what's the right interface to expose - not opinionated on that front, tried to expose the simplest interface for now. The main reason for this ask is to propagate tracing information from the driver to the executors (https://issues.apache.org/jira/browse/SPARK-21962 has some context). On https://issues.apache.org/jira/browse/HADOOP-15566 I see we're discussing how to add tracing to the Apache ecosystem, but my problem is slightly different: I want to use this interface to propagate tracing information to my framework of choice. If the Hadoop issue gets solved we'll have a framework to communicate tracing information inside the Apache ecosystem, but it's highly unlikely that all Spark users will use the same common framework. Therefore we should still provide plugin interfaces where the tracing inform
[jira] [Created] (SPARK-33088) Enhance ExecutorPlugin API to include methods for task start and end events
Samuel Souza created SPARK-33088: Summary: Enhance ExecutorPlugin API to include methods for task start and end events Key: SPARK-33088 URL: https://issues.apache.org/jira/browse/SPARK-33088 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.1.0 Reporter: Samuel Souza On https://issues.apache.org/jira/browse/SPARK-24918's [SIPP|[https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/view#|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit#]], it was raised to potentially add methods to ExecutorPlugin interface on task start and end: {quote}The basic interface can just be a marker trait, as that allows a plugin to monitor general characteristics of the JVM (eg. monitor memory or take thread dumps). Optionally, we could include methods for task start and end events. This would allow more control on monitoring -- eg., you could start polling thread dumps only if there was a task from a particular stage that had been taking too long. But anything task related is a bit trickier to decide the right api. Should the task end event also get the failure reason? Should those events get called in the same thread as the task runner, or in another thread? {quote} The ask is to add exactly that. I've put up a draft PR in our fork of spark [here| [https://github.com/palantir/spark/pull/713]] and I'm happy to push it upstream. Also happy to receive comments on what's the right interface to expose - not opinionated on that front, tried to expose the simplest interface for now. The main reason for this ask is to propagate tracing information from the driver to the executors (https://issues.apache.org/jira/browse/SPARK-21962 has some context). On https://issues.apache.org/jira/browse/HADOOP-15566 I see we're discussing how to add tracing to the Apache ecosystem, but my problem is slightly different: I want to use this interface to propagate tracing information to my framework of choice. If the Hadoop issue gets solved we'll have a framework to communicate tracing information inside the Apache ecosystem, but it's highly unlikely that all Spark users will use the same common framework. Therefore we should still provide plugin interfaces where the tracing information can be propagated appropriately. To give more color, in our case the tracing information is [stored in a thread local|[https://github.com/palantir/tracing-java/blob/develop/tracing/src/main/java/com/palantir/tracing/Tracer.java#L61]], therefore it needs to be set in the same thread which is executing the task. [*] While our framework is specific, I imagine such an interface could be useful in general. Happy to hear your thoughts about it. [*] Something I did not mention was how to propagate the tracing information from the driver to the executors. For that I intend to use 1. the driver's localProperties, which 2. will be eventually propagated to the executors' TaskContext, which 3. I'll be able to access from the methods above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
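A rough sketch of how such task-scoped hooks could serve the tracing use case above; the class, hook signatures, and the "trace.id" property are all illustrative, not an existing Spark or tracing-java API:
{code:scala}
import java.util.Properties

// Illustrative plugin body: pull a trace id out of the task's local properties
// (set on the driver) and install it in a thread local on the task runner
// thread, so the tracing framework sees it while user code executes.
class TracingTaskHooksSketch {
  private val currentTraceId = new ThreadLocal[String]

  // Imagined to be called by the executor on the task runner thread.
  def onTaskStart(localProperties: Properties): Unit = {
    Option(localProperties.getProperty("trace.id")) // assumed property name
      .foreach(currentTraceId.set)
  }

  // Imagined to be called when the task finishes, regardless of outcome.
  def onTaskEnd(): Unit = currentTraceId.remove()
}
{code}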
[jira] [Commented] (SPARK-33019) Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
[ https://issues.apache.org/jira/browse/SPARK-33019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209766#comment-17209766 ] Steve Loughran commented on SPARK-33019: Related to this, I'm proposing we add a method which will let the MR engine and spark driver work out if a committer can be recovered from -and choose how to react if it says "no" - fail or warn + commit another attempt That way if you want full due diligence you can still use v2 committer, (or EMR committer), but get the ability to make failures during the commit phase something which triggers a failure. Most of the time, it won't. > Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default > - > > Key: SPARK-33019 > URL: https://issues.apache.org/jira/browse/SPARK-33019 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Labels: correctness > Fix For: 3.0.2, 3.1.0 > > > By default, Spark should use a safe file output committer algorithm to avoid > MAPREDUCE-7282. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27484) Create the streaming writing logical plan node before query is analyzed
[ https://issues.apache.org/jira/browse/SPARK-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209755#comment-17209755 ] Dongjoon Hyun commented on SPARK-27484: --- It seems that [~kabhwan] also hits this issue and documents it in his SPARK-32896 PR like the following. - https://github.com/apache/spark/pull/29767/files#diff-d35e8fce09686073f81de598ed657de7R314-R319 {code} // Currently we don't create a logical streaming writer node in logical plan, so cannot rely // on analyzer to resolve it. Directly lookup only for temp view to provide clearer message. // TODO (SPARK-27484): we should add the writing node before the plan is analyzed. if (df.sparkSession.sessionState.catalog.isTempView(originalMultipartIdentifier)) { throw new AnalysisException(s"Temporary view $tableName doesn't support streaming write") } {code} > Create the streaming writing logical plan node before query is analyzed > --- > > Key: SPARK-27484 > URL: https://issues.apache.org/jira/browse/SPARK-27484 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27484) Create the streaming writing logical plan node before query is analyzed
[ https://issues.apache.org/jira/browse/SPARK-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209755#comment-17209755 ] Dongjoon Hyun edited comment on SPARK-27484 at 10/7/20, 6:17 PM: - It seems that [~kabhwan] also hits this issue and documents it in his SPARK-32896 PR like the following. - https://github.com/apache/spark/pull/29767/files#diff-d35e8fce09686073f81de598ed657de7R314-R319 {code} // Currently we don't create a logical streaming writer node in logical plan, so cannot rely // on analyzer to resolve it. Directly lookup only for temp view to provide clearer message. // TODO (SPARK-27484): we should add the writing node before the plan is analyzed. if (df.sparkSession.sessionState.catalog.isTempView(originalMultipartIdentifier)) { throw new AnalysisException(s"Temporary view $tableName doesn't support streaming write") } {code} was (Author: dongjoon): It seems that [~kabhwan] also hits this issue and documents it in his SPARK-32896 PR like the following. - https://github.com/apache/spark/pull/29767/files#diff-d35e8fce09686073f81de598ed657de7R314-R319 {code} // Currently we don't create a logical streaming writer node in logical plan, so cannot rely // on analyzer to resolve it. Directly lookup only for temp view to provide clearer message. // TODO (SPARK-27484): we should add the writing node before the plan is analyzed. if (df.sparkSession.sessionState.catalog.isTempView(originalMultipartIdentifier)) { throw new AnalysisException(s"Temporary view $tableName doesn't support streaming write") } {code} > Create the streaming writing logical plan node before query is analyzed > --- > > Key: SPARK-27484 > URL: https://issues.apache.org/jira/browse/SPARK-27484 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33087) DataFrameWriterV2 should delegate table resolution to the analyzer
[ https://issues.apache.org/jira/browse/SPARK-33087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209712#comment-17209712 ] Apache Spark commented on SPARK-33087: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29970 > DataFrameWriterV2 should delegate table resolution to the analyzer > -- > > Key: SPARK-33087 > URL: https://issues.apache.org/jira/browse/SPARK-33087 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33087) DataFrameWriterV2 should delegate table resolution to the analyzer
[ https://issues.apache.org/jira/browse/SPARK-33087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33087: Assignee: Apache Spark (was: Wenchen Fan) > DataFrameWriterV2 should delegate table resolution to the analyzer > -- > > Key: SPARK-33087 > URL: https://issues.apache.org/jira/browse/SPARK-33087 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33087) DataFrameWriterV2 should delegate table resolution to the analyzer
[ https://issues.apache.org/jira/browse/SPARK-33087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33087: Assignee: Wenchen Fan (was: Apache Spark) > DataFrameWriterV2 should delegate table resolution to the analyzer > -- > > Key: SPARK-33087 > URL: https://issues.apache.org/jira/browse/SPARK-33087 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33087) DataFrameWriterV2 should delegate table resolution to the analyzer
Wenchen Fan created SPARK-33087: --- Summary: DataFrameWriterV2 should delegate table resolution to the analyzer Key: SPARK-33087 URL: https://issues.apache.org/jira/browse/SPARK-33087 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
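For context, a minimal usage sketch of the DataFrameWriterV2 entry point affected, with placeholder catalog and table names, assuming a DataFrame {{df}} and a registered v2 catalog {{testcat}}; the multi-part table name is what would now be resolved by the analyzer rather than eagerly by the writer:
{code:scala}
// "testcat.ns.tbl" and "testcat.ns.tbl_copy" are placeholder identifiers for v2 tables.
df.writeTo("testcat.ns.tbl").append()                                // append to an existing table
df.writeTo("testcat.ns.tbl_copy").using("parquet").createOrReplace() // create or replace a table
{code}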
[jira] [Assigned] (SPARK-33005) Kubernetes GA Preparation
[ https://issues.apache.org/jira/browse/SPARK-33005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33005: - Assignee: Dongjoon Hyun > Kubernetes GA Preparation > - > > Key: SPARK-33005 > URL: https://issues.apache.org/jira/browse/SPARK-33005 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33005) Kubernetes GA Preparation
[ https://issues.apache.org/jira/browse/SPARK-33005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33005: -- Target Version/s: 3.1.0 > Kubernetes GA Preparation > - > > Key: SPARK-33005 > URL: https://issues.apache.org/jira/browse/SPARK-33005 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32067) Use unique ConfigMap name for executor pod template
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32067: - Assignee: Stijn De Haes > Use unique ConfigMap name for executor pod template > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.1, 3.1.0 >Reporter: James Yu >Assignee: Stijn De Haes >Priority: Major > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, with app2 launching while app1 is still in the middle of > ramping up all its executor pods. The unwanted result is that some launched > executor pods of app1 end up having app2's executor pod template applied to > them. > The root cause appears to be that app1's podspec-configmap got overwritten by > app2 during the overlapping launching periods because both apps use the same > ConfigMap (name). This causes some app1's executor pods being ramped up after > app2 is launched to be inadvertently launched with the app2's pod template. > The issue can be seen as follows: > First, after submitting app1, you get these configmaps: > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 9m46s > default podspec-configmap 1 12m{code} > Then submit app2 while app1 is still ramping up its executors. The > podspec-confimap is modified by app2. > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default podspec-configmap 1 13m57s{code} > > PROPOSED SOLUTION: > Properly prefix the podspec-configmap for each submitted app, ideally the > same way as the driver configmap: > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app1--podspec-configmap1 13m57s > default app2--driver-conf-map 1 10s > default app2--podspec-configmap1 3m{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32067) Use unique ConfigMap name for executor pod template
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32067. --- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 29934 [https://github.com/apache/spark/pull/29934] > Use unique ConfigMap name for executor pod template > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.1, 3.1.0 >Reporter: James Yu >Assignee: Stijn De Haes >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, with app2 launching while app1 is still in the middle of > ramping up all its executor pods. The unwanted result is that some launched > executor pods of app1 end up having app2's executor pod template applied to > them. > The root cause appears to be that app1's podspec-configmap got overwritten by > app2 during the overlapping launching periods because both apps use the same > ConfigMap (name). This causes some app1's executor pods being ramped up after > app2 is launched to be inadvertently launched with the app2's pod template. > The issue can be seen as follows: > First, after submitting app1, you get these configmaps: > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 9m46s > default podspec-configmap 1 12m{code} > Then submit app2 while app1 is still ramping up its executors. The > podspec-confimap is modified by app2. > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default podspec-configmap 1 13m57s{code} > > PROPOSED SOLUTION: > Properly prefix the podspec-configmap for each submitted app, ideally the > same way as the driver configmap: > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app1--podspec-configmap1 13m57s > default app2--driver-conf-map 1 10s > default app2--podspec-configmap1 3m{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-32714) Port pyspark-stubs
[ https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32714: - Comment: was deleted (was: User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29969) > Port pyspark-stubs > -- > > Key: SPARK-32714 > URL: https://issues.apache.org/jira/browse/SPARK-32714 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Port https://github.com/zero323/pyspark-stubs into PySpark. This was being > discussed in dev mailing list. See also > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-32714) Port pyspark-stubs
[ https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32714: - Comment: was deleted (was: User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29969) > Port pyspark-stubs > -- > > Key: SPARK-32714 > URL: https://issues.apache.org/jira/browse/SPARK-32714 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Port https://github.com/zero323/pyspark-stubs into PySpark. This was being > discussed in dev mailing list. See also > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-32714) Port pyspark-stubs
[ https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32714: - Comment: was deleted (was: https://github.com/apache/spark/pull/29591) > Port pyspark-stubs > -- > > Key: SPARK-32714 > URL: https://issues.apache.org/jira/browse/SPARK-32714 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Port https://github.com/zero323/pyspark-stubs into PySpark. This was being > discussed in dev mailing list. See also > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-32714) Port pyspark-stubs
[ https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32714: - Comment: was deleted (was: User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29969) > Port pyspark-stubs > -- > > Key: SPARK-32714 > URL: https://issues.apache.org/jira/browse/SPARK-32714 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Port https://github.com/zero323/pyspark-stubs into PySpark. This was being > discussed in dev mailing list. See also > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33086) Provide static annotatiions for pyspark.resource.* modules
[ https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209640#comment-17209640 ] Apache Spark commented on SPARK-33086: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29969 > Provide static annotatiions for pyspark.resource.* modules > -- > > Key: SPARK-33086 > URL: https://issues.apache.org/jira/browse/SPARK-33086 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > At the point of port {{pyspark.resource}} had only dynamic annotations > generated using {{stubgen}}. > Since they are a part of a public API, we should provide static annotations > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33086) Provide static annotatiions for pyspark.resource.* modules
[ https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33086: Assignee: Apache Spark > Provide static annotatiions for pyspark.resource.* modules > -- > > Key: SPARK-33086 > URL: https://issues.apache.org/jira/browse/SPARK-33086 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > At the point of port {{pyspark.resource}} had only dynamic annotations > generated using {{stubgen}}. > Since they are a part of a public API, we should provide static annotations > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33086) Provide static annotations for pyspark.resource.* modules
[ https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209638#comment-17209638 ] Apache Spark commented on SPARK-33086: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29969 > Provide static annotations for pyspark.resource.* modules > -- > > Key: SPARK-33086 > URL: https://issues.apache.org/jira/browse/SPARK-33086 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > At the point of the port, {{pyspark.resource}} had only dynamic annotations > generated using {{stubgen}}. > Since they are a part of a public API, we should provide static annotations > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33086) Provide static annotations for pyspark.resource.* modules
[ https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33086: Assignee: (was: Apache Spark) > Provide static annotations for pyspark.resource.* modules > -- > > Key: SPARK-33086 > URL: https://issues.apache.org/jira/browse/SPARK-33086 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > At the point of the port, {{pyspark.resource}} had only dynamic annotations > generated using {{stubgen}}. > Since they are a part of a public API, we should provide static annotations > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33086) Provide static annotations for pyspark.resource.* modules
[ https://issues.apache.org/jira/browse/SPARK-33086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33086: - Parent: SPARK-32681 Issue Type: Sub-task (was: Improvement) > Provide static annotations for pyspark.resource.* modules > -- > > Key: SPARK-33086 > URL: https://issues.apache.org/jira/browse/SPARK-33086 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > At the point of the port, {{pyspark.resource}} had only dynamic annotations > generated using {{stubgen}}. > Since they are a part of a public API, we should provide static annotations > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33086) Provide static annotations for pyspark.resource.* modules
Maciej Szymkiewicz created SPARK-33086: -- Summary: Provide static annotations for pyspark.resource.* modules Key: SPARK-33086 URL: https://issues.apache.org/jira/browse/SPARK-33086 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.0 Reporter: Maciej Szymkiewicz At the point of the port, {{pyspark.resource}} had only dynamic annotations generated using {{stubgen}}. Since they are a part of a public API, we should provide static annotations instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32714) Port pyspark-stubs
[ https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209635#comment-17209635 ] Apache Spark commented on SPARK-32714: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29969 > Port pyspark-stubs > -- > > Key: SPARK-32714 > URL: https://issues.apache.org/jira/browse/SPARK-32714 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Port https://github.com/zero323/pyspark-stubs into PySpark. This was being > discussed in dev mailing list. See also > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32714) Port pyspark-stubs
[ https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209634#comment-17209634 ] Apache Spark commented on SPARK-32714: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29969 > Port pyspark-stubs > -- > > Key: SPARK-32714 > URL: https://issues.apache.org/jira/browse/SPARK-32714 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Port https://github.com/zero323/pyspark-stubs into PySpark. This was being > discussed in dev mailing list. See also > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32982) Remove hive-1.2 profiles in PIP installation option
[ https://issues.apache.org/jira/browse/SPARK-32982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209633#comment-17209633 ] Hyukjin Kwon commented on SPARK-32982: -- https://github.com/apache/spark/pull/29878 was a follow-up. > Remove hive-1.2 profiles in PIP installation option > --- > > Key: SPARK-32982 > URL: https://issues.apache.org/jira/browse/SPARK-32982 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Hive 1.2 is a fork that we should remove. It's best not to expose this > distribution from pip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32714) Port pyspark-stubs
[ https://issues.apache.org/jira/browse/SPARK-32714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209628#comment-17209628 ] Apache Spark commented on SPARK-32714: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29969 > Port pyspark-stubs > -- > > Key: SPARK-32714 > URL: https://issues.apache.org/jira/browse/SPARK-32714 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Port https://github.com/zero323/pyspark-stubs into PySpark. This was being > discussed in dev mailing list. See also > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26499) JdbcUtils.makeGetter does not handle ByteType
[ https://issues.apache.org/jira/browse/SPARK-26499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209612#comment-17209612 ] Apache Spark commented on SPARK-26499: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29968 > JdbcUtils.makeGetter does not handle ByteType > - > > Key: SPARK-26499 > URL: https://issues.apache.org/jira/browse/SPARK-26499 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas D'Silva >Assignee: Thomas D'Silva >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I am trying to use the DataSource V2 API to read from a JDBC source. While > using {{JdbcUtils.resultSetToSparkInternalRows}} to create an internal row > from a ResultSet that has a column of type TINYINT I ran into the following > exception > {code:java} > java.lang.IllegalArgumentException: Unsupported type tinyint > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter(JdbcUtils.scala:502) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters(JdbcUtils.scala:379) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.(JdbcUtils.scala:340) > {code} > This happens because ByteType is not handled in {{JdbcUtils.makeGetter}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
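A minimal reproduction sketch of the failure mode above, written against the public Python reader API. The connection details are placeholders, and forcing the column to Spark's ByteType via {{customSchema}} is an assumption about how to reach the affected code path.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Before the fix, materializing rows for a column mapped to ByteType failed in
# JdbcUtils.makeGetter with "Unsupported type tinyint".
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/testdb")   # placeholder URL
      .option("dbtable", "table_with_tinyint_column")        # placeholder table
      .option("user", "user")                                # placeholder credentials
      .option("password", "password")
      .option("customSchema", "tiny_col TINYINT")            # map the column to ByteType
      .load())
df.show()
{code}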
[jira] [Created] (SPARK-33085) "Master removed our application" error leads to FAILED driver status instead of KILLED driver status
t oo created SPARK-33085: Summary: "Master removed our application" error leads to FAILED driver status instead of KILLED driver status Key: SPARK-33085 URL: https://issues.apache.org/jira/browse/SPARK-33085 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 2.4.6 Reporter: t oo driver-20200930160855-0316 exited with status FAILED I am using Spark Standalone scheduler with spot ec2 workers. I confirmed that myip.87 EC2 instance was terminated at 2020-09-30 16:16 *I would expect the overall driver status to be KILLED but instead it was FAILED*, my goal is to interpret FAILED status as 'don't rerun as non-transient error faced' but KILLED/ERROR status as 'yes, rerun as transient error faced'. But it looks like FAILED status is being set in below case of transient error: Below are driver logs {code:java} 2020-09-30 16:12:41,183 [main] INFO com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted2020-09-30 16:12:41,183 [main] INFO com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted20-09-30 16:16:40,366 [dispatcher-event-loop-15] ERROR org.apache.spark.scheduler.TaskSchedulerImpl - Lost executor 0 on myip.87: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.2020-09-30 16:16:40,372 [dispatcher-event-loop-15] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 6.0 (TID 6, myip.87, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.2020-09-30 16:16:40,376 [dispatcher-event-loop-13] WARN org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_3_0 !2020-09-30 16:16:40,398 [dispatcher-event-loop-2] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/0 removed: Worker shutting down2020-09-30 16:16:40,399 [dispatcher-event-loop-2] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/1 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM2020-09-30 16:16:40,401 [dispatcher-event-loop-5] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/1 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.2020-09-30 16:16:40,402 [dispatcher-event-loop-5] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/2 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM2020-09-30 16:16:40,403 [dispatcher-event-loop-11] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/2 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.2020-09-30 16:16:40,404 [dispatcher-event-loop-11] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/3 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM2020-09-30 16:16:40,405 [dispatcher-event-loop-1] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/3 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.2020-09-30 16:16:40,406 [dispatcher-event-loop-1] INFO 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/4 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM2020-09-30 16:16:40,407 [dispatcher-event-loop-12] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/4 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.2020-09-30 16:16:40,408 [dispatcher-event-loop-12] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/5 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM2020-09-30 16:16:40,409 [dispatcher-event-loop-4] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/5 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.2020-09-30 16:16:40,410 [dispatcher-event-loop-5] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/6 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM2020-09-30 16:16:40,420 [dispatcher-event-loop-9] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/6 removed: java.lan
[jira] [Commented] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209504#comment-17209504 ] Apache Spark commented on SPARK-32511: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29967 > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Assignee: fqaiser94 >Priority: Major > Fix For: 3.1.0 > > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column nested inside a StructType > Column (with similar semantics to the existing {{drop}} method on > {{Dataset}}). > It should also be able to handle deeply nested columns through the same API. > This is similar to the {{withField}} method that was recently added in > SPARK-31317 and likely we can re-use some of that "infrastructure." > The public-facing method signature should be something along the following > lines: > {noformat} > def dropFields(fieldNames: String*): Column > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
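A small usage sketch of the proposed method, shown here through the Python {{Column}} binding and assuming it mirrors the Scala signature quoted above:

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# A single-row DataFrame with a struct column "a" containing fields "b" and "c".
df = spark.createDataFrame([Row(a=Row(b=1, c=2))])

# dropFields removes the named nested field(s) from the struct, analogous to
# Dataset.drop for top-level columns.
df.withColumn("a", col("a").dropFields("b")).printSchema()
# root
#  |-- a: struct (nullable = true)
#  |    |-- c: long (nullable = true)
{code}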
[jira] [Commented] (SPARK-33084) Add jar support ivy path
[ https://issues.apache.org/jira/browse/SPARK-33084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209493#comment-17209493 ] Apache Spark commented on SPARK-33084: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/29966 > Add jar support ivy path > > > Key: SPARK-33084 > URL: https://issues.apache.org/jira/browse/SPARK-33084 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > Support add jar with ivy path -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
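The ticket only states the goal, so the syntax below is a sketch of the kind of usage being proposed rather than a settled interface; the coordinate is a made-up placeholder.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ADD JAR currently takes a file path; the proposal is to also accept an
# Ivy-style coordinate so the jar can be resolved from a repository.
spark.sql("ADD JAR ivy://org.example:example-udfs:1.0.0")  # hypothetical coordinate
{code}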
[jira] [Assigned] (SPARK-33084) Add jar support ivy path
[ https://issues.apache.org/jira/browse/SPARK-33084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33084: Assignee: Apache Spark > Add jar support ivy path > > > Key: SPARK-33084 > URL: https://issues.apache.org/jira/browse/SPARK-33084 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Support add jar with ivy path -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33084) Add jar support ivy path
[ https://issues.apache.org/jira/browse/SPARK-33084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33084: Assignee: (was: Apache Spark) > Add jar support ivy path > > > Key: SPARK-33084 > URL: https://issues.apache.org/jira/browse/SPARK-33084 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > Support add jar with ivy path -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33036) Refactor RewriteCorrelatedScalarSubquery code to replace exprIds in a bottom-up manner
[ https://issues.apache.org/jira/browse/SPARK-33036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-33036. -- Fix Version/s: 3.1.0 Assignee: Takeshi Yamamuro Resolution: Fixed Resolved by https://github.com/apache/spark/pull/29913 > Refactor RewriteCorrelatedScalarSubquery code to replace exprIds in a > bottom-up manner > -- > > Key: SPARK-33036 > URL: https://issues.apache.org/jira/browse/SPARK-33036 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.1.0 > > > This PR aims at refactoring code in `RewriteCorrelatedScalarSubquery` for > replacing `ExprId`s in a bottom-up manner instead of doing in a top-down one. > This PR comes from the talk with @cloud-fan in > https://github.com/apache/spark/pull/29585#discussion_r490371252. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33084) Add jar support ivy path
angerszhu created SPARK-33084: - Summary: Add jar support ivy path Key: SPARK-33084 URL: https://issues.apache.org/jira/browse/SPARK-33084 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.1.0 Reporter: angerszhu Support add jar with ivy path -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33002) Post-port removal of non-API stubs
[ https://issues.apache.org/jira/browse/SPARK-33002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33002: Assignee: Maciej Szymkiewicz > Post-port removal of non-API stubs > -- > > Key: SPARK-33002 > URL: https://issues.apache.org/jira/browse/SPARK-33002 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > To simplify the initial port we merged all existing stubs. > However, some of these cover non-API components and are usually dynamically > annotated (generated with stubgen). > This includes modules like {{serializers}}, {{utils}}, {{shell}}, {{worker}}, > etc. > These can be safely removed as: > - MyPy can infer types from the source where a stub is not present. > - They no longer provide value when corresponding modules are present in the same > directory structure. > - Annotations are here primarily to help end users, not Spark developers, and > many of the annotations cannot be meaningfully refined. > It should also reduce the overhead of maintaining annotations (especially in > places where we don't guarantee signature stability). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33002) Post-port removal of non-API stubs
[ https://issues.apache.org/jira/browse/SPARK-33002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33002. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29879 [https://github.com/apache/spark/pull/29879] > Post-port removal of non-API stubs > -- > > Key: SPARK-33002 > URL: https://issues.apache.org/jira/browse/SPARK-33002 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > To simplify the initial port we merged all existing stubs. > However, some of these cover non-API components and are usually dynamically > annotated (generated with stubgen). > This includes modules like {{serializers}}, {{utils}}, {{shell}}, {{worker}}, > etc. > These can be safely removed as: > - MyPy can infer types from the source where a stub is not present. > - They no longer provide value when corresponding modules are present in the same > directory structure. > - Annotations are here primarily to help end users, not Spark developers, and > many of the annotations cannot be meaningfully refined. > It should also reduce the overhead of maintaining annotations (especially in > places where we don't guarantee signature stability). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
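To illustrate the first two bullets above: when a {{.pyi}} stub sits next to the corresponding {{.py}} module, type checkers read the stub instead of the source, so a coarse generated stub can hide more precise inline hints. The module and function below are hypothetical.

{code:python}
# internal_utils.py -- hypothetical non-API module with inline annotations;
# MyPy can infer everything it needs directly from this source file.
from typing import List, TypeVar

T = TypeVar("T")

def chunk(values: List[T], size: int) -> List[List[T]]:
    """Split values into consecutive chunks of at most `size` elements."""
    return [values[i:i + size] for i in range(0, len(values), size)]

# internal_utils.pyi -- a stubgen-style stub for the same module. If this file
# is kept alongside internal_utils.py, the checker uses it *instead of* the
# source, and the Any-typed signature erases the information above:
#
#     from typing import Any
#     def chunk(values: Any, size: Any) -> Any: ...
{code}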
[jira] [Commented] (SPARK-33003) Add type hints guideliness to the documentation
[ https://issues.apache.org/jira/browse/SPARK-33003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209430#comment-17209430 ] Hyukjin Kwon commented on SPARK-33003: -- If you think it's useful to write some guides for users as well, it should likely be. Please feel free to go ahead as you go :-) > Add type hints guideliness to the documentation > --- > > Key: SPARK-33003 > URL: https://issues.apache.org/jira/browse/SPARK-33003 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33003) Add type hints guideliness to the documentation
[ https://issues.apache.org/jira/browse/SPARK-33003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209428#comment-17209428 ] Hyukjin Kwon commented on SPARK-33003: -- You mean the latter is a must have \(?\). Yeah, I think just doing it for dev is enough for now. > Add type hints guideliness to the documentation > --- > > Key: SPARK-33003 > URL: https://issues.apache.org/jira/browse/SPARK-33003 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org