[jira] [Created] (SPARK-38404) Spark does not find CTE inside nested CTE
Joan Heredia Rius created SPARK-38404:
--------------------------------------

Summary: Spark does not find CTE inside nested CTE
Key: SPARK-38404
URL: https://issues.apache.org/jira/browse/SPARK-38404
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.1, 3.2.0
Environment: Tested on:
* macOS Monterey 12.2.1 (21D62)
* python 3.9.10
* pip 22.0.3
* pyspark 3.2.0 & 3.2.1 (SQL query does not work) and pyspark 3.0.1 and 3.1.3 (SQL query works)
Reporter: Joan Heredia Rius

Hello! It seems that when CTEs are defined and then used inside another CTE in Spark SQL, Spark treats the inner reference to the CTE as a table or view, which is not found, so it errors with `Table or view not found: `

h3. Steps to reproduce
# `pip install pyspark==3.2.0` (also happens with 3.2.1)
# Start the pyspark console by typing `pyspark` in the terminal
# Try to run the following SQL with `spark.sql(sql)`

{code:java}
WITH mock_cte__users
AS (
    SELECT 1 AS id
),
model_under_test AS (
    WITH users
    AS (
        SELECT * FROM mock_cte__users
    )
    SELECT * FROM users
)
SELECT * FROM model_under_test;{code}

Spark will fail with

{code:java}
pyspark.sql.utils.AnalysisException: Table or view not found: mock_cte__users; line 8 pos 29;
{code}

I don't know if this is a regression or expected behavior of the new 3.2.* versions. This fix, introduced in 3.2.0, might be related: https://issues.apache.org/jira/browse/SPARK-36447

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
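For comparison, the scoping the reporter expects — the outer `mock_cte__users` CTE being visible from inside the nested `WITH` — is how other SQL engines resolve it. A minimal sketch using Python's bundled sqlite3 (not Spark; used here only to illustrate the expected name resolution, it says nothing about Spark's analyzer):

```python
import sqlite3

# Same query shape as the reproduction above, run against SQLite, where the
# outer CTE (mock_cte__users) is in scope inside the nested WITH.
sql = """
WITH mock_cte__users AS (
    SELECT 1 AS id
),
model_under_test AS (
    WITH users AS (
        SELECT * FROM mock_cte__users
    )
    SELECT * FROM users
)
SELECT * FROM model_under_test;
"""

conn = sqlite3.connect(":memory:")
rows = conn.execute(sql).fetchall()
```

Here SQLite resolves the inner reference and returns the single row with id = 1 instead of raising a "no such table" error.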
[jira] [Commented] (SPARK-38104) Use error classes in the parsing errors of windows
[ https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500556#comment-17500556 ] Apache Spark commented on SPARK-38104: -- User 'yutoacts' has created a pull request for this issue: https://github.com/apache/spark/pull/35718 > Use error classes in the parsing errors of windows > -- > > Key: SPARK-38104 > URL: https://issues.apache.org/jira/browse/SPARK-38104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryParsingErrors: > * repetitiveWindowDefinitionError > * invalidWindowReferenceError > * cannotResolveWindowReferenceError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38104) Use error classes in the parsing errors of windows
[ https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38104: Assignee: (was: Apache Spark) > Use error classes in the parsing errors of windows > -- > > Key: SPARK-38104 > URL: https://issues.apache.org/jira/browse/SPARK-38104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryParsingErrors: > * repetitiveWindowDefinitionError > * invalidWindowReferenceError > * cannotResolveWindowReferenceError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38104) Use error classes in the parsing errors of windows
[ https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500555#comment-17500555 ] Apache Spark commented on SPARK-38104: -- User 'yutoacts' has created a pull request for this issue: https://github.com/apache/spark/pull/35718 > Use error classes in the parsing errors of windows > -- > > Key: SPARK-38104 > URL: https://issues.apache.org/jira/browse/SPARK-38104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryParsingErrors: > * repetitiveWindowDefinitionError > * invalidWindowReferenceError > * cannotResolveWindowReferenceError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38104) Use error classes in the parsing errors of windows
[ https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38104: Assignee: Apache Spark > Use error classes in the parsing errors of windows > -- > > Key: SPARK-38104 > URL: https://issues.apache.org/jira/browse/SPARK-38104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Migrate the following errors in QueryParsingErrors: > * repetitiveWindowDefinitionError > * invalidWindowReferenceError > * cannotResolveWindowReferenceError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38403) `running-on-kubernetes` page render bad in v3.2.1(latest) website
[ https://issues.apache.org/jira/browse/SPARK-38403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500553#comment-17500553 ]

Kent Yao commented on SPARK-38403:
----------------------------------

I guess it has been fixed via https://github.com/apache/spark/commit/3a179d762602243497c528bea3b3370e7548fa2d

> `running-on-kubernetes` page render bad in v3.2.1(latest) website
> -----------------------------------------------------------------
>
> Key: SPARK-38403
> URL: https://issues.apache.org/jira/browse/SPARK-38403
> Project: Spark
> Issue Type: Bug
> Components: Documentation
> Affects Versions: 3.2.1
> Reporter: Yikun Jiang
> Priority: Major
>
> Looks like the `running-on-kubernetes` page encountered some problems when published.
> [1] https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties
> (You can see the bad format after #spark-properties)
> I also checked the master branch (with a local env set up) and v3.2.0
> (https://spark.apache.org/docs/3.2.0/running-on-kubernetes.html); both work well.
> But for the v3.2.1 tag, I couldn't install the doc dependencies due to a dependency conflict.
> I'm not very familiar with the doc infra tooling. Can anyone help take a look?
[jira] [Created] (SPARK-38403) `running-on-kubernetes` page render bad in v3.2.1(latest) website
Yikun Jiang created SPARK-38403:
-------------------------------

Summary: `running-on-kubernetes` page render bad in v3.2.1(latest) website
Key: SPARK-38403
URL: https://issues.apache.org/jira/browse/SPARK-38403
Project: Spark
Issue Type: Bug
Components: Documentation
Affects Versions: 3.2.1
Reporter: Yikun Jiang

Looks like the `running-on-kubernetes` page encountered some problems when published.

[1] https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties
(You can see the bad format after #spark-properties)

I also checked the master branch (with a local env set up) and v3.2.0 (https://spark.apache.org/docs/3.2.0/running-on-kubernetes.html); both work well. But for the v3.2.1 tag, I couldn't install the doc dependencies due to a dependency conflict.

I'm not very familiar with the doc infra tooling. Can anyone help take a look?
[jira] [Created] (SPARK-38402) Improve user experience when working on data frames created from CSV and JSON in PERMISSIVE mode.
Dilip Biswal created SPARK-38402:
--------------------------------

Summary: Improve user experience when working on data frames created from CSV and JSON in PERMISSIVE mode.
Key: SPARK-38402
URL: https://issues.apache.org/jira/browse/SPARK-38402
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.2.1
Reporter: Dilip Biswal

In our data processing pipeline, we first process the user-supplied data and eliminate invalid/corrupt records. So we parse JSON and CSV files in PERMISSIVE mode, where all the invalid records are captured in "_corrupt_record". We then apply predicates on "_corrupt_record" to eliminate the bad records before passing the good records further down the processing pipeline.

We encountered two issues:

1. The introduction of "predicate pushdown" for CSV does not take into account this system-generated "_corrupt_record" column and tries to push it down to the scan, resulting in an exception, since the column is not part of the base schema.
2. Applying predicates on "_corrupt_record" results in an AnalysisException like the following.

{code:java}
val schema = new StructType()
  .add("id", IntegerType, true)
  // The weight field is defined wrongly. The actual data contains floating
  // point numbers, while the schema specifies an integer.
  .add("weight", IntegerType, true)
  .add("price", IntegerType, true)
  // The schema contains a special column _corrupt_record, which does not
  // exist in the data. This column captures rows that did not parse correctly.
  .add("_corrupt_record", StringType, true)

val csv_with_wrong_schema = spark.read.format("csv")
  .option("header", "true")
  .schema(schema)
  .load("/FileStore/tables/csv_corrupt_record.csv")

val badRows = csv_with_wrong_schema.filter($"_corrupt_record".isNotNull)
val numBadRows = badRows.count()
{code}

The count fails with:

{code:java}
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default). For example: spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count() and spark.read.schema(schema).csv(file).select("_corrupt_record").show(). Instead, you can cache or save the parsed results and then send the same query. For example, val df = spark.read.schema(schema).csv(file).cache() and then df.filter($"_corrupt_record".isNotNull).count().
{code}

For (1), we have disabled predicate pushdown. For (2), we currently cache the data frame before using it; however, it's not convenient and we would like to see a better user experience.
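The workflow described above — parse permissively, capture unparseable rows in a "_corrupt_record"-style column, then filter on that column — can be sketched outside Spark in plain Python; `parse_permissive` and the inline sample data are illustrative stand-ins, not Spark APIs:

```python
import csv
import io

# Sample CSV where the second data row does not satisfy the integer schema.
data = "id,weight,price\n1,2,3\nbad,4.5,oops\n"
schema = {"id": int, "weight": int, "price": int}

def parse_permissive(text, schema):
    """Parse rows, recording failures in _corrupt_record instead of raising."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        try:
            row = {name: cast(rec[name]) for name, cast in schema.items()}
            row["_corrupt_record"] = None
        except (ValueError, TypeError):
            # Null out the typed fields and keep the raw line for inspection.
            row = {name: None for name in schema}
            row["_corrupt_record"] = ",".join(rec.values())
        rows.append(row)
    return rows

rows = parse_permissive(data, schema)
# Predicate on _corrupt_record separates good rows from bad ones.
good = [r for r in rows if r["_corrupt_record"] is None]
bad = [r for r in rows if r["_corrupt_record"] is not None]
```

In Spark itself this filtering is what triggers the exception above; here it is just a plain list comprehension, which is the user experience the reporter is asking for.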
[jira] [Created] (SPARK-38401) Unify get preferred locations for shuffle in AQE
XiDuo You created SPARK-38401:
-----------------------------

Summary: Unify get preferred locations for shuffle in AQE
Key: SPARK-38401
URL: https://issues.apache.org/jira/browse/SPARK-38401
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You

There are several issues in the method `getPreferredLocations` of `ShuffledRowRDD`:

* it does not respect the config `spark.shuffle.reduceLocality.enabled`, so we cannot disable it.
* it does not respect `REDUCER_PREF_LOCS_FRACTION`, so the hint has little value when the DAG scheduler sends a task to an executor that holds only a small share of the data; worse, the driver takes more memory to store these useless locations.
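The two checks the ticket says are missing can be illustrated with a small sketch (names are illustrative; the real logic lives in `ShuffledRowRDD.getPreferredLocations` and the map output tracker): report a host as a preferred location only when reduce locality is enabled and the host holds at least a `REDUCER_PREF_LOCS_FRACTION`-style share of the reducer's input.

```python
def preferred_locations(bytes_by_host, fraction=0.2, locality_enabled=True):
    """Return hosts holding at least `fraction` of this reducer's input.

    `fraction` mirrors the idea of REDUCER_PREF_LOCS_FRACTION and
    `locality_enabled` mirrors spark.shuffle.reduceLocality.enabled;
    both names are stand-ins, not the actual Spark code path.
    """
    if not locality_enabled:
        return []  # the config the ticket wants respected
    total = sum(bytes_by_host.values())
    if total == 0:
        return []
    # Hosts below the fraction threshold are dropped: scheduling there buys
    # little locality, and storing them only costs driver memory.
    return [host for host, nbytes in bytes_by_host.items()
            if nbytes / total >= fraction]

# A host with only 5% of the reducer's data is not worth a locality hint:
locs = preferred_locations({"hostA": 900, "hostB": 50, "hostC": 50})
```

With the threshold applied, only `hostA` is reported; without it, all three hosts would be stored as "preferred" even though two of them barely matter.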
[jira] [Commented] (SPARK-38104) Use error classes in the parsing errors of windows
[ https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500535#comment-17500535 ]

Yuto Akutsu commented on SPARK-38104:
-------------------------------------

I'm working on this.

> Use error classes in the parsing errors of windows
> --------------------------------------------------
>
> Key: SPARK-38104
> URL: https://issues.apache.org/jira/browse/SPARK-38104
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Max Gekk
> Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * repetitiveWindowDefinitionError
> * invalidWindowReferenceError
> * cannotResolveWindowReferenceError
> to use error classes. Throw an implementation of SparkThrowable. Also write
> a test for every error in QueryParsingErrorsSuite.
[jira] [Commented] (SPARK-38400) Enable Series.rename to change index labels
[ https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500530#comment-17500530 ] Apache Spark commented on SPARK-38400: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35717 > Enable Series.rename to change index labels > --- > > Key: SPARK-38400 > URL: https://issues.apache.org/jira/browse/SPARK-38400 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Enable Series.rename to change index labels, with function `index` input. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38400) Enable Series.rename to change index labels
[ https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38400: Assignee: (was: Apache Spark) > Enable Series.rename to change index labels > --- > > Key: SPARK-38400 > URL: https://issues.apache.org/jira/browse/SPARK-38400 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Enable Series.rename to change index labels, with function `index` input. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38400) Enable Series.rename to change index labels
[ https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38400: Assignee: Apache Spark > Enable Series.rename to change index labels > --- > > Key: SPARK-38400 > URL: https://issues.apache.org/jira/browse/SPARK-38400 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Enable Series.rename to change index labels, with function `index` input. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38398) Add `priorityClassName` integration test case
[ https://issues.apache.org/jira/browse/SPARK-38398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500531#comment-17500531 ] Apache Spark commented on SPARK-38398: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35716 > Add `priorityClassName` integration test case > - > > Key: SPARK-38398 > URL: https://issues.apache.org/jira/browse/SPARK-38398 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38383) Support APP_ID and EXECUTOR_ID placeholder in annotations
[ https://issues.apache.org/jira/browse/SPARK-38383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38383: -- Parent: SPARK-36057 Issue Type: Sub-task (was: Improvement) > Support APP_ID and EXECUTOR_ID placeholder in annotations > - > > Key: SPARK-38383 > URL: https://issues.apache.org/jira/browse/SPARK-38383 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38400) Enable Series.rename to change index labels
Xinrong Meng created SPARK-38400:
--------------------------------

Summary: Enable Series.rename to change index labels
Key: SPARK-38400
URL: https://issues.apache.org/jira/browse/SPARK-38400
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng

Enable Series.rename to change index labels, with function `index` input.
[jira] [Assigned] (SPARK-38398) Add `priorityClassName` integration test case
[ https://issues.apache.org/jira/browse/SPARK-38398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38398: - Assignee: Dongjoon Hyun > Add `priorityClassName` integration test case > - > > Key: SPARK-38398 > URL: https://issues.apache.org/jira/browse/SPARK-38398 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38399) Why doesn't shuffle hash join support build left table for left-outer-join but full-outer-join?
mengdou created SPARK-38399:
---------------------------

Summary: Why doesn't shuffle hash join support build left table for left-outer-join but full-outer-join?
Key: SPARK-38399
URL: https://issues.apache.org/jira/browse/SPARK-38399
Project: Spark
Issue Type: Question
Components: Spark Core, SQL
Affects Versions: 3.2.0
Reporter: mengdou

Why doesn't shuffle hash join support building the left table for a left outer join, while it does support building the right table for a full outer join?

!image-2022-03-03-13-53-50-520.png!

IMO, if the left table is the build table, then similar to the full outer join case, we could first create a BitSet to record which build-side rows have matched, then iterate all rows from the stream-table iterator and look up each key in the hash relation. Afterwards, we iterate the hash relation and check the BitSet: the rows that never matched are exactly the left-outer rows to emit.

Can anyone help? Thanks!
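The approach sketched in the question — build on the left side, track matches in a bit set while probing with the stream side, then emit unmatched build rows — could look like this in plain Python (a `set` stands in for the BitSet; names are illustrative, not Spark internals):

```python
from collections import defaultdict

def left_outer_join_build_left(build_rows, stream_rows, key):
    """Left outer join where the LEFT side is the build (hashed) side."""
    # Build the hash relation on the left (build) table.
    relation = defaultdict(list)
    for idx, row in enumerate(build_rows):
        relation[key(row)].append(idx)

    matched = set()  # stands in for the BitSet over build-side row ids
    out = []
    # Probe with the stream (right) side; mark every build row that joins.
    for srow in stream_rows:
        for idx in relation.get(key(srow), []):
            matched.add(idx)
            out.append((build_rows[idx], srow))
    # Second pass: build rows that never matched become the left-outer rows,
    # padded with None on the stream side.
    for idx, row in enumerate(build_rows):
        if idx not in matched:
            out.append((row, None))
    return out

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "x"), (1, "y")]
result = left_outer_join_build_left(left, right, key=lambda r: r[0])
```

The extra cost versus a build-right plan is the matched-set bookkeeping and the second pass over the hash relation, which is the same trade-off the full-outer-join path already accepts.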
[jira] [Created] (SPARK-38398) Add `priorityClassName` integration test case
Dongjoon Hyun created SPARK-38398:
---------------------------------

Summary: Add `priorityClassName` integration test case
Key: SPARK-38398
URL: https://issues.apache.org/jira/browse/SPARK-38398
Project: Spark
Issue Type: Sub-task
Components: Kubernetes, Tests
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-38334) Implement support for DEFAULT values for columns in tables
[ https://issues.apache.org/jira/browse/SPARK-38334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38334: -- Affects Version/s: 3.3.0 (was: 3.2.1) > Implement support for DEFAULT values for columns in tables > --- > > Key: SPARK-38334 > URL: https://issues.apache.org/jira/browse/SPARK-38334 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 3.3.0 >Reporter: Daniel >Priority: Major > > This story tracks the implementation of DEFAULT values for columns in tables. > CREATE TABLE and ALTER TABLE invocations will support setting column default > values for future operations. Following INSERT, UPDATE, MERGE statements may > then reference the value using the DEFAULT keyword as needed. > Examples: > {code:sql} > CREATE TABLE T(a INT, b INT NOT NULL); > -- The default default is NULL > INSERT INTO T VALUES (DEFAULT, 0); > INSERT INTO T(b) VALUES (1); > SELECT * FROM T; > (NULL, 0) > (NULL, 1) > -- Adding a default to a table with rows, sets the values for the > -- existing rows (exist default) and new rows (current default). > ALTER TABLE T ADD COLUMN c INT DEFAULT 5; > INSERT INTO T VALUES (1, 2, DEFAULT); > SELECT * FROM T; > (NULL, 0, 5) > (NULL, 1, 5) > (1, 2, 5) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38334) Implement support for DEFAULT values for columns in tables
[ https://issues.apache.org/jira/browse/SPARK-38334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38334: -- Issue Type: Improvement (was: Story) > Implement support for DEFAULT values for columns in tables > --- > > Key: SPARK-38334 > URL: https://issues.apache.org/jira/browse/SPARK-38334 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Daniel >Priority: Major > > This story tracks the implementation of DEFAULT values for columns in tables. > CREATE TABLE and ALTER TABLE invocations will support setting column default > values for future operations. Following INSERT, UPDATE, MERGE statements may > then reference the value using the DEFAULT keyword as needed. > Examples: > {code:sql} > CREATE TABLE T(a INT, b INT NOT NULL); > -- The default default is NULL > INSERT INTO T VALUES (DEFAULT, 0); > INSERT INTO T(b) VALUES (1); > SELECT * FROM T; > (NULL, 0) > (NULL, 1) > -- Adding a default to a table with rows, sets the values for the > -- existing rows (exist default) and new rows (current default). > ALTER TABLE T ADD COLUMN c INT DEFAULT 5; > INSERT INTO T VALUES (1, 2, DEFAULT); > SELECT * FROM T; > (NULL, 0, 5) > (NULL, 1, 5) > (1, 2, 5) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
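The semantics in the example above can be tried today with Python's bundled SQLite (a sketch, not Spark; note that SQLite does not accept the `DEFAULT` keyword inside `VALUES`, so the inserts omit the column instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE T(a INT, b INT NOT NULL)")
# Omitting `a` falls back to its default (the "default default" is NULL).
cur.execute("INSERT INTO T(b) VALUES (0)")
cur.execute("INSERT INTO T(b) VALUES (1)")
# Adding a column with a default fills it for existing rows (exist default)
# as well as new rows (current default).
cur.execute("ALTER TABLE T ADD COLUMN c INT DEFAULT 5")
cur.execute("INSERT INTO T VALUES (1, 2, 5)")
rows = cur.execute("SELECT * FROM T").fetchall()
```

The final `rows` matches the expected output in the ticket: `[(None, 0, 5), (None, 1, 5), (1, 2, 5)]`.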
[jira] [Updated] (SPARK-38395) Pyspark issue in resolving column when there is dot (.)
[ https://issues.apache.org/jira/browse/SPARK-38395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krishna Sangeeth KS updated SPARK-38395:
----------------------------------------
    Description:

Pyspark's apply-in-pandas has some difficulty resolving columns when there is a dot in the column name. Here is an example that I have which reproduces the issue, taken by modifying the doctest example [here|https://github.com/apache/spark/blob/branch-3.0/python/pyspark/sql/pandas/group_ops.py#L237-L248]:

{code:python}
df1 = spark.createDataFrame(
    [(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 4.0)],
    ("abc|database|10.159.154|xef", "id", "v1"))

df2 = spark.createDataFrame(
    [(2101, 1, "x"), (2101, 2, "y")],
    ("abc|database|10.159.154|xef", "id", "v2"))

def asof_join(l, r):
    return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 double, v2 string").show()
{code}

This gives the below error:

{code:python}
AnalysisException                         Traceback (most recent call last)
<ipython-input> in <module>
      8     return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", by="id")
      9 df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
---> 10     asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 double, v2 string").show()

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py in applyInPandas(self, func, schema)
    295         udf = pandas_udf(
    296             func, returnType=schema, functionType=PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF)
--> 297         all_cols = self._extract_cols(self._gd1) + self._extract_cols(self._gd2)
    298         udf_column = udf(*all_cols)
    299         jdf = self._gd1._jgd.flatMapCoGroupsInPandas(self._gd2._jgd, udf_column._jc.expr())

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py in _extract_cols(gd)
    303     def _extract_cols(gd):
    304         df = gd._df
--> 305         return [df[col] for col in df.columns]
    306
    307

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py in <listcomp>(.0)
    303     def _extract_cols(gd):
    304         df = gd._df
--> 305         return [df[col] for col in df.columns]
    306
    307

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/dataframe.py in __getitem__(self, item)
   1378         """
   1379         if isinstance(item, basestring):
-> 1380             jc = self._jdf.apply(item)
   1381             return Column(jc)
   1382         elif isinstance(item, Column):

~/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306
   1307         for temp_arg in temp_args:

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    135             # Hide where the exception came from that shows a non-Pythonic
    136             # JVM exception message.
--> 137             raise_from(converted)
    138         else:
    139             raise

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in raise_from(e)

AnalysisException: Cannot resolve column name "abc|database|10.159.154|xef" among (abc|database|10.159.154|xef, id, v1); did you mean to quote the `abc|database|10.159.154|xef` column?;
{code}

As we can see, the column is present in the "among" list. When I replace `.` (dot) with `_` (underscore), the code actually works:

{code:python}
df1 = spark.createDataFrame(
    [(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 4.0)],
    ("abc|database|10_159_154|xef", "id", "v1"))

df2 = spark.createDataFrame(
    [(2101, 1, "x"), (2101, 2, "y")],
    ("abc|database|10_159_154|xef", "id", "v2"))

def asof_join(l, r):
    return pd.merge_asof(l, r, on="abc|database|10_159_154|xef", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="`abc|database|10_159_154|xef` int, id int, v1 double, v2 string").show()
{code}

{code:java}
+---------------------------+---+---+---+
|abc|database|10_159_154|xef| id| v1| v2|
+---------------------------+---+---+---+
|                       2101|  1|1.0|  x|
|                       2102|  1|3.0|  x|
|                       2101|  2|2.0|  y|
|                       2102|  2|4.0|  y|
+---------------------------+---+---+---+
{code}

was:
Pyspark apply in pandas have some difficult in resolving columns when there is dot in the column name. Here is an example that I have which reproduces the issue. Example taken by modifying doctest example [here|https://github.com/apache/spark/blob/branch-3.0/pytho
[jira] [Commented] (SPARK-38340) Upgrade protobuf-java from 2.5.0 to 3.16.1
[ https://issues.apache.org/jira/browse/SPARK-38340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500487#comment-17500487 ] Yang Jie commented on SPARK-38340: -- cc [~hyukjin.kwon] , do you know how GA tests code that doesn't merged with hadoop-2 profile? > Upgrade protobuf-java from 2.5.0 to 3.16.1 > -- > > Key: SPARK-38340 > URL: https://issues.apache.org/jira/browse/SPARK-38340 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > > [CVE-2021-22569|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-22569] > To do this upgrade I have done > external/kinesis-asl-assembly/pom.xml change line 65 to > 3.16.1 > pom.xml change line 124 to 3.16.1 > run > ./dev/test-dependencies.sh --replace-manifest > witch change > dev/deps/spark-deps-hadoop-2-hive-2.3 line 235 to > protobuf-java/3.16.1//protobuf-java-3.16.1.jar > and > dev/deps/spark-deps-hadoop-3-hive-2.3 to > protobuf-java/3.16.1//protobuf-java-3.16.1.jar > My branch > [protobuf-java-from-2.5.0-to-3.16.1|https://github.com/bjornjorgensen/spark/tree/protobuf-java-from-2.5.0-to-3.16.1] > is OK with testes, but when I run > ./build/mvn -DskipTests clean package && ./build/mvn -e package > > I get this error: > 01:01:41.381 WARN > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite: > = POSSIBLE THREAD LEAK IN SUITE > o.a.s.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite, threads: > rpc-boss-3348-1 (daemon=true), shuffle-boss-3351-1 (daemon=true) = > Run completed in 1 hour, 7 minutes, 35 seconds. > Total number of tests run: 11260 > Suites: completed 505, aborted 0 > Tests: succeeded 11259, failed 1, canceled 5, ignored 57, pending 0 > *** 1 TEST FAILED *** > [INFO] > > [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 3.396 > s] > [INFO] Spark Project Tags . 
SUCCESS [ 7.374 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 9.324 > s] > [INFO] Spark Project Local DB . SUCCESS [ 4.097 > s] > [INFO] Spark Project Networking ... SUCCESS [ 47.468 > s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 10.478 > s] > [INFO] Spark Project Unsafe ... SUCCESS [ 2.425 > s] > [INFO] Spark Project Launcher . SUCCESS [ 2.767 > s] > [INFO] Spark Project Core . SUCCESS [30:56 > min] > [INFO] Spark Project ML Local Library . SUCCESS [ 29.105 > s] > [INFO] Spark Project GraphX ... SUCCESS [02:09 > min] > [INFO] Spark Project Streaming SUCCESS [05:21 > min] > [INFO] Spark Project Catalyst . SUCCESS [08:15 > min] > [INFO] Spark Project SQL .. FAILURE [ 01:11 > h] > [INFO] Spark Project ML Library ... SKIPPED > [INFO] Spark Project Tools SKIPPED > [INFO] Spark Project Hive . SKIPPED > [INFO] Spark Project REPL . SKIPPED > [INFO] Spark Project Assembly . SKIPPED > [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED > [INFO] Kafka 0.10+ Source for Structured Streaming SKIPPED > [INFO] Spark Project Examples . SKIPPED > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED > [INFO] Spark Avro . SKIPPED > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 02:00 h > [INFO] Finished at: 2022-02-27T01:01:44+01:00 > [INFO] > > [ERROR] Failed to execute goal > org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project > spark-sql_2.12: There are test failures -> [Help 1] > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute > goal org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project > spark-sql_2.12: There are test failures > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:215) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor
[jira] [Updated] (SPARK-38397) Support Kueue: K8s-native Job Queueing
[ https://issues.apache.org/jira/browse/SPARK-38397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38397: -- Description: There are several ways to run Spark on K8s including vanilla `spark-submit` with built-in `KubernetesClusterManager`, `spark-submit` with custom `ExternalClusterManager`, CRD-based operators (like spark-on-k8s-operator), custom K8s `schedulers`, custom `standalone pod definitions`, and so on. This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] {code} metadata: generateName: sample-job- annotations: kueue.k8s.io/queue-name: main {code} The best case is Apache Spark users use it in the future via pod templates or existing configuration. In other words, we don't need to do anything and close this JIRA without any patches. was: This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] {code} metadata: generateName: sample-job- annotations: kueue.k8s.io/queue-name: main {code} The best case is Apache Spark users use it in the future via pod templates or existing configuration. In other words, we don't need to do anything and close this JIRA without any patches. > Support Kueue: K8s-native Job Queueing > -- > > Key: SPARK-38397 > URL: https://issues.apache.org/jira/browse/SPARK-38397 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > There are several ways to run Spark on K8s including vanilla `spark-submit` > with built-in `KubernetesClusterManager`, `spark-submit` with custom > `ExternalClusterManager`, CRD-based operators (like spark-on-k8s-operator), > custom K8s `schedulers`, custom `standalone pod definitions`, and so on. > This issue is tracking K8s-native Job Queueing related work. 
> * [https://github.com/kubernetes-sigs/kueue] > {code} > metadata: > generateName: sample-job- > annotations: > kueue.k8s.io/queue-name: main > {code} > The best case is Apache Spark users use it in the future via pod templates or > existing configuration. In other words, we don't need to do anything and > close this JIRA without any patches. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
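The Kueue example above attaches a queue via pod metadata, which is exactly what a Spark pod template could carry without any Spark code change. A minimal sketch of that metadata as a Python dict (the annotation key comes from the Kueue snippet in this issue; the helper name and structure are purely illustrative, not part of Spark's API):

```python
# Build pod metadata that routes a pod to a Kueue queue, mirroring the
# YAML snippet in the issue description. Illustrative only.

def kueue_pod_metadata(queue_name: str, name_prefix: str = "sample-job-") -> dict:
    """Return pod metadata carrying the Kueue queue-name annotation."""
    return {
        "generateName": name_prefix,
        # Annotation key as shown in the issue's Kueue example.
        "annotations": {"kueue.k8s.io/queue-name": queue_name},
    }

meta = kueue_pod_metadata("main")
```

If this shape is all Kueue needs, the "best case" in the description holds: users supply it through an executor/driver pod template and Spark itself needs no patch.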
[jira] [Updated] (SPARK-38397) Support Kueue: K8s-native Job Queueing
[ https://issues.apache.org/jira/browse/SPARK-38397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38397: -- Description: This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] {code} metadata: generateName: sample-job- annotations: kueue.k8s.io/queue-name: main {code} The best case is Apache Spark users use it in the future via pod templates or existing configuration. In other words, we don't need to do anything and close this JIRA without any patches. was: This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] The best case is Apache Spark users use it in the future via pod templates. In other words, we can close this JIRA as `Done` without any patches. > Support Kueue: K8s-native Job Queueing > -- > > Key: SPARK-38397 > URL: https://issues.apache.org/jira/browse/SPARK-38397 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue is tracking K8s-native Job Queueing related work. > * [https://github.com/kubernetes-sigs/kueue] > {code} > metadata: > generateName: sample-job- > annotations: > kueue.k8s.io/queue-name: main > {code} > The best case is Apache Spark users use it in the future via pod templates or > existing configuration. In other words, we don't need to do anything and > close this JIRA without any patches. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38397) Support Kueue: K8s-native Job Queueing
[ https://issues.apache.org/jira/browse/SPARK-38397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38397: -- Description: This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] The best case is Apache Spark users use it in the future via pod templates. In other words, we can close this JIRA as `Done` without any patches. was: This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] The best case is Apache Spark users use it in the future via pod templates. > Support Kueue: K8s-native Job Queueing > -- > > Key: SPARK-38397 > URL: https://issues.apache.org/jira/browse/SPARK-38397 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue is tracking K8s-native Job Queueing related work. > * [https://github.com/kubernetes-sigs/kueue] > The best case is Apache Spark users use it in the future via pod templates. > In other words, we can close this JIRA as `Done` without any patches. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38397) Support Kueue: K8s-native Job Queueing
Dongjoon Hyun created SPARK-38397: - Summary: Support Kueue: K8s-native Job Queueing Key: SPARK-38397 URL: https://issues.apache.org/jira/browse/SPARK-38397 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.4.0 Reporter: Dongjoon Hyun This issue is tracking K8s-native Job Queueing related work. * [https://github.com/kubernetes-sigs/kueue] The best case is Apache Spark users use it in the future via pod templates. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38310) Support job queue in YuniKorn feature step
[ https://issues.apache.org/jira/browse/SPARK-38310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500448#comment-17500448 ] Wilfred Spiegelenburg commented on SPARK-38310: --- Spark on yarn uses the --queue option as part of the submit. Should the same option work for Spark on K8s when used in combination with YuniKorn or Volcano as a scheduler? Driver pod annotations and labels support the queue specification on the YuniKorn side. Volcano has a similar approach through the pod group. Would be nice to be able to leverage the existing command line option to submit to a specific queue > Support job queue in YuniKorn feature step > -- > > Key: SPARK-38310 > URL: https://issues.apache.org/jira/browse/SPARK-38310 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.2 > Environment: Like SPARK-38188, yunikorn needs to support the queue > property and use that in the feature step to properly configure > driver/executor pods. >Reporter: Weiwei Yang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
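The comment above suggests mapping the single `--queue` submit option onto scheduler-specific pod metadata. A hedged sketch of what such a mapping could look like — the label and annotation keys below are placeholders I chose for illustration; the real keys are defined by YuniKorn and Volcano, not by this issue:

```python
# Fan a generic queue name out to per-scheduler pod metadata.
# Keys are assumed/illustrative, not verified against either scheduler.

def queue_metadata(scheduler: str, queue: str) -> dict:
    if scheduler == "yunikorn":
        # Assumption: YuniKorn reads the queue from a pod label.
        return {"labels": {"queue": queue}}
    if scheduler == "volcano":
        # Assumption: Volcano takes the queue via a pod-group annotation.
        return {"annotations": {"scheduling.volcano.sh/queue-name": queue}}
    raise ValueError(f"unknown scheduler: {scheduler}")
```

A feature step could then merge the returned dict into the driver/executor pod spec, so `--queue` behaves uniformly across YARN and K8s.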
[jira] [Commented] (SPARK-37753) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer join with many empty partitions on the left
[ https://issues.apache.org/jira/browse/SPARK-37753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500446#comment-17500446 ] Apache Spark commented on SPARK-37753: -- User 'ekoifman' has created a pull request for this issue: https://github.com/apache/spark/pull/35715 > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer > with join many empty partitions on the left > - > > Key: SPARK-37753 > URL: https://issues.apache.org/jira/browse/SPARK-37753 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8 >Reporter: Eugene Koifman >Priority: Major > > if {{DynamicJoinSelection}} sees a LOJ where there are many empty partitions > on the left, we want to run it as a shuffle join so that tasks with empty LHS > short circuit quickly. > So we should inhibit Broadcast join. > This is a follow up to SPARK-37193 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
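The heuristic described above — prefer a shuffle join when many left-side partitions of a left outer join are empty, so tasks with an empty LHS short-circuit — can be modeled in a few lines. This is a toy sketch of the idea, not Spark's actual `DynamicJoinSelection` code; the threshold name and value are invented for illustration:

```python
# Toy model: demote broadcast hash join when the fraction of empty
# partitions on the streamed (left) side is high enough.

def should_demote_broadcast(partition_row_counts, min_empty_ratio=0.5):
    """Return True when enough left-side partitions are empty to favor
    a shuffle join over a broadcast join."""
    if not partition_row_counts:
        return False
    empty = sum(1 for n in partition_row_counts if n == 0)
    return empty / len(partition_row_counts) >= min_empty_ratio
```

With a broadcast join every left task still pays to receive the broadcast side; with a shuffle join, a task whose left partition is empty finishes almost immediately, which is the win the comment is after.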
[jira] [Assigned] (SPARK-37753) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer join with many empty partitions on the left
[ https://issues.apache.org/jira/browse/SPARK-37753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37753: Assignee: Apache Spark > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer > with join many empty partitions on the left > - > > Key: SPARK-37753 > URL: https://issues.apache.org/jira/browse/SPARK-37753 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8 >Reporter: Eugene Koifman >Assignee: Apache Spark >Priority: Major > > if {{DynamicJoinSelection}} sees a LOJ where there are many empty partitions > on the left, we want to run it as a shuffle join so that tasks with empty LHS > short circuit quickly. > So we should inhibit Broadcast join. > This is a follow up to SPARK-37193 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37753) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer join with many empty partitions on the left
[ https://issues.apache.org/jira/browse/SPARK-37753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37753: Assignee: (was: Apache Spark) > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer > with join many empty partitions on the left > - > > Key: SPARK-37753 > URL: https://issues.apache.org/jira/browse/SPARK-37753 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8 >Reporter: Eugene Koifman >Priority: Major > > if {{DynamicJoinSelection}} sees a LOJ where there are many empty partitions > on the left, we want to run it as a shuffle join so that tasks with empty LHS > short circuit quickly. > So we should inhibit Broadcast join. > This is a follow up to SPARK-37193 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38191) The staging directory of a write job only needs to be initialized once in HadoopMapReduceCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-38191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38191. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35693 [https://github.com/apache/spark/pull/35693] > The staging directory of write job only needs to be initialized once in > HadoopMapReduceCommitProtocol. > -- > > Key: SPARK-38191 > URL: https://issues.apache.org/jira/browse/SPARK-38191 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: weixiuli >Assignee: weixiuli >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37895) Error while joining two tables with non-english field names
[ https://issues.apache.org/jira/browse/SPARK-37895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500419#comment-17500419 ] Pablo Langa Blanco commented on SPARK-37895: I'm working on a fix > Error while joining two tables with non-english field names > --- > > Key: SPARK-37895 > URL: https://issues.apache.org/jira/browse/SPARK-37895 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Marina Krasilnikova >Priority: Minor > > While trying to join two tables with non-english field names in postgresql > with query like > "select view1.`Имя1` , view1.`Имя2`, view2.`Имя3` from view1 left join view2 > on view1.`Имя2`=view2.`Имя4`" > we get an error which says that there is no field "`Имя4`" (field name is > surrounded by backticks). > It appears that to get the data from the second table it constructs query like > SELECT "Имя3","Имя4" FROM "public"."tab2" WHERE ("`Имя4`" IS NOT NULL) > and these backticks are redundant in WHERE clause. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
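The bug above reduces to identifier quoting: Spark's backtick-quoted names must be re-quoted with the JDBC dialect's own quote character (double quotes for PostgreSQL) before being pushed down, instead of leaking the backticks into the generated `WHERE` clause. A minimal sketch of the intended conversion (the helper name is mine, not Spark's dialect API):

```python
# Strip Spark-style backticks and apply PostgreSQL double-quoting,
# escaping embedded double quotes by doubling them.

def quote_for_postgres(identifier: str) -> str:
    if identifier.startswith("`") and identifier.endswith("`") and len(identifier) >= 2:
        identifier = identifier[1:-1]
    return '"' + identifier.replace('"', '""') + '"'
```

Applied to the report's example, `` `Имя4` `` becomes `"Имя4"`, so the pushed-down filter `("Имя4" IS NOT NULL)` references a column that actually exists.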
[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38396: -- Description: This JIRA aims to improve K8s integration tests for the following recent activity. - Java 17 - Cloud K8s testing (including Graviton instances and K8s 1.22 GA on EKS on March 15th) - New K8s model (K8s clients) - New custom schedulers and statefulset was: This JIRA aims to improve K8s integration tests for the following recent activity. - Java 17 - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) - New K8s model (K8s clients) - New custom schedulers and statefulset > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: William Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.3.0 > > > This JIRA aims to improve K8s integration tests for the following recent > activity. > - Java 17 > - Cloud K8s testing (including Graviton instances and K8s 1.22 GA on EKS on > March 15th) > - New K8s model (K8s clients) > - New custom schedulers and statefulset -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38081) Support cloud-backend in K8s IT with SBT
[ https://issues.apache.org/jira/browse/SPARK-38081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38081: -- Fix Version/s: 3.2.2 > Support cloud-backend in K8s IT with SBT > > > Key: SPARK-38081 > URL: https://issues.apache.org/jira/browse/SPARK-38081 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes
[ https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500416#comment-17500416 ] Dongjoon Hyun edited comment on SPARK-38379 at 3/2/22, 10:54 PM: - Could you describe your procedure about this, [~tgraves]? I've been never running `spark-shell` on K8s cluster so far. bq. and starting a spark-shell in client mode was (Author: dongjoon): Could you describe your procedure about this, [~tgraves]? I've been never running `spark-shell` on K8s cluster so far. > and starting a spark-shell in client mode > Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes > -- > > Key: SPARK-38379 > URL: https://issues.apache.org/jira/browse/SPARK-38379 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Thomas Graves >Priority: Major > > I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in > client mode. I'm using persistent local volumes to mount nvme under /data in > the executors and on startup the driver always throws the warning below. > using these options: > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false > > > {code:java} > 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. 
> java.util.NoSuchElementException: spark.app.id > at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.SparkConf.get(SparkConf.scala:245) > at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:91) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391) > at 
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at
[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes
[ https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500416#comment-17500416 ] Dongjoon Hyun commented on SPARK-38379: --- Could you describe your procedure about this, [~tgraves]? I've been never running `spark-shell` on K8s cluster so far. > and starting a spark-shell in client mode > Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes > -- > > Key: SPARK-38379 > URL: https://issues.apache.org/jira/browse/SPARK-38379 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Thomas Graves >Priority: Major > > I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in > client mode. I'm using persistent local volumes to mount nvme under /data in > the executors and on startup the driver always throws the warning below. > using these options: > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false > > > {code:java} > 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. 
> java.util.NoSuchElementException: spark.app.id > at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.SparkConf.get(SparkConf.scala:245) > at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:91) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391) > at 
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117) >
[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38353: Description: Create the ticket since instrumenting _{_}enter{_}_ and _{_}exit{_}_ magic methods for pandas API on Spark can help improve accuracy of the usage data. Besides, we are interested in extending the pandas-on-Spark usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules. For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas-on-Spark usage logger records the internal call [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. was: Create the ticket since instrumenting __enter__ and __exit__ magic methods for pandas API on Spark can help improve accuracy of the usage data. Besides, we are interested in extending the pandas-on-Spark usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules. 
For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records the internal call [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since _{_}enter{_}_ and _{_}exit{_}_ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. > Instrument __enter__ and __exit__ magic methods for pandas API on Spark > --- > > Key: SPARK-38353 > URL: https://issues.apache.org/jira/browse/SPARK-38353 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > Create the ticket since instrumenting _{_}enter{_}_ and _{_}exit{_}_ magic > methods for pandas API on Spark can help improve accuracy of the usage data. > Besides, we are interested in extending the pandas-on-Spark usage logger to > other PySpark modules in the future so it will help improve accuracy of usage > data of other PySpark modules. 
> For example, for the following code: > > {code:java} > pdf = pd.DataFrame( > [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] > ) > psdf = ps.from_pandas(pdf) > with psdf.spark.cache() as cached_df: > self.assert_eq(isinstance(cached_df, CachedDataFrame), True) > self.assert_eq( > repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, > False, True)) > ){code} > > pandas-on-Spark usage logger records the internal call > [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] > since __enter__ and __exit__ methods of > [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] > are not instrumented. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
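"Instrumenting `__enter__` and `__exit__`" amounts to wrapping those magic methods on the class so that entering and leaving a `with` block is recorded by the usage logger. A self-contained sketch of the mechanism — the `calls` list and `CachedFrame` class are stand-ins for illustration, not PySpark's actual usage logger or `CachedDataFrame`:

```python
# Record invocations of context-manager magic methods by replacing them
# on the class (dunder lookup goes through the type, so the wrapper is hit).

calls = []

def instrument(cls, method_name):
    original = getattr(cls, method_name)
    def wrapper(self, *args, **kwargs):
        calls.append(method_name)       # stand-in for the usage logger
        return original(self, *args, **kwargs)
    setattr(cls, method_name, wrapper)

class CachedFrame:                       # stand-in for CachedDataFrame
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        return False

instrument(CachedFrame, "__enter__")
instrument(CachedFrame, "__exit__")

with CachedFrame():
    pass
```

Without such wrapping, only calls made *inside* the block (like the internal `self.spark.unpersist()` the ticket mentions) show up in the log, which is the accuracy gap this issue targets.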
[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38353: Description: Create the ticket since instrumenting __enter__ and __exit__ magic methods for pandas API on Spark can help improve accuracy of the usage data. Besides, we are interested in extending the pandas-on-Spark usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules. For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records the internal call [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since _{_}enter{_}_ and _{_}exit{_}_ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. was: Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve accuracy of the usage data. Besides, we are interested in extending the Pandas usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules. 
For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records the internal call [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. > Instrument __enter__ and __exit__ magic methods for pandas API on Spark > --- > > Key: SPARK-38353 > URL: https://issues.apache.org/jira/browse/SPARK-38353 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > Create the ticket since instrumenting __enter__ and __exit__ magic methods > for pandas API on Spark can help improve accuracy of the usage data. Besides, > we are interested in extending the pandas-on-Spark usage logger to other > PySpark modules in the future so it will help improve accuracy of usage data > of other PySpark modules. 
> For example, for the following code: > > {code:java} > pdf = pd.DataFrame( > [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] > ) > psdf = ps.from_pandas(pdf) > with psdf.spark.cache() as cached_df: > self.assert_eq(isinstance(cached_df, CachedDataFrame), True) > self.assert_eq( > repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, > False, True)) > ){code} > > pandas usage logger records the internal call > [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] > since _{_}enter{_}_ and _{_}exit{_}_ methods of > [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] > are not instrumented. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38353: Summary: Instrument __enter__ and __exit__ magic methods for pandas API on Spark (was: Instrument __enter__ and __exit__ magic methods for Pandas module) > Instrument __enter__ and __exit__ magic methods for pandas API on Spark > --- > > Key: SPARK-38353 > URL: https://issues.apache.org/jira/browse/SPARK-38353 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and > {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve > accuracy of the usage data. Besides, we are interested in extending the > Pandas usage logger to other PySpark modules in the future so it will help > improve accuracy of usage data of other PySpark modules. > For example, for the following code: > > {code:java} > pdf = pd.DataFrame( > [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] > ) > psdf = ps.from_pandas(pdf) > with psdf.spark.cache() as cached_df: > self.assert_eq(isinstance(cached_df, CachedDataFrame), True) > self.assert_eq( > repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, > False, True)) > ){code} > > pandas usage logger records the internal call > [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] > since __enter__ and __exit__ methods of > [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] > are not instrumented. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38396: -- Description: This JIRA aims to improve K8s integration tests for the following recent activity. - Java 17 - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) - New K8s model (K8s clients) - New custom schedulers and statefulset was: This JIRA aims to improve K8s integration tests for the following recent activity. - Java 17 - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) - New K8s model (K8s clients) - New custom schedulers > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: William Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.3.0 > > > This JIRA aims to improve K8s integration tests for the following recent > activity. > - Java 17 > - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) > - New K8s model (K8s clients) > - New custom schedulers and statefulset -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong
[ https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33206: -- Fix Version/s: 3.2.2 > Spark Shuffle Index Cache calculates memory usage wrong > --- > > Key: SPARK-33206 > URL: https://issues.apache.org/jira/browse/SPARK-33206 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0, 3.0.1 >Reporter: Lars Francke >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.3.0, 3.2.2 > > Attachments: image001(1).png > > > SPARK-21501 changed the spark shuffle index service to be based on memory > instead of the number of files. > Unfortunately, there's a problem with the calculation which is based on size > information provided by `ShuffleIndexInformation`. > It is based purely on the file size of the cached file on disk. > We're running in OOMs with very small index files (byte size ~16 bytes) but > the overhead of the ShuffleIndexInformation around this is much larger (e.g. > 184 bytes, see screenshot). We need to take this into account and should > probably add a fixed overhead of somewhere between 152 and 180 bytes > according to my tests. I'm not 100% sure what the correct number is and it'll > also depend on the architecture etc. so we can't be exact anyway. > If we do that we can maybe get rid of the size field in > ShuffleIndexInformation to save a few more bytes per entry. > In effect this means that for small files we use up about 70-100 times as > much memory as we intend to. Our NodeManagers OOM with 4GB and more of > indexShuffleCache. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
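The under-accounting can be shown with back-of-the-envelope arithmetic. The numbers below are the reporter's estimates (a ~16-byte index file, ~184 bytes of `ShuffleIndexInformation` wrapper overhead), not exact measurements; the 70-100x figure in the report suggests their workload had an even worse file-size-to-overhead ratio:

```python
# The cache weighs each entry by its on-disk index file size only,
# ignoring the JVM object overhead of the wrapper around it.
file_size_bytes = 16      # tiny shuffle index file, as counted by the cache
wrapper_overhead = 184    # per-entry overhead estimated from the heap dump

# True cost of an entry versus what the cache thinks it costs:
actual = file_size_bytes + wrapper_overhead
ratio = actual / file_size_bytes
print(ratio)  # 12.5: with these numbers the cache admits ~12x its budget
```

Since the overhead term is roughly constant per entry, the smaller the index files, the larger the multiplier — which is why adding a fixed per-entry overhead to the weigher fixes the OOMs.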
[jira] [Updated] (SPARK-38272) Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode and context name
[ https://issues.apache.org/jira/browse/SPARK-38272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38272: -- Fix Version/s: 3.2.2 > Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode > and context name > --- > > Key: SPARK-38272 > URL: https://issues.apache.org/jira/browse/SPARK-38272 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38396. --- Fix Version/s: 3.3.0 Resolution: Fixed > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: William Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.3.0 > > > This JIRA aims to improve K8s integration tests for the following recent > activity. > - Java 17 > - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) > - New K8s model (K8s clients) > - New custom schedulers -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38396: - Assignee: William Hyun > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: William Hyun >Priority: Major > Labels: releasenotes > > This JIRA aims to improve K8s integration tests for the following recent > activity. > - Java 17 > - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) > - New K8s model (K8s clients) > - New custom schedulers -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37354) Make the Java version installed on the container image used by the K8s integration tests with SBT configurable
[ https://issues.apache.org/jira/browse/SPARK-37354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37354: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Bug) > Make the Java version installed on the container image used by the K8s > integration tests with SBT configurable > -- > > Key: SPARK-37354 > URL: https://issues.apache.org/jira/browse/SPARK-37354 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > > I noticed that the default Java version installed on the container image used > by the K8s integration tests differs depending on how the tests are run. > If the tests are launched by Maven, Java 8 is installed. > On the other hand, if the tests are launched by SBT, Java 11 is installed. > Further, we have no way to change the version. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37319) Support K8s image building with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37319: -- Parent Issue: SPARK-38396 (was: SPARK-33772) > Support K8s image building with Java 17 > --- > > Key: SPARK-37319 > URL: https://issues.apache.org/jira/browse/SPARK-37319 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38302) Use Java 17 in K8S integration tests when setting spark-tgz
[ https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38302: -- Parent Issue: SPARK-38396 (was: SPARK-33772) > Use Java 17 in K8S integration tests when setting spark-tgz > --- > > Key: SPARK-38302 > URL: https://issues.apache.org/jira/browse/SPARK-38302 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Assignee: qian >Priority: Minor > Fix For: 3.3.0 > > > When setting parameters `spark-tgz` during integration tests, the error that > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > cannot be found occurs. This is due to the default value of > `spark.kubernetes.test.dockerFile` being > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`. > When using the tgz, the working directory is > `${spark.kubernetes.test.unpackSparkDir}`, and the relative path > `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17` > is invalid. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38269) Clean up redundant type cast
[ https://issues.apache.org/jira/browse/SPARK-38269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-38269. Fix Version/s: 3.3.0 Assignee: Yang Jie Resolution: Fixed > Clean up redundant type cast > > > Key: SPARK-38269 > URL: https://issues.apache.org/jira/browse/SPARK-38269 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37529) Support K8s integration tests for Java 17
[ https://issues.apache.org/jira/browse/SPARK-37529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37529: -- Parent Issue: SPARK-38396 (was: SPARK-33772) > Support K8s integration tests for Java 17 > - > > Key: SPARK-37529 > URL: https://issues.apache.org/jira/browse/SPARK-37529 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > > Now that we can build container image for Java 17, let's support K8s > integration tests for Java 17. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37645) Word spell error - "labeled" spells as "labled"
[ https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37645: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Improvement) > Word spell error - "labeled" spells as "labled" > --- > > Key: SPARK-37645 > URL: https://issues.apache.org/jira/browse/SPARK-37645 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Assignee: qian >Priority: Minor > Fix For: 3.3.0 > > > Word spell error - "labeled" spells as "labled" -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37998) Use `rbac.authorization.k8s.io/v1` instead of `v1beta1`
[ https://issues.apache.org/jira/browse/SPARK-37998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37998: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Test) > Use `rbac.authorization.k8s.io/v1` instead of `v1beta1` > --- > > Key: SPARK-37998 > URL: https://issues.apache.org/jira/browse/SPARK-37998 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > *$ kubectl version* > Client Version: version.Info\{Major:"1", Minor:"22", GitVersion:"v1.22.3", > GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", > BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", > Platform:"linux/amd64"} > Server Version: version.Info\{Major:"1", Minor:"23", GitVersion:"v1.23.1", > GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", > BuildDate:"2021-12-16T11:34:54Z", GoVersion:"go1.17.5", Compiler:"gc", > Platform:"linux/amd64"} > *$ k apply -f > resource-managers/kubernetes/integration-tests/dev/spark-rbac.yaml* > namespace/spark created > serviceaccount/spark-sa created > *unable to recognize > "resource-managers/kubernetes/integration-tests/dev/spark-rbac.yaml": no > matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1beta1"* > *unable to recognize > "resource-managers/kubernetes/integration-tests/dev/spark-rbac.yaml": no > matches for kind "ClusterRoleBinding" in version > "rbac.authorization.k8s.io/v1beta1"* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38081) Support cloud-backend in K8s IT with SBT
[ https://issues.apache.org/jira/browse/SPARK-38081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38081: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Test) > Support cloud-backend in K8s IT with SBT > > > Key: SPARK-38081 > URL: https://issues.apache.org/jira/browse/SPARK-38081 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38241) Close KubernetesClient in K8S integrations tests
[ https://issues.apache.org/jira/browse/SPARK-38241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38241: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Task) > Close KubernetesClient in K8S integrations tests > > > Key: SPARK-38241 > URL: https://issues.apache.org/jira/browse/SPARK-38241 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.2.1 >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > Fix For: 3.3.0 > > > The implementations of > org.apache.spark.deploy.k8s.integrationtest.backend.IntegrationTestBackend > should close their KubernetesClient instance in #cleanUp() method. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-36553. Fix Version/s: 3.1.3 3.3.0 3.2.2 Assignee: zhengruifeng Resolution: Fixed > KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 > was introduced > > > Key: SPARK-36553 > URL: https://issues.apache.org/jira/browse/SPARK-36553 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.1.1 >Reporter: Anders Rydbirk >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.3, 3.3.0, 3.2.2 > > > We are running KMeans on approximately 350M rows of x, y, z coordinates using > the following configuration: > {code:java} > KMeans( > featuresCol='features', > predictionCol='centroid_id', > k=50000, > initMode='k-means||', > initSteps=2, > tol=0.5, > maxIter=20, > seed=SEED, > distanceMeasure='euclidean' > ) > {code} > When using Spark 3.0.0 this worked fine, but when upgrading to 3.1.1 we are > consistently getting errors unless we reduce K.
> Stacktrace: > > {code:java} > An error occurred while calling o167.fit.An error occurred while calling > o167.fit.: java.lang.NegativeArraySizeException: -897458648 at > scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194) at > scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191) at > scala.Array$.ofDim(Array.scala:221) at > org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52) > at > org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280) > at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231) > at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354) at > org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) > at scala.util.Try$.apply(Try.scala:213) at > org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) > at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown > Source) at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown > Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:282) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.base/java.lang.Thread.run(Unknown Source) > {code} > > The issue is introduced by > [#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]] > which significantly reduces the maximum value of K. 
Snippet of the line that > throws the error, from DistanceMeasure.scala (line 52): > {code:java} > val packedValues = Array.ofDim[Double](k * (k + 1) / 2) > {code} > > *What we have tried:* > * Reducing iterations > * Reducing input volume > * Reducing K > Only reducing K has yielded success. > > *Possible workarounds:* > # Roll back to Spark 3.0.0, since a KMeansModel generated with 3.0.0 cannot > be loaded in 3.1.1. > # Reduce K. Currently trying with 45000. > > *What we don't understand:* > Given the line of code above, we do not understand why we would get an > integer overflow. > For K=50,000, packedValues should be allocated with a size of 1,250,025,000 > < 2^31 and not result in a negative array size. > > *Suggested resolution:* > I'm not strong in the inner workings of KMeans, but my immediate thought > would be to add a fallback to the previous logic for K larger than a set > threshold if the optimisation is to stay in place, as it breaks compatibility > from 3.0.0 to 3.1.1 for edge cases. > > Please let me know if more information is needed; this is my first time > raising a bug for an OSS project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
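There is a plausible answer to the puzzle of why a final size below 2^31 still comes out negative: in `k * (k + 1) / 2`, the intermediate product `k * (k + 1)` is evaluated first in 32-bit `Int` arithmetic, and for k = 50,000 that product (2,500,050,000) exceeds Int.MaxValue before the division ever happens. A sketch simulating JVM `Int` wraparound reproduces the exact value from the stack trace:

```python
def as_jvm_int(x):
    # Truncate an unbounded Python int to a signed 32-bit JVM Int
    # (two's-complement wraparound, as Scala Int arithmetic behaves).
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

k = 50000
product = as_jvm_int(k * (k + 1))  # 2,500,050,000 wraps past 2^31
size = as_jvm_int(product // 2)    # division is exact here, so // matches Scala's /
print(product, size)  # -1794917296 -897458648
```

The resulting -897,458,648 matches the `NegativeArraySizeException: -897458648` in the report, so computing the intermediate product as a `Long` (e.g. `(k.toLong * (k + 1) / 2)`) with a bounds check would be one way to restore the larger usable range of K.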
[jira] [Updated] (SPARK-38272) Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode and context name
[ https://issues.apache.org/jira/browse/SPARK-38272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38272: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Improvement) > Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode > and context name > --- > > Key: SPARK-38272 > URL: https://issues.apache.org/jira/browse/SPARK-38272 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38022) Use relativePath for K8s remote file test in BasicTestsSuite
[ https://issues.apache.org/jira/browse/SPARK-38022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38022: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Test) > Use relativePath for K8s remote file test in BasicTestsSuite > > > Key: SPARK-38022 > URL: https://issues.apache.org/jira/browse/SPARK-38022 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > *BEFORE* > {code:java} > $ build/sbt -Pkubernetes -Pkubernetes-integration-tests > -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 > -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test" > ... > [info] KubernetesSuite: > ... > [info] - Run SparkRemoteFileTest using a remote data file *** FAILED *** (3 > minutes, 3 seconds) > [info] The code passed to eventually never returned normally. Attempted 190 > times over 3.01226506667 minutes. Last failure message: false was not > true. (KubernetesSuite.scala:452) > ... {code} > *AFTER* > {code:java} > $ build/sbt -Pkubernetes -Pkubernetes-integration-tests > -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17 > -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test" > ... > [info] KubernetesSuite: > ... > [info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 608 > milliseconds){code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38396: -- Description: This JIRA aims to improve K8s integration tests for the following recent activity. - Java 17 - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) - New K8s model (K8s clients) - New custom schedulers was: This JIRA aims to improve K8s integration tests for the following recent activity. - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) - New K8s model (K8s clients) - New custom schedulers > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: releasenotes > > This JIRA aims to improve K8s integration tests for the following recent > activity. > - Java 17 > - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) > - New K8s model (K8s clients) > - New custom schedulers -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38396: -- Description: This JIRA aims to improve K8s integration tests for the following recent activity. - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) - New K8s model (K8s clients) - New custom schedulers > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: releasenotes > > This JIRA aims to improve K8s integration tests for the following recent > activity. > - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th) > - New K8s model (K8s clients) > - New custom schedulers -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38244) Upgrade kubernetes-client to 5.12.1
[ https://issues.apache.org/jira/browse/SPARK-38244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38244: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Bug) > Upgrade kubernetes-client to 5.12.1 > --- > > Key: SPARK-38244 > URL: https://issues.apache.org/jira/browse/SPARK-38244 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38396: -- Labels: releasenotes (was: ) > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: releasenotes > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38396: -- Target Version/s: 3.3.0 > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0, 3.2.2 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38396: -- Affects Version/s: (was: 3.2.2) > Improve K8s Integration Tests > - > > Key: SPARK-38396 > URL: https://issues.apache.org/jira/browse/SPARK-38396 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38048) Add IntegrationTestBackend.describePods to support all K8s test backends
[ https://issues.apache.org/jira/browse/SPARK-38048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38048: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Test) > Add IntegrationTestBackend.describePods to support all K8s test backends > > > Key: SPARK-38048 > URL: https://issues.apache.org/jira/browse/SPARK-38048 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38029) Support docker-desktop K8S integration test in SBT
[ https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38029: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Test) > Support docker-desktop K8S integration test in SBT > -- > > Key: SPARK-38029 > URL: https://issues.apache.org/jira/browse/SPARK-38029 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38392) Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT
[ https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38392: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Improvement) > Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT > --- > > Key: SPARK-38392 > URL: https://issues.apache.org/jira/browse/SPARK-38392 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > Currently when there is no configured K8S namespace by the user the > Kubernetes IT tests use UUID.toString() without the '-'es: > {code:java} > val namespace = > namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) > {code} > > Proposal: prefix the temporary namespace with "spark-" or "spark-test-". > > The name of the driver pod is: > {code:java} > driverPodName = "spark-test-app-" + > UUID.randomUUID().toString.replaceAll("-", "") {code} > i.e. it does not mention that it is the driver. > For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, > i.e. the value of "–name" config + hash + "-driver". > In both IT tests and non-test the executor pods always mention "-exec-", so > they are clear. > > Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + > "-driver" for the IT tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
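The proposed naming scheme can be sketched as follows; the helper names are hypothetical, and only the string shapes ("spark-test-" prefix, dashless UUID, "-driver" suffix) come from the ticket:

```python
import uuid

def test_namespace():
    # Prefix the throwaway namespace so leaked test namespaces are
    # identifiable, keeping the existing dashless-UUID convention.
    return "spark-test-" + uuid.uuid4().hex

def driver_pod_name():
    # Align IT driver pod names with the non-test "<name>-<hash>-driver" shape.
    return "spark-test-app-" + uuid.uuid4().hex[:16] + "-driver"

ns = test_namespace()
pod = driver_pod_name()
print(ns)
print(pod)
```

With this shape, both driver and executor pods (`...-exec-...`) state their role in the name, and any namespace left behind by a failed test run is immediately recognizable as Spark's.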
[jira] [Updated] (SPARK-38072) Support K8s imageTag parameter in SBT K8s IT
[ https://issues.apache.org/jira/browse/SPARK-38072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38072: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Test) > Support K8s imageTag parameter in SBT K8s IT > > > Key: SPARK-38072 > URL: https://issues.apache.org/jira/browse/SPARK-38072 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38071) Support K8s namespace parameter in SBT K8s IT
[ https://issues.apache.org/jira/browse/SPARK-38071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38071: -- Parent: SPARK-38396 Issue Type: Sub-task (was: Test) > Support K8s namespace parameter in SBT K8s IT > - > > Key: SPARK-38071 > URL: https://issues.apache.org/jira/browse/SPARK-38071 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38396) Improve K8s Integration Tests
Dongjoon Hyun created SPARK-38396: - Summary: Improve K8s Integration Tests Key: SPARK-38396 URL: https://issues.apache.org/jira/browse/SPARK-38396 Project: Spark Issue Type: Umbrella Components: Kubernetes, Tests Affects Versions: 3.3.0, 3.2.2 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38048) Add IntegrationTestBackend.describePods to support all K8s test backends
[ https://issues.apache.org/jira/browse/SPARK-38048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38048: -- Fix Version/s: 3.2.2 > Add IntegrationTestBackend.describePods to support all K8s test backends > > > Key: SPARK-38048 > URL: https://issues.apache.org/jira/browse/SPARK-38048 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes
[ https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500354#comment-17500354 ] Thomas Graves commented on SPARK-38379: --- just going by the stack trace this looks related to change https://issues.apache.org/jira/browse/SPARK-35182 [~dongjoon] Just curious if you have run into this? > Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes > -- > > Key: SPARK-38379 > URL: https://issues.apache.org/jira/browse/SPARK-38379 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Thomas Graves >Priority: Major > > I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in > client mode. I'm using persistent local volumes to mount nvme under /data in > the executors and on startup the driver always throws the warning below. > using these options: > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data > \ > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false > > > {code:java} > 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. 
> java.util.NoSuchElementException: spark.app.id > at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.SparkConf.get(SparkConf.scala:245) > at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57) > at > org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:91) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391) > at 
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117) >
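The trace boils down to `SparkConf.get` throwing when `spark.app.id` has not yet been set at the time `MountVolumesFeatureStep` constructs the executor volumes. A minimal Python sketch of that failure mode (a toy `Conf` class, not Spark code):

```python
class Conf:
    """Toy stand-in for SparkConf: get() raises when the key is absent
    and no default is supplied."""
    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value

    def get(self, key, default=None):
        if key in self._settings:
            return self._settings[key]
        if default is None:
            # Analogue of the NoSuchElementException in the trace above.
            raise KeyError(key)
        return default

conf = Conf()
# Reading spark.app.id before the application has registered it fails:
try:
    conf.get("spark.app.id")
    raised = False
except KeyError:
    raised = True
# A lookup with a fallback avoids the crash (one possible mitigation):
app_id = conf.get("spark.app.id", "spark-app-unknown")
```

Whether deferring the read or supplying a default is the right fix for Spark itself is a separate question; the sketch only illustrates why the warning appears.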
[jira] [Updated] (SPARK-38072) Support K8s imageTag parameter in SBT K8s IT
[ https://issues.apache.org/jira/browse/SPARK-38072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38072: -- Fix Version/s: 3.2.2 > Support K8s imageTag parameter in SBT K8s IT > > > Key: SPARK-38072 > URL: https://issues.apache.org/jira/browse/SPARK-38072 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38071) Support K8s namespace parameter in SBT K8s IT
[ https://issues.apache.org/jira/browse/SPARK-38071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38071: -- Fix Version/s: 3.2.2 > Support K8s namespace parameter in SBT K8s IT > - > > Key: SPARK-38071 > URL: https://issues.apache.org/jira/browse/SPARK-38071 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38029) Support docker-desktop K8S integration test in SBT
[ https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38029: -- Fix Version/s: 3.2.2 > Support docker-desktop K8S integration test in SBT > -- > > Key: SPARK-38029 > URL: https://issues.apache.org/jira/browse/SPARK-38029 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module
[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yihong He updated SPARK-38353: -- Description: Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve accuracy of the usage data. Besides, we are interested in extending the Pandas usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules. For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records the internal call [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. was: Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve accuracy of the usage data. Besides, we are interested in extending the Pandas usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules. 
For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. > Instrument __enter__ and __exit__ magic methods for Pandas module > - > > Key: SPARK-38353 > URL: https://issues.apache.org/jira/browse/SPARK-38353 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and > {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve > accuracy of the usage data. Besides, we are interested in extending the > Pandas usage logger to other PySpark modules in the future so it will help > improve accuracy of usage data of other PySpark modules. 
> For example, for the following code: > > {code:java} > pdf = pd.DataFrame( > [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] > ) > psdf = ps.from_pandas(pdf) > with psdf.spark.cache() as cached_df: > self.assert_eq(isinstance(cached_df, CachedDataFrame), True) > self.assert_eq( > repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, > False, True)) > ){code} > > pandas usage logger records the internal call > [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] > since __enter__ and __exit__ methods of > [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] > are not instrumented. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module
[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yihong He updated SPARK-38353: -- Description: Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve accuracy of the usage data. Besides, we are interested in extending the Pandas usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules. For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. was: Create the ticket since instrumenting _{_}enter{_}_ and _{_}exit{_}_ magic methods for Pandas module can help improve accuracy of the usage data. Besides, we are interested in extend the Pandas usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules. 
For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. > Instrument __enter__ and __exit__ magic methods for Pandas module > - > > Key: SPARK-38353 > URL: https://issues.apache.org/jira/browse/SPARK-38353 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and > {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve > accuracy of the usage data. Besides, we are interested in extending the > Pandas usage logger to other PySpark modules in the future so it will help > improve accuracy of usage data of other PySpark modules. 
> For example, for the following code: > > {code:java} > pdf = pd.DataFrame( > [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] > ) > psdf = ps.from_pandas(pdf) > with psdf.spark.cache() as cached_df: > self.assert_eq(isinstance(cached_df, CachedDataFrame), True) > self.assert_eq( > repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, > False, True)) > ){code} > > pandas usage logger records > [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] > since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of > [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] > are not instrumented. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module
[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yihong He updated SPARK-38353: -- Description: Create the ticket since instrumenting __enter__ and __exit__ magic methods for the Pandas module can help improve accuracy of the usage data. Besides, we are interested in extending the Pandas usage logger to other PySpark modules in the future, so it will help improve accuracy of the usage data of other PySpark modules as well. For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} the pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since the __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. was: For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. 
So instrumenting __enter__ and __exit__ magic methods for Pandas module can help improve accuracy of the usage data > Instrument __enter__ and __exit__ magic methods for Pandas module > - > > Key: SPARK-38353 > URL: https://issues.apache.org/jira/browse/SPARK-38353 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > Create the ticket since instrumenting _{_}enter{_}_ and _{_}exit{_}_ magic > methods for Pandas module can help improve accuracy of the usage data. > Besides, we are interested in extend the Pandas usage logger to other PySpark > modules in the future so it will help improve accuracy of usage data of other > PySpark modules. > For example, for the following code: > > {code:java} > pdf = pd.DataFrame( > [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] > ) > psdf = ps.from_pandas(pdf) > with psdf.spark.cache() as cached_df: > self.assert_eq(isinstance(cached_df, CachedDataFrame), True) > self.assert_eq( > repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, > False, True)) > ){code} > > pandas usage logger records > [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] > since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of > [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] > are not instrumented. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
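The idea in this ticket can be sketched in plain Python. This is not the actual PySpark usage logger, just a minimal illustration with hypothetical names (`instrument`, `CALL_LOG`) of wrapping `__enter__`/`__exit__` so the context-manager usage itself is recorded, not only the internal `unpersist` call:

```python
CALL_LOG = []  # stand-in for the usage logger's sink

def instrument(cls, methods=("__enter__", "__exit__")):
    # Replace each magic method with a wrapper that logs the call,
    # then delegates to the original implementation.
    for name in methods:
        original = getattr(cls, name)
        def wrapper(self, *args, _orig=original, _name=name, **kwargs):
            CALL_LOG.append(f"{cls.__name__}.{_name}")
            return _orig(self, *args, **kwargs)
        setattr(cls, name, wrapper)
    return cls

@instrument
class CachedFrame:  # toy analogue of pandas-on-Spark's CachedDataFrame
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        self.unpersist()
        return False
    def unpersist(self):
        CALL_LOG.append("CachedFrame.unpersist")

with CachedFrame():
    pass
```

After the `with` block, the log contains the `__enter__` and `__exit__` events before the internal `unpersist` event, so the usage data attributes the activity to the context manager rather than to an internal call.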
[jira] [Updated] (SPARK-38392) [K8S] Improve the names used for K8S namespace and driver pod in integration tests
[ https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38392: -- Component/s: Tests > [K8S] Improve the names used for K8S namespace and driver pod in integration > tests > -- > > Key: SPARK-38392 > URL: https://issues.apache.org/jira/browse/SPARK-38392 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > Currently when there is no configured K8S namespace by the user the > Kubernetes IT tests use UUID.toString() without the '-'es: > {code:java} > val namespace = > namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) > {code} > > Proposal: prefix the temporary namespace with "spark-" or "spark-test-". > > The name of the driver pod is: > {code:java} > driverPodName = "spark-test-app-" + > UUID.randomUUID().toString.replaceAll("-", "") {code} > i.e. it does not mention that it is the driver. > For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, > i.e. the value of "–name" config + hash + "-driver". > In both IT tests and non-test the executor pods always mention "-exec-", so > they are clear. > > Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + > "-driver" for the IT tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500321#comment-17500321 ] James Grinter commented on SPARK-20050: --- You're just working around a genuine bug, and behaviour that is documented as how you're supposed to do it (and using the offsets is a better way, as monitoring tools can then see the consumer-group progress and detect that there's a problem, instead of it being hidden away in a private implementation). > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru >Priority: Major > > I use Kafka 0.10 DirectStream with properties 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' finally in each batches, such > below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) 
> kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records which were processed in the last batch before the graceful > shutdown are reprocessed in the first batch after Spark Streaming restarts, as shown > below > * output of the first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // this is the last record before > shutting down Spark Streaming gracefully > {code} > * output of the re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 offset: 101452481 > {code} > This may be because offsets specified in commitAsync are committed at the head of the next > batch. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
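The reported behavior can be reproduced abstractly: if a batch's commitAsync is only applied at the head of the following batch, the final batch's commit is lost on graceful shutdown, and a restart that resumes from the committed offset replays that batch. A minimal Python simulation (hypothetical `run` helper, offsets as plain integers, not Kafka code):

```python
def run(stream_pos, committed, batch_size, batches):
    """Simulate a streaming run. commitAsync for batch N is applied at
    the head of batch N+1, as described in this issue. Returns the
    processed offsets, the in-memory stream position, and the offset
    actually committed when the run stops."""
    processed = []
    pending = None  # commit requested by the previous batch, not applied yet
    for _ in range(batches):
        if pending is not None:
            committed = pending          # previous batch's commit lands now
        batch = list(range(stream_pos, stream_pos + batch_size))
        processed.extend(batch)
        stream_pos += batch_size
        pending = stream_pos             # commitAsync: deferred to next batch
    # Graceful shutdown here: `pending` is never applied.
    return processed, stream_pos, committed

# First run: offsets 0..8 are processed, but only up to 6 is committed.
first, _, committed = run(stream_pos=0, committed=0, batch_size=3, batches=3)
# Restart resumes from the committed offset -> 6..8 are reprocessed.
second, _, _ = run(stream_pos=committed, committed=committed,
                   batch_size=3, batches=2)
duplicates = sorted(set(first) & set(second))  # the last batch, replayed
```

This is at-least-once behavior by design; committing the final batch before shutdown (or making downstream processing idempotent) removes the visible duplicates.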
[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module
[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yihong He updated SPARK-38353: -- Description: For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. > Instrument __enter__ and __exit__ magic methods for Pandas module > - > > Key: SPARK-38353 > URL: https://issues.apache.org/jira/browse/SPARK-38353 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > For example, for the following code: > > {code:java} > pdf = pd.DataFrame( > [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] > ) > psdf = ps.from_pandas(pdf) > with psdf.spark.cache() as cached_df: > self.assert_eq(isinstance(cached_df, CachedDataFrame), True) > self.assert_eq( > repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, > False, True)) > ){code} > > pandas usage logger records > [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] > since __enter__ and __exit__ methods of > [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] > are not instrumented. 
> > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38392) Add spark- prefix to namespaces and -driver suffix to drivers during IT
[ https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38392: -- Summary: Add spark- prefix to namespaces and -driver suffix to drivers during IT (was: [K8S] Improve the names used for K8S namespace and driver pod in integration tests) > Add spark- prefix to namespaces and -driver suffix to drivers during IT > --- > > Key: SPARK-38392 > URL: https://issues.apache.org/jira/browse/SPARK-38392 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > Currently when there is no configured K8S namespace by the user the > Kubernetes IT tests use UUID.toString() without the '-'es: > {code:java} > val namespace = > namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) > {code} > > Proposal: prefix the temporary namespace with "spark-" or "spark-test-". > > The name of the driver pod is: > {code:java} > driverPodName = "spark-test-app-" + > UUID.randomUUID().toString.replaceAll("-", "") {code} > i.e. it does not mention that it is the driver. > For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, > i.e. the value of "–name" config + hash + "-driver". > In both IT tests and non-test the executor pods always mention "-exec-", so > they are clear. > > Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + > "-driver" for the IT tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module
[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yihong He updated SPARK-38353: -- Description: For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since _{_}enter{_}_ and _{_}exit{_}_ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. So instrumenting __enter__ and __exit__ magic methods for Pandas module can help improve accuracy of the usage data was: For example, for the following code: {code:java} pdf = pd.DataFrame( [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] ) psdf = ps.from_pandas(pdf) with psdf.spark.cache() as cached_df: self.assert_eq(isinstance(cached_df, CachedDataFrame), True) self.assert_eq( repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True)) ){code} pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented. 
> Instrument __enter__ and __exit__ magic methods for Pandas module > - > > Key: SPARK-38353 > URL: https://issues.apache.org/jira/browse/SPARK-38353 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > For example, for the following code: > > {code:java} > pdf = pd.DataFrame( > [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"] > ) > psdf = ps.from_pandas(pdf) > with psdf.spark.cache() as cached_df: > self.assert_eq(isinstance(cached_df, CachedDataFrame), True) > self.assert_eq( > repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, > False, True)) > ){code} > > pandas usage logger records > [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] > since _{_}enter{_}_ and _{_}exit{_}_ methods of > [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] > are not instrumented. > So instrumenting __enter__ and __exit__ magic methods for Pandas module can > help improve accuracy of the usage data -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38392) Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT
[ https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38392: -- Summary: Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT (was: Add spark- prefix to namespaces and -driver suffix to drivers during IT) > Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT > --- > > Key: SPARK-38392 > URL: https://issues.apache.org/jira/browse/SPARK-38392 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > Currently when there is no configured K8S namespace by the user the > Kubernetes IT tests use UUID.toString() without the '-'es: > {code:java} > val namespace = > namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) > {code} > > Proposal: prefix the temporary namespace with "spark-" or "spark-test-". > > The name of the driver pod is: > {code:java} > driverPodName = "spark-test-app-" + > UUID.randomUUID().toString.replaceAll("-", "") {code} > i.e. it does not mention that it is the driver. > For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, > i.e. the value of "–name" config + hash + "-driver". > In both IT tests and non-test the executor pods always mention "-exec-", so > they are clear. > > Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + > "-driver" for the IT tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38392) [K8S] Improve the names used for K8S namespace and driver pod in integration tests
[ https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38392: - Assignee: Martin Tzvetanov Grigorov > [K8S] Improve the names used for K8S namespace and driver pod in integration > tests > -- > > Key: SPARK-38392 > URL: https://issues.apache.org/jira/browse/SPARK-38392 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > > Currently when there is no configured K8S namespace by the user the > Kubernetes IT tests use UUID.toString() without the '-'es: > {code:java} > val namespace = > namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) > {code} > > Proposal: prefix the temporary namespace with "spark-" or "spark-test-". > > The name of the driver pod is: > {code:java} > driverPodName = "spark-test-app-" + > UUID.randomUUID().toString.replaceAll("-", "") {code} > i.e. it does not mention that it is the driver. > For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, > i.e. the value of "–name" config + hash + "-driver". > In both IT tests and non-test the executor pods always mention "-exec-", so > they are clear. > > Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + > "-driver" for the IT tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38392) [K8S] Improve the names used for K8S namespace and driver pod in integration tests
[ https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38392. --- Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed Issue resolved by pull request 35711 [https://github.com/apache/spark/pull/35711] > [K8S] Improve the names used for K8S namespace and driver pod in integration > tests > -- > > Key: SPARK-38392 > URL: https://issues.apache.org/jira/browse/SPARK-38392 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Martin Tzvetanov Grigorov >Assignee: Martin Tzvetanov Grigorov >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > Currently when there is no configured K8S namespace by the user the > Kubernetes IT tests use UUID.toString() without the '-'es: > {code:java} > val namespace = > namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) > {code} > > Proposal: prefix the temporary namespace with "spark-" or "spark-test-". > > The name of the driver pod is: > {code:java} > driverPodName = "spark-test-app-" + > UUID.randomUUID().toString.replaceAll("-", "") {code} > i.e. it does not mention that it is the driver. > For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, > i.e. the value of "–name" config + hash + "-driver". > In both IT tests and non-test the executor pods always mention "-exec-", so > they are clear. > > Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + > "-driver" for the IT tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38395) Pyspark issue in resolving column when there is dot (.)
Krishna Sangeeth KS created SPARK-38395: --- Summary: Pyspark issue in resolving column when there is dot (.) Key: SPARK-38395 URL: https://issues.apache.org/jira/browse/SPARK-38395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Environment: Issue found in Mac OS Catalina, Pyspark 3.0 Reporter: Krishna Sangeeth KS PySpark's applyInPandas has some difficulty resolving columns when there is a dot in the column name. Here is an example which reproduces the issue, made by modifying the doctest example [here|https://github.com/apache/spark/blob/branch-3.0/python/pyspark/sql/pandas/group_ops.py#L237-L248] {code:python} df1 = spark.createDataFrame( [(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 4.0)], ("abc|database|10.159.154|xef", "id", "v1")) df2 = spark.createDataFrame( [(2101, 1, "x"), (2101, 2, "y")], ("abc|database|10.159.154|xef", "id", "v2")) def asof_join(l, r): return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", by="id") df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas( asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 double, v2 string").show(){code} This gives the below error {code:python} AnalysisException Traceback (most recent call last) in 8 return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", by="id") 9 df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas( ---> 10 asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 double, v2 string").show() ~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py in applyInPandas(self, func, schema) 295 udf = pandas_udf( 296 func, returnType=schema, functionType=PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF) --> 297 all_cols = self._extract_cols(self._gd1) + self._extract_cols(self._gd2) 298 udf_column = udf(*all_cols) 299 jdf = self._gd1._jgd.flatMapCoGroupsInPandas(self._gd2._jgd, udf_column._jc.expr()) 
~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py in _extract_cols(gd) 303 def _extract_cols(gd): 304 df = gd._df --> 305 return [df[col] for col in df.columns] 306 307 ~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py in (.0) 303 def _extract_cols(gd): 304 df = gd._df --> 305 return [df[col] for col in df.columns] 306 307 ~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/dataframe.py in __getitem__(self, item) 1378 """ 1379 if isinstance(item, basestring): -> 1380 jc = self._jdf.apply(item) 1381 return Column(jc) 1382 elif isinstance(item, Column): ~/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args) 1303 answer = self.gateway_client.send_command(command) 1304 return_value = get_return_value( -> 1305 answer, self.gateway_client, self.target_id, self.name) 1306 1307 for temp_arg in temp_args: ~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw) 135 # Hide where the exception came from that shows a non-Pythonic 136 # JVM exception message. --> 137 raise_from(converted) 138 else: 139 raise ~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in raise_from(e) AnalysisException: Cannot resolve column name "abc|database|10.159.154|xef" among (abc|database|10.159.154|xef, id, v1); did you mean to quote the `abc|database|10.159.154|xef` column?; {code} As we can see the column is present there in the `among` list. When i replace `.` (dot) with `_` (underscore) the code actually works. 
{code:python} df1 = spark.createDataFrame( [(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 4.0)], ("abc|database|10_159_154|xef", "id", "v1")) df2 = spark.createDataFrame( [(2101, 1, "x"), (2101, 2, "y")], ("abc|database|10_159_154|xef", "id", "v2")) def asof_join(l, r): return pd.merge_asof(l, r, on="abc|database|10_159_154|xef", by="id") df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas( asof_join, schema="`abc|database|10_159_154|xef` int, id int, v1 double, v2 string").show() {code} {code:java} +---+---+---+---+ |abc|database|10_159_154|xef| id| v1| v2| +---+---+---+---+ | 2101| 1|1.0| x| | 2102| 1|3.0| x| | 2101| 2|2.0| y| | 2102| 2|4.0| y| +---+---+---+---+ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
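The error message itself hints at the workaround: Spark parses an unquoted dot as struct-field access, so a column name containing `.` must be backtick-quoted when it is looked up. A pure-Python sketch of that quoting rule (the helper is illustrative; against a real DataFrame one would index with the backtick-quoted name instead of the bare one):

```python
def quote_if_needed(col_name):
    """Backtick-quote a column name when it contains a dot, so a
    resolver does not parse it as `struct.field` access."""
    already_quoted = col_name.startswith("`") and col_name.endswith("`")
    if "." in col_name and not already_quoted:
        # Escape any embedded backticks by doubling them, as Spark SQL does.
        return "`" + col_name.replace("`", "``") + "`"
    return col_name

print(quote_if_needed("abc|database|10.159.154|xef"))  # gets backtick-quoted
print(quote_if_needed("plain_column"))                 # left untouched
```

This is why replacing `.` with `_` in the report above makes the example work: the underscore name never triggers the struct-field interpretation in the first place.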
[jira] [Updated] (SPARK-38388) Repartition + Stage retries could lead to incorrect data
[ https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Xu updated SPARK-38388: - Description: Spark repartition uses RoundRobinPartitioning, the generated results are non-deterministic when data has some randomness and stage/task retries happen. The bug can be triggered when upstream data has some randomness, a repartition is called on it, then followed by a result stage (could be more stages). As the pattern shows below: upstream stage (data with randomness) -> (repartition shuffle) -> result stage When one executor goes down at the result stage, some tasks of that stage might have finished, others would fail, shuffle files on that executor also get lost, and some tasks from the previous stage (upstream data generation, repartition) will need to rerun to generate the dependent shuffle data files. Because the data has some randomness, the regenerated data in the upstream retried tasks is slightly different, repartition then generates an inconsistent ordering, and the retried tasks at the result stage generate different data. This is similar but different to https://issues.apache.org/jira/browse/SPARK-23207, the fix for it uses an extra local sort to make the row ordering deterministic, the sorting algorithm it uses simply compares row/record binaries. But in this case, the upstream data has some randomness, the sorting algorithm doesn't help keep the order, thus RoundRobinPartitioning introduces a non-deterministic result. The following code returns 986415, instead of 100: {code:java} import scala.sys.process._ import org.apache.spark.TaskContext case class TestObject(id: Long, value: Double) val ds = spark.range(0, 1000 * 1000, 1).repartition(100, $"id").withColumn("val", rand()).repartition(100).map { row => if (TaskContext.get.stageAttemptNumber == 0 && TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) { throw new Exception("pkill -f java".!!) 
} TestObject(row.getLong(0), row.getDouble(1)) } ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table") spark.sql("select count(distinct id) from tmp.test_table").show{code} Command: {code:java} spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false --conf spark.shuffle.service.enabled=false){code} To simulate the issue, disable external shuffle service is needed (if it's also enabled by default in your environment), this is to trigger shuffle file loss and previous stage retries. In our production, we have external shuffle service enabled, this data correctness issue happened when there were node losses. Although there's some non-deterministic factor in upstream data, user wouldn't expect to see incorrect result. was: Spark repartition uses RoundRobinPartitioning, the generated results is non-deterministic when data has some randomness and stage/task retries happen. The bug can be triggered when upstream data has some randomness, a repartition is called on them, then followed by result stage (could be more stages). As the pattern shows below: upstream stage (data with randomness) -> (repartition shuffle) -> result stage When one executor goes down at result stage, some tasks of that stage might have finished, others would fail, shuffle files on that executor also get lost, some tasks from previous stage (upstream data generation, repartition) will need to rerun to generate dependent shuffle data files. Because data has some randomness, regenerated data in upstream retried tasks is slightly different, repartition then generates inconsistent ordering, then tasks at result stage will be retried generating different data. This is similar but different to https://issues.apache.org/jira/browse/SPARK-23207, fix for it uses extra local sort to make the row ordering deterministic, the sorting algorithm it uses simply compares row/record binaries. 
But in this case, upstream data has some randomness, the sorting algorithm doesn't help keep the order, thus RoundRobinPartitioning introduced non-deterministic result. The following code returns 998818, instead of 100: {code:java} import scala.sys.process._ import org.apache.spark.TaskContext case class TestObject(id: Long, value: Double) val ds = spark.range(0, 1000 * 1000, 1).repartition(100, $"id").withColumn("val", rand()).repartition(100).map { row => if (TaskContext.get.stageAttemptNumber == 0 && TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) { throw new Exception("pkill -f java".!!) } TestObject(row.getLong(0), row.getDouble(1)) } ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table") spark.sql("select count(distinct id) from tmp.test_table").show{code} Command: {code:java} spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false --conf spark.shuffle.service.enabled=false){code} To simulate the issue, disable external shuffle service
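The failure mode can be reproduced outside Spark: round-robin partitioning assigns rows purely by position, so regenerating the "same" rows in a different order puts different rows in each partition. A minimal Python sketch (illustrative only, not Spark's RoundRobinPartitioning implementation):

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions purely by input position,
    like Spark's RoundRobinPartitioning."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

data = list(range(10))
first_attempt = round_robin(data, 3)

# A retried upstream task regenerates the "same" rows in a different order
# (here simply reversed; in the report above, rand() re-draws on retry).
retry_attempt = round_robin(list(reversed(data)), 3)

print(first_attempt)
print(retry_attempt)
# Each attempt covers all ten ids, but the per-partition contents differ.
# If only *some* result-stage partitions are recomputed from the retried
# shuffle output, the union drops some ids and duplicates others --
# the count(distinct id) mismatch described above.
```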
[jira] [Updated] (SPARK-38388) Repartition + Stage retries could lead to incorrect data
[ https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Xu updated SPARK-38388: - Description: Spark repartition uses RoundRobinPartitioning, the generated results is non-deterministic when data has some randomness and stage/task retries happen. The bug can be triggered when upstream data has some randomness, a repartition is called on them, then followed by result stage (could be more stages). As the pattern shows below: upstream stage (data with randomness) -> (repartition shuffle) -> result stage When one executor goes down at result stage, some tasks of that stage might have finished, others would fail, shuffle files on that executor also get lost, some tasks from previous stage (upstream data generation, repartition) will need to rerun to generate dependent shuffle data files. Because data has some randomness, regenerated data in upstream retried tasks is slightly different, repartition then generates inconsistent ordering, then tasks at result stage will be retried generating different data. This is similar but different to https://issues.apache.org/jira/browse/SPARK-23207, fix for it uses extra local sort to make the row ordering deterministic, the sorting algorithm it uses simply compares row/record binaries. But in this case, upstream data has some randomness, the sorting algorithm doesn't help keep the order, thus RoundRobinPartitioning introduced non-deterministic result. The following code returns 998818, instead of 100: {code:java} import scala.sys.process._ import org.apache.spark.TaskContext case class TestObject(id: Long, value: Double) val ds = spark.range(0, 1000 * 1000, 1).repartition(100, $"id").withColumn("val", rand()).repartition(100).map { row => if (TaskContext.get.stageAttemptNumber == 0 && TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) { throw new Exception("pkill -f java".!!) 
} TestObject(row.getLong(0), row.getDouble(1)) } ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table") spark.sql("select count(distinct id) from tmp.test_table").show{code} Command: {code:java} spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false --conf spark.shuffle.service.enabled=false){code} To simulate the issue, disable external shuffle service is needed (if it's also enabled by default in your environment), this is to trigger shuffle file loss and previous stage retries. In our production, we have external shuffle service enabled, this data correctness issue happened when there were node losses. Although there's some non-deterministic factor in upstream data, user wouldn't expect to see incorrect result. was: Spark repartition uses RoundRobinPartitioning, the generated results is non-deterministic when data has some randomness and stage/task retries happen. The bug can be triggered when upstream data has some randomness, a repartition is called on them, then followed by result stage (could be more stages). As the pattern shows below: upstream stage (data with randomness) -> (repartition shuffle) -> result stage When one executor goes down at result stage, some tasks of that stage might have finished, others would fail, shuffle files on that executor also get lost, some tasks from previous stage (upstream data generation, repartition) will need to rerun to generate dependent shuffle data files. Because data has some randomness, regenerated data in upstream retried tasks is slightly different, repartition then generates inconsistent ordering, then tasks at result stage will be retried generating different data. This is similar but different to https://issues.apache.org/jira/browse/SPARK-23207, fix for it uses extra local sort to make the row ordering deterministic, the sorting algorithm it uses simply compares row/record binaries. 
But in this case, upstream data has some randomness, the sorting algorithm doesn't help keep the order, thus RoundRobinPartitioning introduced non-deterministic result. The following code returns 998818, instead of 100: {code:java} import scala.sys.process._ import org.apache.spark.TaskContext case class TestObject(id: Long, value: Double) val ds = spark.range(0, 1000 * 1000, 1).repartition(100, $"id").withColumn("val", rand()).repartition(100).map { row => if (TaskContext.get.stageAttemptNumber == 0 && TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 98) { throw new Exception("pkill -f java".!!) } TestObject(row.getLong(0), row.getDouble(1)) } ds.toDF("id", "value").write.saveAsTable("tmp.test_table") spark.sql("select count(distinct id) from tmp.test_table").show{code} Command: {code:java} spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false --conf spark.shuffle.service.enabled=false){code} To simulate the issue, disable external shuffle service is needed (if it's
[jira] [Resolved] (SPARK-38389) Add the `DATEDIFF` and `DATE_DIFF` aliases for `TIMESTAMPDIFF()`
[ https://issues.apache.org/jira/browse/SPARK-38389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38389. -- Resolution: Fixed Issue resolved by pull request 35709 [https://github.com/apache/spark/pull/35709] > Add the `DATEDIFF` and `DATE_DIFF` aliases for `TIMESTAMPDIFF()` > > > Key: SPARK-38389 > URL: https://issues.apache.org/jira/browse/SPARK-38389 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > Introduce the datediff()/date_diff() function, which takes three arguments: > unit, and two datetime expressions, i.e., > {code:sql} > datediff(unit, startDatetime, endDatetime) > {code} > The function can be an alias to timestampdiff(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
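The aliased semantics — whole units elapsed between two datetimes — can be sketched in Python for the fixed-length units (illustrative only; Spark implements `timestampdiff` in Catalyst, and calendar units such as MONTH and YEAR need calendar arithmetic that is omitted here):

```python
from datetime import datetime

SECONDS_PER = {"SECOND": 1, "MINUTE": 60, "HOUR": 3600, "DAY": 86400}

def datediff(unit, start, end):
    """Count whole units elapsed from start to end, TIMESTAMPDIFF-style."""
    delta = (end - start).total_seconds()
    # int() truncates toward zero: 1.5 days elapsed counts as 1 DAY.
    return int(delta / SECONDS_PER[unit.upper()])

s = datetime(2022, 3, 1, 0, 0, 0)
e = datetime(2022, 3, 2, 12, 0, 0)
print(datediff("HOUR", s, e))  # 36
print(datediff("DAY", s, e))   # 1
```

Note the argument order matches the ticket's signature, `datediff(unit, startDatetime, endDatetime)`, so a positive result means `end` is after `start`.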
[jira] [Commented] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong
[ https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500272#comment-17500272 ] Apache Spark commented on SPARK-33206: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/35714 > Spark Shuffle Index Cache calculates memory usage wrong > --- > > Key: SPARK-33206 > URL: https://issues.apache.org/jira/browse/SPARK-33206 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0, 3.0.1 >Reporter: Lars Francke >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.3.0 > > Attachments: image001(1).png > > > SPARK-21501 changed the spark shuffle index service to be based on memory > instead of the number of files. > Unfortunately, there's a problem with the calculation which is based on size > information provided by `ShuffleIndexInformation`. > It is based purely on the file size of the cached file on disk. > We're running in OOMs with very small index files (byte size ~16 bytes) but > the overhead of the ShuffleIndexInformation around this is much larger (e.g. > 184 bytes, see screenshot). We need to take this into account and should > probably add a fixed overhead of somewhere between 152 and 180 bytes > according to my tests. I'm not 100% sure what the correct number is and it'll > also depend on the architecture etc. so we can't be exact anyway. > If we do that we can maybe get rid of the size field in > ShuffleIndexInformation to save a few more bytes per entry. > In effect this means that for small files we use up about 70-100 times as > much memory as we intend to. Our NodeManagers OOM with 4GB and more of > indexShuffleCache. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
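The suggested fix is a cache weigher that charges every entry a fixed in-memory overhead on top of its on-disk size, instead of the file size alone. A sketch (the 180-byte constant is the ballpark from the report, not a measured value):

```python
# Approximate per-entry JVM overhead of a ShuffleIndexInformation wrapper;
# the report estimates somewhere between 152 and 184 bytes.
ENTRY_OVERHEAD_BYTES = 180

def entry_weight(index_file_size):
    """Weight used for cache eviction. Charging only the on-disk size
    badly underestimates a tiny index file's real heap footprint."""
    return index_file_size + ENTRY_OVERHEAD_BYTES

# A ~16-byte index file costs far more on the heap than a size-only
# weigher would charge it, which is how the cache blows past its budget.
naive = 16
corrected = entry_weight(16)
print(naive, corrected)
```

With this weigher the cache's configured byte budget bounds actual heap use, rather than just the summed file sizes.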
[jira] [Commented] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong
[ https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500271#comment-17500271 ] Apache Spark commented on SPARK-33206: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/35714 > Spark Shuffle Index Cache calculates memory usage wrong > --- > > Key: SPARK-33206 > URL: https://issues.apache.org/jira/browse/SPARK-33206 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0, 3.0.1 >Reporter: Lars Francke >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.3.0 > > Attachments: image001(1).png > > > SPARK-21501 changed the spark shuffle index service to be based on memory > instead of the number of files. > Unfortunately, there's a problem with the calculation which is based on size > information provided by `ShuffleIndexInformation`. > It is based purely on the file size of the cached file on disk. > We're running in OOMs with very small index files (byte size ~16 bytes) but > the overhead of the ShuffleIndexInformation around this is much larger (e.g. > 184 bytes, see screenshot). We need to take this into account and should > probably add a fixed overhead of somewhere between 152 and 180 bytes > according to my tests. I'm not 100% sure what the correct number is and it'll > also depend on the architecture etc. so we can't be exact anyway. > If we do that we can maybe get rid of the size field in > ShuffleIndexInformation to save a few more bytes per entry. > In effect this means that for small files we use up about 70-100 times as > much memory as we intend to. Our NodeManagers OOM with 4GB and more of > indexShuffleCache. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38394) build of spark sql against hadoop-3.4.0-snapshot failing with bouncycastle classpath error
Steve Loughran created SPARK-38394: -- Summary: build of spark sql against hadoop-3.4.0-snapshot failing with bouncycastle classpath error Key: SPARK-38394 URL: https://issues.apache.org/jira/browse/SPARK-38394 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.3.0 Reporter: Steve Loughran Building spark master with {{-Dhadoop.version=3.4.0-SNAPSHOT}} and a local hadoop build breaks in the sbt compiler plugin {code} [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first) on project spark-sql_2.12: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile failed: A required class was missing while executing net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile: org/bouncycastle/jce/provider/BouncyCastleProvider [ERROR] - [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.3.0 {code} * this is the classpath of the sbt compiler * hadoop hasn't been doing anything related to bouncy castle. Setting scala-maven-plugin to 3.4.0 makes this go away, i.e. reapplying SPARK-36547. The implication here is that the plugin version is going to have to be configured in different profiles. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap
[ https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500207#comment-17500207 ] Apache Spark commented on SPARK-38393: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/35713 > Clean up deprecated usage of GenSeq/GenMap > -- > > Key: SPARK-38393 > URL: https://issues.apache.org/jira/browse/SPARK-38393 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > GenSeq/GenMap is identified as @deprecated since Scala 2.13.0 and Gen* > collection types have been removed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap
[ https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38393: Assignee: Apache Spark > Clean up deprecated usage of GenSeq/GenMap > -- > > Key: SPARK-38393 > URL: https://issues.apache.org/jira/browse/SPARK-38393 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > GenSeq/GenMap is identified as @deprecated since Scala 2.13.0 and Gen* > collection types have been removed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap
[ https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38393: Assignee: (was: Apache Spark) > Clean up deprecated usage of GenSeq/GenMap > -- > > Key: SPARK-38393 > URL: https://issues.apache.org/jira/browse/SPARK-38393 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > GenSeq/GenMap is identified as @deprecated since Scala 2.13.0 and Gen* > collection types have been removed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap
[ https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500206#comment-17500206 ] Apache Spark commented on SPARK-38393: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/35713 > Clean up deprecated usage of GenSeq/GenMap > -- > > Key: SPARK-38393 > URL: https://issues.apache.org/jira/browse/SPARK-38393 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > GenSeq/GenMap is identified as @deprecated since Scala 2.13.0 and Gen* > collection types have been removed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap
Yang Jie created SPARK-38393: Summary: Clean up deprecated usage of GenSeq/GenMap Key: SPARK-38393 URL: https://issues.apache.org/jira/browse/SPARK-38393 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yang Jie GenSeq/GenMap is identified as @deprecated since Scala 2.13.0 and Gen* collection types have been removed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38342) Clean up deprecated api usage of Ivy
[ https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38342. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35672 [https://github.com/apache/spark/pull/35672] > Clean up deprecated api usage of Ivy > > > Key: SPARK-38342 > URL: https://issues.apache.org/jira/browse/SPARK-38342 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > {code:java} > [WARNING] [Warn] > /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459: > [deprecation @ > org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | > origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy > is deprecated {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38342) Clean up deprecated api usage of Ivy
[ https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38342: Assignee: Yang Jie > Clean up deprecated api usage of Ivy > > > Key: SPARK-38342 > URL: https://issues.apache.org/jira/browse/SPARK-38342 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > {code:java} > [WARNING] [Warn] > /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459: > [deprecation @ > org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | > origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy > is deprecated {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38392) [K8S] Improve the names used for K8S namespace and driver pod in integration tests
[ https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38392: Assignee: (was: Apache Spark) > [K8S] Improve the names used for K8S namespace and driver pod in integration > tests > -- > > Key: SPARK-38392 > URL: https://issues.apache.org/jira/browse/SPARK-38392 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Martin Tzvetanov Grigorov >Priority: Minor > > Currently when there is no configured K8S namespace by the user the > Kubernetes IT tests use UUID.toString() without the '-'es: > {code:java} > val namespace = > namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) > {code} > > Proposal: prefix the temporary namespace with "spark-" or "spark-test-". > > The name of the driver pod is: > {code:java} > driverPodName = "spark-test-app-" + > UUID.randomUUID().toString.replaceAll("-", "") {code} > i.e. it does not mention that it is the driver. > For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, > i.e. the value of "–name" config + hash + "-driver". > In both IT tests and non-test the executor pods always mention "-exec-", so > they are clear. > > Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + > "-driver" for the IT tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org