[jira] [Updated] (SPARK-28103) Cannot infer filters from union table with empty local relation table properly
[ https://issues.apache.org/jira/browse/SPARK-28103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Wong updated SPARK-28103:
---------------------------------
    Description: 
Basically, the constraints of a union table can become empty if any subtable is turned into an empty local relation. The side effect is that filters cannot be inferred correctly (by InferFiltersFromConstraints).

We may reproduce the issue with the following setup:

1) Prepare two tables:
{code:java}
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING PARQUET");
{code}
2) Create a union view on table1:
{code:java}
spark.sql("""
  | CREATE VIEW partitioned_table_1 AS
  | SELECT * FROM table1 WHERE id = 'a'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'b'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'c'
  | UNION ALL
  | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
  | """.stripMargin)
{code}
3) View the optimized plan of this SQL. The filter t2.id = 'a' cannot be inferred, and we can see that the constraints of the left table are empty:
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan
res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter (isnotnull(id#0) && (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation <empty>, [id#0, val#1]
:  :- LocalRelation <empty>, [id#0, val#1]
:  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter isnotnull(id#4)
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan.children(0).constraints
res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
{code}
4) Modify the query to avoid the empty local relations. The filter t2.id IN ('a','b','c','d') is then inferred properly, and the constraints of the left table are no longer empty:
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id IN ('a','b','c','d')").queryExecution.optimizedPlan
res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) && isnotnull(id#0))
:     +- Relation[id#0,val#1] parquet
+- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id IN ('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set(isnotnull(id#0), id#0 IN (a,b,c,d), ((((id#0 = a) || (id#0 = b)) || (id#0 = c)) || NOT id#0 IN (a,b,c)))
{code}
One possible workaround is to create a rule that removes all empty local relations from a union table. Alternatively, when we convert a relation into an empty local relation, we should preserve its constraints in the empty local relation as well.

A side note: expressions in the optimized plan are not fully simplified. For example, the expression
{code:java}
((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
{code}
could be further optimized into
{code:java}
(isnotnull(id#4) && (id#4 = d))
{code}
We may implement the following rules to simplify such expressions:
1) Convert all 'equal' operators into 'in' operators, and then group all 'in' and 'not in' expressions by attribute reference:
   i) eq(a,val) => in(a,val::Nil)
2) Merge all those 'in' and 'not in' operators, like:
   i) or(in(a,list1),in(a,list2)) => in(a, list1 ++ list2)
   ii) or(in(a,list1),not(in(a,list2))) => not(in(a, list2 -- list1))
   iii) and(in(a,list1),in(a,list2)) => in(a, list1 intersect list2)
   iv) and(in(a,list1),not(in(a,list2))) => in(a, list1 -- list2)
3) Revert an 'in' operator to 'equal' if there is only one element in the list:
   i) in(a,list) if list.size == 1 => eq(a,list.head)


was:
Basically, the constraints of a union table can become empty if any subtable is turned into an empty local relation. The side effect is that filters cannot be inferred correctly (by InferFiltersFromConstraints).
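A minimal Catalyst-style sketch of the first workaround (the rule name and the handling of an all-empty union are assumptions, not Spark code): it prunes empty LocalRelation children from a Union, so the union's constraints are derived only from the surviving children.
{code:java}
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, Union}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: drop empty LocalRelation children from a Union so that
// Union.constraints (the intersection of all children's constraints) is no
// longer emptied by optimized-away branches.
object RemoveEmptyUnionChildren extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case u: Union =>
      val nonEmpty = u.children.filterNot {
        case l: LocalRelation => l.data.isEmpty
        case _ => false
      }
      nonEmpty match {
        case Seq()       => LocalRelation(u.output) // every branch was empty
        case Seq(single) => single                  // a one-child Union collapses
        case children    => Union(children)
      }
  }
}
{code}
And a sketch of the proposed simplification rules 1) to 3), run over filter conditions (the object name is hypothetical; a real rule would need to handle literal types and foldability more carefully). Steps 1 and 3 invert each other, so they are kept in separate passes to avoid oscillation:
{code:java}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object MergeInPredicates extends Rule[LogicalPlan] {
  // 1) eq(a,val) => in(a,val::Nil), so every comparison becomes an IN.
  private def toIn(e: Expression): Expression = e transformUp {
    case EqualTo(a: Attribute, v: Literal) => In(a, Seq(v))
  }
  // 2) Merge IN / NOT IN predicates over the same attribute.
  private def merge(e: Expression): Expression = e transformUp {
    case Or(In(a: Attribute, l1), In(b: Attribute, l2)) if a.semanticEquals(b) =>
      In(a, (l1 ++ l2).distinct)                    // 2.i
    case Or(In(a: Attribute, l1), Not(In(b: Attribute, l2))) if a.semanticEquals(b) =>
      Not(In(a, l2.filterNot(l1.contains)))         // 2.ii
    case And(In(a: Attribute, l1), In(b: Attribute, l2)) if a.semanticEquals(b) =>
      In(a, l1.intersect(l2))                       // 2.iii
    case And(In(a: Attribute, l1), Not(In(b: Attribute, l2))) if a.semanticEquals(b) =>
      In(a, l1.filterNot(l2.contains))              // 2.iv
  }
  // 3) in(a,list) if list.size == 1 => eq(a,list.head).
  private def fromIn(e: Expression): Expression = e transformUp {
    case In(a: Attribute, Seq(v: Literal)) => EqualTo(a, v)
  }

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(condition, child) => Filter(fromIn(merge(toIn(condition))), child)
  }
}
{code}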
[jira] [Resolved] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens
[ https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27890. - Resolution: Fixed Fix Version/s: 3.0.0 > Improve SQL parser error message when missing backquotes for identifiers with > hyphens > - > > Key: SPARK-27890 > URL: https://issues.apache.org/jira/browse/SPARK-27890 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yesheng Ma >Assignee: Yesheng Ma >Priority: Major > Fix For: 3.0.0 > > > The current SQL parser's error message for hyphen-connected identifiers without > surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A > possible approach to tackle this is to explicitly capture these wrong usages > in the SQL parser. In this way, end users can fix these errors more > quickly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens
[ https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-27890: --- Assignee: Yesheng Ma > Improve SQL parser error message when missing backquotes for identifiers with > hyphens > - > > Key: SPARK-27890 > URL: https://issues.apache.org/jira/browse/SPARK-27890 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yesheng Ma >Assignee: Yesheng Ma >Priority: Major > > The current SQL parser's error message for hyphen-connected identifiers without > surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A > possible approach to tackle this is to explicitly capture these wrong usages > in the SQL parser. In this way, end users can fix these errors more > quickly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
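For reference, the usage in question, sketched in spark-shell ({{hyphen-table}} is a hypothetical table name): without backquotes the parser reads the hyphen as subtraction, so the identifier must be quoted.
{code:java}
// Fails to parse: hyphen-table is read as the expression `hyphen` minus `table`.
spark.sql("SELECT * FROM hyphen-table")

// Parses: backquotes make the hyphenated name a single identifier.
spark.sql("SELECT * FROM `hyphen-table`")
{code}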
[jira] [Updated] (SPARK-28103) Cannot infer filters from union table with empty local relation table properly
[ https://issues.apache.org/jira/browse/SPARK-28103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-28103:
----------------------------
    Target Version/s: 3.0.0

> Cannot infer filters from union table with empty local relation table properly
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-28103
>                 URL: https://issues.apache.org/jira/browse/SPARK-28103
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.1
>            Reporter: William Wong
>            Priority: Major
>
> Basically, the constraints of a union table can become empty if any subtable is turned into an empty local relation. The side effect is that filters cannot be inferred correctly (by InferFiltersFromConstraints).
>
> We may reproduce the issue with the following setup:
> 1) Prepare two tables:
> {code:java}
> spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING PARQUET");
> spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING PARQUET");
> {code}
> 2) Create a union view on table1:
> {code:java}
> spark.sql("""
>   | CREATE VIEW partitioned_table_1 AS
>   | SELECT * FROM table1 WHERE id = 'a'
>   | UNION ALL
>   | SELECT * FROM table1 WHERE id = 'b'
>   | UNION ALL
>   | SELECT * FROM table1 WHERE id = 'c'
>   | UNION ALL
>   | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
>   | """.stripMargin)
> {code}
> 3) View the optimized plan of this SQL. The filter t2.id = 'a' cannot be inferred, and we can see that the constraints of the left table are empty:
> {code:java}
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan
> res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Join Inner, (id#0 = id#4)
> :- Union
> :  :- Filter (isnotnull(id#0) && (id#0 = a))
> :  :  +- Relation[id#0,val#1] parquet
> :  :- LocalRelation <empty>, [id#0, val#1]
> :  :- LocalRelation <empty>, [id#0, val#1]
> :  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
> :     +- Relation[id#0,val#1] parquet
> +- Filter isnotnull(id#4)
>    +- Relation[id#4,val#5] parquet
>
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan.children(0).constraints
> res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
> {code}
> 4) Modify the query to avoid the empty local relations. The filter t2.id IN ('a','b','c','d') is then inferred properly, and the constraints of the left table are no longer empty:
> {code:java}
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id IN ('a','b','c','d')").queryExecution.optimizedPlan
> res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Join Inner, (id#0 = id#4)
> :- Union
> :  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
> :  :  +- Relation[id#0,val#1] parquet
> :  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
> :  :  +- Relation[id#0,val#1] parquet
> :  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
> :  :  +- Relation[id#0,val#1] parquet
> :  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) && isnotnull(id#0))
> :     +- Relation[id#0,val#1] parquet
> +- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
>    +- Relation[id#4,val#5] parquet
>
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id IN ('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
> res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set(isnotnull(id#0), id#0 IN (a,b,c,d), ((((id#0 = a) || (id#0 = b)) || (id#0 = c)) || NOT id#0 IN (a,b,c)))
> {code}
> One possible workaround is to create a rule that removes all empty local relations from a union table. Alternatively, when we convert a relation into an empty local relation, we should preserve its constraints in the empty local relation as well.
>
> A side note: expressions in the optimized plan are not fully simplified. For example, the expression
> {code:java}
> ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
> {code}
> could be further optimized into
> {code:java}
> (isnotnull(id#4) && (id#4 = d))
> {code}
> We may implement another rule to:
> 1) convert all 'equal' operators into 'in' operators, and then group all expressions by attribute reference,
> 2) merge all those 'in' (and 'not in') operators, and
> 3) revert an 'in' operator to 'equal' if there is only one element in the set.
[jira] [Resolved] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-28096. - Resolution: Fixed Assignee: Yesheng Ma Fix Version/s: 3.0.0 > Lazy val performance pitfall in Spark SQL LogicalPlans > -- > > Key: SPARK-28096 > URL: https://issues.apache.org/jira/browse/SPARK-28096 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Assignee: Yesheng Ma >Priority: Major > Fix For: 3.0.0 > > > The original {{references}} and {{validConstraints}} implementations in a few > QueryPlan and Expression classes are methods, which means unnecessary > re-computation can happen at times. This PR resolves this problem by making > these methods lazy vals. > We benchmarked this optimization on TPC-DS queries whose planning time is > longer than 1s. In the benchmark, we warmed up the queries for 5 iterations and > then took the average of 10 runs. Results showed that this micro-optimization > can improve the end-to-end planning time by 25%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
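A minimal sketch of the pitfall (the class below is a hypothetical stand-in, not Spark's actual QueryPlan): a {{def}} re-walks the subtree on every access, while a {{lazy val}} is computed at most once per node and cached across planning passes.
{code:java}
abstract class PlanNode {
  def children: Seq[PlanNode]
  def expressions: Seq[String]

  // Before: a def recomputes the set on every call, re-walking the subtree.
  // def references: Set[String] = expressions.toSet ++ children.flatMap(_.references)

  // After: a lazy val caches the result, so repeated accesses are cheap.
  lazy val references: Set[String] =
    expressions.toSet ++ children.flatMap(_.references)
}
{code}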
[jira] [Updated] (SPARK-28103) Cannot infer filters from union table with empty local relation table properly
[ https://issues.apache.org/jira/browse/SPARK-28103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Wong updated SPARK-28103:
---------------------------------
    Description: 
Basically, the constraints of a union table can become empty if any subtable is turned into an empty local relation. The side effect is that filters cannot be inferred correctly (by InferFiltersFromConstraints).

We may reproduce the issue with the following setup:

1) Prepare two tables:
{code:java}
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING PARQUET");
{code}
2) Create a union view on table1:
{code:java}
spark.sql("""
  | CREATE VIEW partitioned_table_1 AS
  | SELECT * FROM table1 WHERE id = 'a'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'b'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'c'
  | UNION ALL
  | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
  | """.stripMargin)
{code}
3) View the optimized plan of this SQL. The filter t2.id = 'a' cannot be inferred, and we can see that the constraints of the left table are empty:
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan
res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter (isnotnull(id#0) && (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation <empty>, [id#0, val#1]
:  :- LocalRelation <empty>, [id#0, val#1]
:  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter isnotnull(id#4)
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan.children(0).constraints
res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
{code}
4) Modify the query to avoid the empty local relations. The filter t2.id IN ('a','b','c','d') is then inferred properly, and the constraints of the left table are no longer empty:
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id IN ('a','b','c','d')").queryExecution.optimizedPlan
res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) && isnotnull(id#0))
:     +- Relation[id#0,val#1] parquet
+- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id IN ('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set(isnotnull(id#0), id#0 IN (a,b,c,d), ((((id#0 = a) || (id#0 = b)) || (id#0 = c)) || NOT id#0 IN (a,b,c)))
{code}
One possible workaround is to create a rule that removes all empty local relations from a union table. Alternatively, when we convert a relation into an empty local relation, we should preserve its constraints in the empty local relation as well.

A side note: expressions in the optimized plan are not fully simplified. For example, the expression
{code:java}
((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
{code}
could be further optimized into
{code:java}
(isnotnull(id#4) && (id#4 = d))
{code}
We may implement another rule to:
1) convert all 'equal' operators into 'in' operators, and then group all expressions by attribute reference,
2) merge all those 'in' (and 'not in') operators, and
3) revert an 'in' operator to 'equal' if there is only one element in the set.


was:
Basically, the constraints of a union table can become empty if any subtable is turned into an empty local relation. The side effect is that filters cannot be inferred correctly (by InferFiltersFromConstraints).

We may reproduce the issue with the following setup:

1) Prepare two tables:
{code:java}
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING PARQUET");
{code}
2) Create a union view on table1:
[jira] [Created] (SPARK-28103) Cannot infer filters from union table with empty local relation table properly
William Wong created SPARK-28103:
------------------------------------

             Summary: Cannot infer filters from union table with empty local relation table properly
                 Key: SPARK-28103
                 URL: https://issues.apache.org/jira/browse/SPARK-28103
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.1, 2.3.2
            Reporter: William Wong


Basically, the constraints of a union table can become empty if any subtable is turned into an empty local relation. The side effect is that filters cannot be inferred correctly (by InferFiltersFromConstraints).

We may reproduce the issue with the following setup:

1) Prepare two tables:
{code:java}
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING PARQUET");
{code}
2) Create a union view on table1:
{code:java}
spark.sql("""
  | CREATE VIEW partitioned_table_1 AS
  | SELECT * FROM table1 WHERE id = 'a'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'b'
  | UNION ALL
  | SELECT * FROM table1 WHERE id = 'c'
  | UNION ALL
  | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
  | """.stripMargin)
{code}
3) View the optimized plan of this SQL. The filter t2.id = 'a' cannot be inferred, and we can see that the constraints of the left table are empty:
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan
res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter (isnotnull(id#0) && (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation <empty>, [id#0, val#1]
:  :- LocalRelation <empty>, [id#0, val#1]
:  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter isnotnull(id#4)
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id = 'a'").queryExecution.optimizedPlan.children(0).constraints
res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
{code}
4) Modify the query to avoid the empty local relations. The filter t2.id IN ('a','b','c','d') is then inferred properly, and the constraints of the left table are no longer empty:
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id IN ('a','b','c','d')").queryExecution.optimizedPlan
res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) && isnotnull(id#0))
:     +- Relation[id#0,val#1] parquet
+- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id = t2.id AND t1.id IN ('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set(isnotnull(id#0), id#0 IN (a,b,c,d), ((((id#0 = a) || (id#0 = b)) || (id#0 = c)) || NOT id#0 IN (a,b,c)))
{code}
One possible workaround is to create a rule that removes all empty local relations from a union table. Alternatively, when we convert a relation into an empty local relation, we should preserve its constraints in the empty local relation as well.

A side note: expressions in the optimized plan are not fully simplified. For example, the expression
{code:java}
((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
{code}
could be further optimized into
{code:java}
(isnotnull(id#4) && (id#4 = d))
{code}
We may implement another rule to:
1) convert all 'equal' operators into 'in' operators, and then group all expressions by attribute reference,
2) merge all those 'in' (and 'not in') operators, and
3) revert an 'in' operator to 'equal' if there is only one element in the set.
[jira] [Resolved] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`
[ https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27105. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24068 [https://github.com/apache/spark/pull/24068] > Prevent exponential complexity in ORC `createFilter` > -- > > Key: SPARK-27105 > URL: https://issues.apache.org/jira/browse/SPARK-27105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ivan Vergiliev >Assignee: Ivan Vergiliev >Priority: Major > Labels: performance > Fix For: 3.0.0 > > > `OrcFilters.createFilters` currently has complexity that's exponential in the > height of the filter tree. There are multiple places in Spark that try to > prevent the generation of skewed trees so as to not trigger this behaviour, > for example: > - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` > combines a number of binary logical expressions into a balanced tree. > - https://github.com/apache/spark/pull/22313 introduced a change to > `OrcFilters` to create a balanced tree instead of a skewed tree. > However, the underlying exponential behaviour can still be triggered by code > paths that don't go through any of the tree balancing methods. For example, > if one generates a tree of `Column`s directly in user code, there's nothing > in Spark that automatically balances that tree and, hence, skewed trees hit > the exponential behaviour. We have hit this in production with jobs > mysteriously taking hours on the Spark driver with no worker activity, with > as few as ~30 OR filters. > I have a fix locally that makes the underlying logic have linear complexity > instead of exponential complexity. With this fix, the code can handle > thousands of filters in milliseconds. I'll send a PR with the fix soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`
[ https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27105: --- Assignee: Ivan Vergiliev > Prevent exponential complexity in ORC `createFilter` > -- > > Key: SPARK-27105 > URL: https://issues.apache.org/jira/browse/SPARK-27105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ivan Vergiliev >Assignee: Ivan Vergiliev >Priority: Major > Labels: performance > > `OrcFilters.createFilters` currently has complexity that's exponential in the > height of the filter tree. There are multiple places in Spark that try to > prevent the generation of skewed trees so as to not trigger this behaviour, > for example: > - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` > combines a number of binary logical expressions into a balanced tree. > - https://github.com/apache/spark/pull/22313 introduced a change to > `OrcFilters` to create a balanced tree instead of a skewed tree. > However, the underlying exponential behaviour can still be triggered by code > paths that don't go through any of the tree balancing methods. For example, > if one generates a tree of `Column`s directly in user code, there's nothing > in Spark that automatically balances that tree and, hence, skewed trees hit > the exponential behaviour. We have hit this in production with jobs > mysteriously taking hours on the Spark driver with no worker activity, with > as few as ~30 OR filters. > I have a fix locally that makes the underlying logic have linear complexity > instead of exponential complexity. With this fix, the code can handle > thousands of filters in milliseconds. I'll send a PR with the fix soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
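A hedged illustration of the tree-shape issue for filters built directly in user code ({{balancedOr}} is a hypothetical helper, not a Spark API): folding columns with {{||}} yields a left-skewed tree of depth n, while splitting pairwise keeps the depth at O(log n), sidestepping the exponential path.
{code:java}
import org.apache.spark.sql.Column

// Builds a balanced OR tree; assumes `filters` is non-empty.
// Contrast with filters.reduce(_ || _), which builds a left-skewed tree.
def balancedOr(filters: Seq[Column]): Column = filters match {
  case Seq(single) => single
  case _ =>
    val (left, right) = filters.splitAt(filters.length / 2)
    balancedOr(left) || balancedOr(right)
}
{code}
With the ~30 OR'ed predicates mentioned above, the balanced shape keeps `createFilter` fast even before the linear-complexity fix lands.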
[jira] [Assigned] (SPARK-28102) Add configuration for selecting LZ4 implementation (safe, unsafe, JNI)
[ https://issues.apache.org/jira/browse/SPARK-28102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28102: Assignee: Josh Rosen (was: Apache Spark) > Add configuration for selecting LZ4 implementation (safe, unsafe, JNI) > -- > > Key: SPARK-28102 > URL: https://issues.apache.org/jira/browse/SPARK-28102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark's use of {{lz4-java}} ends up calling {{Lz4Factory.fastestInstance}}, > which attempts to load JNI libraries and falls back on Java implementations > in case the JNI library cannot be loaded or initialized. > I run Spark in a configuration where the JNI libraries don't work, so I'd > like to configure LZ4 to not even attempt to use JNI code: if the JNI library > loads but cannot be initialized then the fallback code path involves catching > an exception and this is slow because the exception is thrown under a static > initializer lock (leading to significant lock contention because the filling > of stacktraces is done while holding this lock). > I propose to introduce a {{spark.io.compression.lz4.factory}} configuration > for selecting the LZ4 implementation, allowing users to disable the use of > the JNI library without having to recompile Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28102) Add configuration for selecting LZ4 implementation (safe, unsafe, JNI)
[ https://issues.apache.org/jira/browse/SPARK-28102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28102: Assignee: Apache Spark (was: Josh Rosen) > Add configuration for selecting LZ4 implementation (safe, unsafe, JNI) > -- > > Key: SPARK-28102 > URL: https://issues.apache.org/jira/browse/SPARK-28102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Major > > Spark's use of {{lz4-java}} ends up calling {{Lz4Factory.fastestInstance}}, > which attempts to load JNI libraries and falls back on Java implementations > in case the JNI library cannot be loaded or initialized. > I run Spark in a configuration where the JNI libraries don't work, so I'd > like to configure LZ4 to not even attempt to use JNI code: if the JNI library > loads but cannot be initialized then the fallback code path involves catching > an exception and this is slow because the exception is thrown under a static > initializer lock (leading to significant lock contention because the filling > of stacktraces is done while holding this lock). > I propose to introduce a {{spark.io.compression.lz4.factory}} configuration > for selecting the LZ4 implementation, allowing users to disable the use of > the JNI library without having to recompile Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28102) Add configuration for selecting LZ4 implementation (safe, unsafe, JNI)
Josh Rosen created SPARK-28102: -- Summary: Add configuration for selecting LZ4 implementation (safe, unsafe, JNI) Key: SPARK-28102 URL: https://issues.apache.org/jira/browse/SPARK-28102 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Josh Rosen Assignee: Josh Rosen Spark's use of {{lz4-java}} ends up calling {{Lz4Factory.fastestInstance}}, which attempts to load JNI libraries and falls back on Java implementations in case the JNI library cannot be loaded or initialized. I run Spark in a configuration where the JNI libraries don't work, so I'd like to configure LZ4 to not even attempt to use JNI code: if the JNI library loads but cannot be initialized then the fallback code path involves catching an exception and this is slow because the exception is thrown under a static initializer lock (leading to significant lock contention because the filling of stacktraces is done while holding this lock). I propose to introduce a {{spark.io.compression.lz4.factory}} configuration for selecting the LZ4 implementation, allowing users to disable the use of the JNI library without having to recompile Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
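For context, lz4-java already exposes each implementation as a separate factory method, so the proposed conf could map onto them roughly as follows (the conf value names here are guesses; only {{spark.io.compression.lz4.factory}} itself comes from the proposal).
{code:java}
import net.jpountz.lz4.LZ4Factory

// Select an implementation explicitly instead of LZ4Factory.fastestInstance(),
// which probes the JNI library first and falls back slowly when it fails.
def factoryFor(name: String): LZ4Factory = name match {
  case "jni"    => LZ4Factory.nativeInstance()   // JNI bindings to the C library
  case "unsafe" => LZ4Factory.unsafeInstance()   // pure Java, uses sun.misc.Unsafe
  case "safe"   => LZ4Factory.safeInstance()     // pure Java, no Unsafe
  case _        => LZ4Factory.fastestInstance()  // current behavior: JNI with fallback
}

val compressor = factoryFor("safe").fastCompressor()
{code}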
[jira] [Assigned] (SPARK-28075) String Functions: Enhance TRIM function
[ https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28075: --- Assignee: Yuming Wang > String Functions: Enhance TRIM function > --- > > Key: SPARK-28075 > URL: https://issues.apache.org/jira/browse/SPARK-28075 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Add support {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28075) String Functions: Enhance TRIM function
[ https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28075. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24891 [https://github.com/apache/spark/pull/24891] > String Functions: Enhance TRIM function > --- > > Key: SPARK-28075 > URL: https://issues.apache.org/jira/browse/SPARK-28075 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Add support {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
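For illustration, the forms being added, with results following standard SQL trim semantics (whitespace is the assumed default trim character, per the standard):
{code:java}
spark.sql("SELECT TRIM(BOTH FROM '  Spark  ')").show()     // 'Spark'
spark.sql("SELECT TRIM(LEADING FROM '  Spark  ')").show()  // 'Spark  '
spark.sql("SELECT TRIM(TRAILING FROM '  Spark  ')").show() // '  Spark'
{code}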
[jira] [Resolved] (SPARK-27716) Complete the transactions support for part of jdbc datasource operations.
[ https://issues.apache.org/jira/browse/SPARK-27716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-27716.
-------------------------------
    Resolution: Won't Fix

> Complete the transactions support for part of jdbc datasource operations.
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27716
>                 URL: https://issues.apache.org/jira/browse/SPARK-27716
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: feiwang
>            Priority: Major
>              Labels: pull-request-available
>
> With the jdbc datasource, we can save an RDD to the database.
> The comment on the function saveTable says:
> {code:java}
> /**
>  * Saves the RDD to the database in a single transaction.
>  */
> def saveTable(
>     df: DataFrame,
>     tableSchema: Option[StructType],
>     isCaseSensitive: Boolean,
>     options: JdbcOptionsInWrite)
> {code}
> In fact, that is not true: the savePartition operation runs in a single transaction, but the saveTable operation does not.
> There are several cases of data transmission:
> case1: Append data to an existing gptable.
> case2: Overwrite an existing gptable; the table is a cascadingTruncateTable, so we cannot drop it and have to truncate it and append data.
> case3: Overwrite an existing table that is not a cascadingTruncateTable, so we can drop it first.
> case4: For a nonexistent table, create it and transmit the data.
> In this PR, I add transaction support for case3 and case4.
> For case3 and case4, we can transmit the RDD to a temp table first.
> We use an accumulator to record the successful savePartition operations.
> At last, we compare the value of the accumulator with the DataFrame's partition number.
> If all the savePartition operations are successful, we drop the origin table if it exists, then rename the temp table to the origin table.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
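A hedged sketch of the temp-table protocol described in the issue; every identifier below ({{saveStagedPartition}}, {{execute}}, {{targetTable}}) is hypothetical, and accumulator over-counting on task retries is glossed over.
{code:java}
import org.apache.spark.sql.{DataFrame, Row}

def saveWithRename(df: DataFrame, targetTable: String): Unit = {
  val staging = s"${targetTable}_staging"
  val saved = df.sparkSession.sparkContext.longAccumulator("savedPartitions")
  val expected = df.rdd.getNumPartitions

  df.foreachPartition { rows: Iterator[Row] =>
    // saveStagedPartition(staging, rows)  // hypothetical: one transaction per partition
    saved.add(1)
  }

  // Commit only if every partition reported success; otherwise the original
  // table is left untouched and only the staging table needs cleanup.
  if (saved.value == expected) {
    println(s"promoting $staging to $targetTable")
    // execute(s"DROP TABLE IF EXISTS $targetTable")      // hypothetical JDBC call
    // execute(s"ALTER TABLE $staging RENAME TO $targetTable")
  }
}
{code}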
[jira] [Updated] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome
[ https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Leverentz updated SPARK-28085: - Description: In Chrome version 75, URLs in the Scala API documentation are not working properly, which makes them difficult to bookmark. For example, URLs like the following get redirected to a generic "root" package page: [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] Here's the URL that I get redirected to: [https://spark.apache.org/docs/latest/api/scala/index.html#package] This issue seems to have appeared between versions 74 and 75 of Chrome, but the documentation URLs still work in Safari. I suspect that this has something to do with security-related changes to how Chrome 75 handles frames and/or redirects. I've reported this issue to the Chrome team via the in-browser help menu, but I don't have any visibility into their response, so it's not clear whether they'll consider this a bug or "working as intended". was: In Chrome version 75, URLs in the Scala API documentation are not working properly, which makes them difficult to bookmark. For example, URLs like the following get redirected to a generic "root" package page: [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] Here's the URL that I get : [https://spark.apache.org/docs/latest/api/scala/index.html#package] This issue seems to have appeared between versions 74 and 75 of Chrome, but the documentation URLs still work in Safari. I suspect that this has something to do with security-related changes to how Chrome 75 handles frames and/or redirects. I've reported this issue to the Chrome team via the in-browser help menu, but I don't have any visibility into their response, so it's not clear whether they'll consider this a bug or "working as intended". > Spark Scala API documentation URLs not working properly in Chrome > - > > Key: SPARK-28085 > URL: https://issues.apache.org/jira/browse/SPARK-28085 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.3 >Reporter: Andrew Leverentz >Priority: Minor > > In Chrome version 75, URLs in the Scala API documentation are not working > properly, which makes them difficult to bookmark. > For example, URLs like the following get redirected to a generic "root" > package page: > [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] > [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] > Here's the URL that I get redirected to: > [https://spark.apache.org/docs/latest/api/scala/index.html#package] > This issue seems to have appeared between versions 74 and 75 of Chrome, but > the documentation URLs still work in Safari. I suspect that this has > something to do with security-related changes to how Chrome 75 handles frames > and/or redirects. I've reported this issue to the Chrome team via the > in-browser help menu, but I don't have any visibility into their response, so > it's not clear whether they'll consider this a bug or "working as intended". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28081) word2vec 'large' count value too low for very large corpora
[ https://issues.apache.org/jira/browse/SPARK-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28081. --- Resolution: Fixed Fix Version/s: 2.4.4 2.3.4 3.0.0 Issue resolved by pull request 24893 [https://github.com/apache/spark/pull/24893] > word2vec 'large' count value too low for very large corpora > --- > > Key: SPARK-28081 > URL: https://issues.apache.org/jira/browse/SPARK-28081 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0, 2.3.4, 2.4.4 > > > The word2vec implementation operates on word counts, and uses a hard-coded > value of 1e9 to mean "a very large count, larger than any actual count". > However, this causes the logic to fail if, in fact, a large corpus has some > words that really do occur more than this many times. We can probably improve > the implementation to better handle very large counts in general. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs
[ https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-27823. -- Resolution: Fixed > Add an abstraction layer for accelerator resource handling to avoid > manipulating raw confs > -- > > Key: SPARK-27823 > URL: https://issues.apache.org/jira/browse/SPARK-27823 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > In SPARK-27488, we extract resource requests and allocation by parsing raw > Spark confs. This hurts readability because we didn't have the abstraction at > resource level. After we merge the core changes, we should do a refactoring > and make the code more readable. > See https://github.com/apache/spark/pull/24615#issuecomment-494580663. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuite.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28101: -- Summary: Fix Flaky Test: `InputStreamsSuite.Modified files are correctly detected` in JDK9+ (was: Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+) > Fix Flaky Test: `InputStreamsSuite.Modified files are correctly detected` in > JDK9+ > -- > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Attachments: error.png > > > !error.png|width=100%! > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28101: Assignee: (was: Apache Spark) > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Attachments: error.png > > > !error.png|width=100%! > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28101: Assignee: Apache Spark > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > Attachments: error.png > > > !error.png|width=100%! > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27815) do not leak SaveMode to file source v2
[ https://issues.apache.org/jira/browse/SPARK-27815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867136#comment-16867136 ] Tony Zhang commented on SPARK-27815: [~cloud_fan] Hi Wenchen, do you think the solution mentioned below is viable? Create a new V2WriteCommand case class (and its Exec), named maybe _OverwriteByQueryId_, to replace WriteToDataSourceV2, which accepts a QueryId so that tests can pass. Or should we keep WriteToDataSourceV2? > do not leak SaveMode to file source v2 > -- > > Key: SPARK-27815 > URL: https://issues.apache.org/jira/browse/SPARK-27815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > Currently there is a hack in `DataFrameWriter`, which passes `SaveMode` to > file source v2. This should be removed and file source v2 should not accept > SaveMode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28101: -- Description: !error.png|height=250,width=250! {code} $ build/sbt "streaming/testOnly *.InputStreamsSuite" [info] - Modified files are correctly detected. *** FAILED *** (134 milliseconds) [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) [info] org.scalatest.exceptions.TestFailedException: {code} {code} Getting new files for time 1560896662000, ignoring files older than 1560896659679 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt not selected as mod time 1560896662679 > current time 1560896662000 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing ignored as mod time 1560896657679 <= ignore time 1560896659679 Finding new files took 0 ms New files at time 1560896662000 ms: {code} was: !error.png! {code} $ build/sbt "streaming/testOnly *.InputStreamsSuite" [info] - Modified files are correctly detected. *** FAILED *** (134 milliseconds) [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) [info] org.scalatest.exceptions.TestFailedException: {code} {code} Getting new files for time 1560896662000, ignoring files older than 1560896659679 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt not selected as mod time 1560896662679 > current time 1560896662000 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing ignored as mod time 1560896657679 <= ignore time 1560896659679 Finding new files took 0 ms New files at time 1560896662000 ms: {code} > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Attachments: error.png > > > !error.png|height=250,width=250! > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28101: -- Description: !error.png|width=100%! {code} $ build/sbt "streaming/testOnly *.InputStreamsSuite" [info] - Modified files are correctly detected. *** FAILED *** (134 milliseconds) [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) [info] org.scalatest.exceptions.TestFailedException: {code} {code} Getting new files for time 1560896662000, ignoring files older than 1560896659679 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt not selected as mod time 1560896662679 > current time 1560896662000 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing ignored as mod time 1560896657679 <= ignore time 1560896659679 Finding new files took 0 ms New files at time 1560896662000 ms: {code} was: !error.png|height=250,width=250! {code} $ build/sbt "streaming/testOnly *.InputStreamsSuite" [info] - Modified files are correctly detected. *** FAILED *** (134 milliseconds) [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) [info] org.scalatest.exceptions.TestFailedException: {code} {code} Getting new files for time 1560896662000, ignoring files older than 1560896659679 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt not selected as mod time 1560896662679 > current time 1560896662000 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing ignored as mod time 1560896657679 <= ignore time 1560896659679 Finding new files took 0 ms New files at time 1560896662000 ms: {code} > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Attachments: error.png > > > !error.png|width=100%! > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28101: -- Attachment: error.png > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Attachments: error.png > > > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28101: -- Description: !error.png! {code} $ build/sbt "streaming/testOnly *.InputStreamsSuite" [info] - Modified files are correctly detected. *** FAILED *** (134 milliseconds) [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) [info] org.scalatest.exceptions.TestFailedException: {code} {code} Getting new files for time 1560896662000, ignoring files older than 1560896659679 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt not selected as mod time 1560896662679 > current time 1560896662000 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing ignored as mod time 1560896657679 <= ignore time 1560896659679 Finding new files took 0 ms New files at time 1560896662000 ms: {code} was: {code} $ build/sbt "streaming/testOnly *.InputStreamsSuite" [info] - Modified files are correctly detected. *** FAILED *** (134 milliseconds) [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) [info] org.scalatest.exceptions.TestFailedException: {code} {code} Getting new files for time 1560896662000, ignoring files older than 1560896659679 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt not selected as mod time 1560896662679 > current time 1560896662000 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing ignored as mod time 1560896657679 <= ignore time 1560896659679 Finding new files took 0 ms New files at time 1560896662000 ms: {code} > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Attachments: error.png > > > !error.png! > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28101: -- Summary: Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+ (was: Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected`) > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28101: -- Attachment: (was: Screen Shot 2019-06-18 at 4.41.11 PM.png) > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected`
Dongjoon Hyun created SPARK-28101: - Summary: Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` Key: SPARK-28101 URL: https://issues.apache.org/jira/browse/SPARK-28101 Project: Spark Issue Type: Sub-task Components: DStreams, Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun {code} $ build/sbt "streaming/testOnly *.InputStreamsSuite" [info] - Modified files are correctly detected. *** FAILED *** (134 milliseconds) [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) [info] org.scalatest.exceptions.TestFailedException: {code} {code} Getting new files for time 1560896662000, ignoring files older than 1560896659679 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt not selected as mod time 1560896662679 > current time 1560896662000 file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing ignored as mod time 1560896657679 <= ignore time 1560896659679 Finding new files took 0 ms New files at time 1560896662000 ms: {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
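[Editorial note] The log lines in the report above imply a selection window on file modification time. The sketch below restates that predicate in plain Scala purely for illustration; the function name and signature are mine, not Spark's internals. Plugging in the timestamps from the log shows why both files are skipped in the failing run.
{code:scala}
// A file is picked up only when its modification time falls in
// (ignoreTime, currentTime], per the "not selected as mod time ... > current
// time" and "ignored as mod time ... <= ignore time" log lines above.
def isSelected(modTime: Long, ignoreTime: Long, currentTime: Long): Boolean =
  modTime > ignoreTime && modTime <= currentTime

// renamed.txt: mod time is later than the batch time, so it is not selected.
isSelected(1560896662679L, 1560896659679L, 1560896662000L) // false
// existing: mod time is at or before the ignore threshold, so it is ignored.
isSelected(1560896657679L, 1560896659679L, 1560896662000L) // false
{code}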
[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28101: -- Attachment: Screen Shot 2019-06-18 at 4.41.11 PM.png > Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in > JDK9+ > - > > Key: SPARK-28101 > URL: https://issues.apache.org/jira/browse/SPARK-28101 > Project: Spark > Issue Type: Sub-task > Components: DStreams, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > {code} > $ build/sbt "streaming/testOnly *.InputStreamsSuite" > [info] - Modified files are correctly detected. *** FAILED *** (134 > milliseconds) > [info] Set("renamed") did not equal Set() (InputStreamsSuite.scala:312) > [info] org.scalatest.exceptions.TestFailedException: > {code} > {code} > Getting new files for time 1560896662000, ignoring files older than > 1560896659679 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt > not selected as mod time 1560896662679 > current time 1560896662000 > file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing > ignored as mod time 1560896657679 <= ignore time 1560896659679 > Finding new files took 0 ms > New files at time 1560896662000 ms: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28039) Add float4.sql
[ https://issues.apache.org/jira/browse/SPARK-28039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-28039. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > Add float4.sql > -- > > Key: SPARK-28039 > URL: https://issues.apache.org/jira/browse/SPARK-28039 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28088) String Functions: Enhance LPAD/RPAD function
[ https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28088: - Assignee: Yuming Wang > String Functions: Enhance LPAD/RPAD function > > > Key: SPARK-28088 > URL: https://issues.apache.org/jira/browse/SPARK-28088 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Enhance LPAD/RPAD function to make {{pad}} parameter optional. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28088) String Functions: Enhance LPAD/RPAD function
[ https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28088. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24899 [https://github.com/apache/spark/pull/24899] > String Functions: Enhance LPAD/RPAD function > > > Key: SPARK-28088 > URL: https://issues.apache.org/jira/browse/SPARK-28088 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Enhance LPAD/RPAD function to make {{pad}} parameter optional. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB
[ https://issues.apache.org/jira/browse/SPARK-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Maynard updated SPARK-24936: - Description: *strong text*After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're over 2GB. However, this will fail with an external shuffle service running Spark < 2.2, with an unhelpful error message like:
{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: java.lang.UnsupportedOperationException
at org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}
We can't do anything to make the shuffle succeed in this situation, but we should fail with a better error message. was: After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're over 2GB. However, this will fail with an external shuffle service running Spark < 2.2, with an unhelpful error message like:
{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: java.lang.UnsupportedOperationException
at org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}
We can't do anything to make the shuffle succeed in this situation, but we should fail with a better error message. > Better error message when trying a shuffle fetch over 2 GB > -- > > Key: SPARK-24936 > URL: https://issues.apache.org/jira/browse/SPARK-24936 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > *strong text*After SPARK-24297, Spark will try to fetch shuffle blocks to > disk if they're over 2GB. However, this will fail with an external shuffle > service running Spark < 2.2, with an unhelpful error message like: > {noformat} > 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 > (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 > , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message= > org.apache.spark.shuffle.FetchFailedException: > java.lang.UnsupportedOperationException > at > org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > ... > {noformat} > We can't do anything to make the shuffle succeed in this situation, but we > should fail with a better error message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866973#comment-16866973 ] M. Le Bihan edited comment on SPARK-18105 at 6/18/19 8:45 PM: -- My trick eventually didn't succeed, and I fell back into the bug again. I've attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala, but received this kind of stack trace:
{code:log}
2019-06-18 20:43:54.747 INFO 1539 --- [er for task 547] o.a.s.s.ShuffleBlockFetcherIterator : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] org.apache.spark.executor.Executor : Exception in task 93.0 in stage 4.2 (TID 547)
java.lang.NullPointerException: null
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na]
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.Task.run(Task.scala:121) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]
{code}
This issue, SPARK-18105, prevents using Spark at all. Spark, with 300 open and in-progress (but stalling) issues, is becoming less and less reliable each day. I'm about to send a message on the dev forum to ask if developers can stop implementing new features until they have corrected the issues in the features they once wrote. was (Author: mlebihan): My trick eventually didn't succeed, and I fell back into the bug again.
I've attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala, but received this kind of stack trace:
{code:log}
2019-06-18 20:43:54.747 INFO 1539 --- [er for task 547] o.a.s.s.ShuffleBlockFetcherIterator : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] org.apache.spark.executor.Executor : Exception in task 93.0 in stage 4.2 (TID 547)
java.lang.NullPointerException: null
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na]
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.Task.run(Task.scala:121)
[jira] [Updated] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB
[ https://issues.apache.org/jira/browse/SPARK-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Maynard updated SPARK-24936: - Description: After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're over 2GB. However, this will fail with an external shuffle service running Spark < 2.2, with an unhelpful error message like:
{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: java.lang.UnsupportedOperationException
at org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}
We can't do anything to make the shuffle succeed in this situation, but we should fail with a better error message. was: *strong text*After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're over 2GB. However, this will fail with an external shuffle service running Spark < 2.2, with an unhelpful error message like:
{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: java.lang.UnsupportedOperationException
at org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}
We can't do anything to make the shuffle succeed in this situation, but we should fail with a better error message. > Better error message when trying a shuffle fetch over 2 GB > -- > > Key: SPARK-24936 > URL: https://issues.apache.org/jira/browse/SPARK-24936 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're > over 2GB. However, this will fail with an external shuffle service running > Spark < 2.2, with an unhelpful error message like: > {noformat} > 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 > (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 > , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message= > org.apache.spark.shuffle.FetchFailedException: > java.lang.UnsupportedOperationException > at > org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > ... > {noformat} > We can't do anything to make the shuffle succeed in this situation, but we > should fail with a better error message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28100) Unable to override JDBC types for ByteType when providing a custom JdbcDialect for Postgres
Seth Fitzsimmons created SPARK-28100: Summary: Unable to override JDBC types for ByteType when providing a custom JdbcDialect for Postgres Key: SPARK-28100 URL: https://issues.apache.org/jira/browse/SPARK-28100 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.4.3 Reporter: Seth Fitzsimmons {{AggregatedDialect}} defines [{{getJDBCType}}|https://github.com/apache/spark/blob/1217996f1574f758d81c4e3846452d24b35b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala#L41-L43] as: {{dialects.flatMap(_.getJDBCType(dt)).headOption}} However, when attempting to write a {{ByteType}}, {{PostgresDialect}} currently throws: https://github.com/apache/spark/blob/1217996f1574f758d81c4e3846452d24b35b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L83 This prevents any other registered dialects from providing a JDBC type for {{ByteType}}, as the last one (Spark's default {{PostgresDialect}}) will raise an uncaught exception. https://github.com/apache/spark/pull/24845 addresses this by providing a mapping, but the general problem holds if any {{JdbcDialect}} implementation ever throws in {{getJDBCType}} for any reason. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
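[Editorial note] To make the failure mode above concrete, here is a minimal sketch of a custom dialect registration against Spark 2.4's public {{JdbcDialect}} API; the {{MyPostgresDialect}} name is hypothetical. Because {{flatMap}} on a Seq is eager, every matching dialect's {{getJDBCType}} runs before {{headOption}} is taken, so the built-in {{PostgresDialect}} still throws for {{ByteType}} even when the custom dialect would have supplied a mapping.
{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{ByteType, DataType}

// Hypothetical custom dialect that supplies a ByteType mapping for Postgres.
object MyPostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("SMALLINT", java.sql.Types.SMALLINT))
    case _        => None
  }
}

// Both this dialect and the built-in PostgresDialect now match postgres URLs,
// so Spark combines them into an AggregatedDialect. Per the ticket, its
// getJDBCType is dialects.flatMap(_.getJDBCType(dt)).headOption; flatMap
// invokes every dialect, so PostgresDialect's throw for ByteType surfaces
// before headOption can return the custom mapping above.
JdbcDialects.registerDialect(MyPostgresDialect)
{code}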
[jira] [Created] (SPARK-28099) Assertion when querying unpartitioned Hive table with partition-like naming
Douglas Drinka created SPARK-28099: -- Summary: Assertion when querying unpartitioned Hive table with partition-like naming Key: SPARK-28099 URL: https://issues.apache.org/jira/browse/SPARK-28099 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Douglas Drinka
{code:java}
val testData = List(1,2,3,4,5)
val dataFrame = testData.toDF()
dataFrame
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .format("orc")
  .option("compression", "zlib")
  .save("s3://ddrinka.sparkbug/testFail/dir1=1/dir2=2/")

spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.testFail")
spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.testFail (val INT) STORED AS ORC LOCATION 's3://ddrinka.sparkbug/testFail/'")

val queryResponse = spark.sql("SELECT * FROM ddrinka_sparkbug.testFail")
//Throws AssertionError
//at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:214)
{code}
It looks like the native ORC reader is creating virtual columns named dir1 and dir2, which don't exist in the Hive table. [The assertion|https://github.com/apache/spark/blob/c0297dedd829a92cca920ab8983dab399f8f32d5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L257] checks that the number of columns matches, which fails due to the virtual partition columns. Actually getting data back from this query will be dependent on SPARK-28098, supporting subdirectories for Hive queries at all. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28096: -- Issue Type: Improvement (was: Bug) > Lazy val performance pitfall in Spark SQL LogicalPlans > -- > > Key: SPARK-28096 > URL: https://issues.apache.org/jira/browse/SPARK-28096 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > The original {{references}} and {{validConstraints}} implementations in a few > QueryPlan and Expression classes are methods, which means unnecessary > re-computation can happen at times. This PR resolves this problem by making > these method lazy vals. > We benchmarked this optimization on TPC-DS queries whose planning time is > longer than 1s. In the benchmark, we warmed up the queries 5 iterations and > then took the average of 10 runs. Results showed that this micro-optimization > can improve the end-to-end planning time by 25%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables
Douglas Drinka created SPARK-28098: -- Summary: Native ORC reader doesn't support subdirectories with Hive tables Key: SPARK-28098 URL: https://issues.apache.org/jira/browse/SPARK-28098 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Douglas Drinka The Hive ORC reader supports recursive directory reads from S3. Spark's native ORC reader supports recursive directory reads, but not when used with Hive.
{code:java}
val testData = List(1,2,3,4,5)
val dataFrame = testData.toDF()
dataFrame
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .format("orc")
  .option("compression", "zlib")
  .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")

spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")

spark.conf.set("hive.mapred.supports.subdirectories","true")
spark.conf.set("mapred.input.dir.recursive","true")
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")

spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count) //0

spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count) //5
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28097: Assignee: Apache Spark > Map ByteType to SMALLINT when using JDBC with PostgreSQL > > > Key: SPARK-28097 > URL: https://issues.apache.org/jira/browse/SPARK-28097 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.3, 2.4.3 >Reporter: Seth Fitzsimmons >Assignee: Apache Spark >Priority: Minor > > PostgreSQL doesn't have {{TINYINT}}, which would map directly, but > {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when > writing). > This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28097: Assignee: (was: Apache Spark) > Map ByteType to SMALLINT when using JDBC with PostgreSQL > > > Key: SPARK-28097 > URL: https://issues.apache.org/jira/browse/SPARK-28097 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.3, 2.4.3 >Reporter: Seth Fitzsimmons >Priority: Minor > > PostgreSQL doesn't have {{TINYINT}}, which would map directly, but > {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when > writing). > This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867004#comment-16867004 ] Apache Spark commented on SPARK-28097: -- User 'mojodna' has created a pull request for this issue: https://github.com/apache/spark/pull/24845 > Map ByteType to SMALLINT when using JDBC with PostgreSQL > > > Key: SPARK-28097 > URL: https://issues.apache.org/jira/browse/SPARK-28097 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.3, 2.4.3 >Reporter: Seth Fitzsimmons >Priority: Minor > > PostgreSQL doesn't have {{TINYINT}}, which would map directly, but > {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when > writing). > This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL
Seth Fitzsimmons created SPARK-28097: Summary: Map ByteType to SMALLINT when using JDBC with PostgreSQL Key: SPARK-28097 URL: https://issues.apache.org/jira/browse/SPARK-28097 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.4.3, 2.3.3 Reporter: Seth Fitzsimmons PostgreSQL doesn't have {{TINYINT}}, which would map directly, but {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when writing). This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
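[Editorial note] As a rough illustration of the equivalence the ticket mentions, one can widen the byte column to {{ShortType}} before writing, so PostgreSQL stores it as {{SMALLINT}}. This is a sketch only; the column name {{byteColumn}} and the JDBC connection options are placeholders, and it assumes a spark-shell session where {{spark.implicits._}} is available.
{code:scala}
import spark.implicits._
import org.apache.spark.sql.types.ShortType

// Hypothetical DataFrame with a single ByteType column named "byteColumn".
val df = Seq(1.toByte, 2.toByte, 3.toByte).toDF("byteColumn")

// Widen the byte column before writing; PostgreSQL stores it as SMALLINT.
val widened = df.withColumn("byteColumn", df("byteColumn").cast(ShortType))

widened.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/mydb") // placeholder URL
  .option("dbtable", "target_table")                 // placeholder table name
  .save()
{code}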
[jira] [Commented] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866995#comment-16866995 ] Apache Spark commented on SPARK-28096: -- User 'yeshengm' has created a pull request for this issue: https://github.com/apache/spark/pull/24866 > Lazy val performance pitfall in Spark SQL LogicalPlans > -- > > Key: SPARK-28096 > URL: https://issues.apache.org/jira/browse/SPARK-28096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > The original {{references}} and {{validConstraints}} implementations in a few > QueryPlan and Expression classes are methods, which means unnecessary > re-computation can happen at times. This PR resolves this problem by making > these method lazy vals. > We benchmarked this optimization on TPC-DS queries whose planning time is > longer than 1s. In the benchmark, we warmed up the queries 5 iterations and > then took the average of 10 runs. Results showed that this micro-optimization > can improve the end-to-end planning time by 25%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28096: Assignee: (was: Apache Spark) > Lazy val performance pitfall in Spark SQL LogicalPlans > -- > > Key: SPARK-28096 > URL: https://issues.apache.org/jira/browse/SPARK-28096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > The original {{references}} and {{validConstraints}} implementations in a few > QueryPlan and Expression classes are methods, which means unnecessary > re-computation can happen at times. This PR resolves this problem by making > these method lazy vals. > We benchmarked this optimization on TPC-DS queries whose planning time is > longer than 1s. In the benchmark, we warmed up the queries 5 iterations and > then took the average of 10 runs. Results showed that this micro-optimization > can improve the end-to-end planning time by 25%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28096: Assignee: Apache Spark > Lazy val performance pitfall in Spark SQL LogicalPlans > -- > > Key: SPARK-28096 > URL: https://issues.apache.org/jira/browse/SPARK-28096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Assignee: Apache Spark >Priority: Major > > The original {{references}} and {{validConstraints}} implementations in a few > QueryPlan and Expression classes are methods, which means unnecessary > re-computation can happen at times. This PR resolves this problem by making > these method lazy vals. > We benchmarked this optimization on TPC-DS queries whose planning time is > longer than 1s. In the benchmark, we warmed up the queries 5 iterations and > then took the average of 10 runs. Results showed that this micro-optimization > can improve the end-to-end planning time by 25%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesheng Ma updated SPARK-28096: --- Summary: Lazy val performance pitfall in Spark SQL LogicalPlans (was: Performance pitfall in Spark SQL LogicalPlans) > Lazy val performance pitfall in Spark SQL LogicalPlans > -- > > Key: SPARK-28096 > URL: https://issues.apache.org/jira/browse/SPARK-28096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > The original {{references}} and {{validConstraints}} implementations in a few > QueryPlan and Expression classes are methods, which means unnecessary > re-computation can happen at times. This PR resolves this problem by making > these method lazy vals. > We benchmarked this optimization on TPC-DS queries whose planning time is > longer than 1s. In the benchmark, we warmed up the queries 5 iterations and > then took the average of 10 runs. Results showed that this micro-optimization > can improve the end-to-end planning time by 25%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28096) Performance pitfall in Spark SQL LogicalPlans
Yesheng Ma created SPARK-28096: -- Summary: Performance pitfall in Spark SQL LogicalPlans Key: SPARK-28096 URL: https://issues.apache.org/jira/browse/SPARK-28096 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma The original {{references}} and {{validConstraints}} implementations in a few QueryPlan and Expression classes are methods, which means unnecessary re-computation can happen at times. This PR resolves this problem by making these methods lazy vals. We benchmarked this optimization on TPC-DS queries whose planning time is longer than 1s. In the benchmark, we warmed up the queries for 5 iterations and then took the average of 10 runs. Results showed that this micro-optimization can improve the end-to-end planning time by 25%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
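[Editorial note] For context on the {{def}}-versus-{{lazy val}} distinction this ticket relies on: a {{def}} recomputes its body on every call, while a {{lazy val}} computes once and caches, which is safe on immutable plan nodes. The sketch below is illustrative only; the class and member names are mine, not Spark's.
{code:scala}
// Illustrative only: memoizing a derived set on an immutable tree node
// avoids recomputation across repeated rule applications.
class PlanNode(val name: String, val children: Seq[PlanNode]) {
  // Before: recomputed on every access during analysis/optimization.
  def referencesAsDef: Set[String] =
    children.flatMap(_.referencesAsDef).toSet + name

  // After: computed at most once per node, then cached.
  lazy val referencesAsVal: Set[String] =
    children.flatMap(_.referencesAsVal).toSet + name
}
{code}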
[jira] [Updated] (SPARK-28095) Pyspark with kubernetes doesn't parse arguments with spaces as expected.
[ https://issues.apache.org/jira/browse/SPARK-28095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emma Dickson updated SPARK-28095: - Description: When passing arguments to a bash script that sets up spark-submit using a Python file that creates a PySpark context, strings with spaces are processed as individual strings. This occurs even when the argument is encased in double quotes, escaped with backslashes, or written with unicode escape characters. Example Command entered: This uses an IBM-specific driver, [stocator|https://github.com/CODAIT/stocator], hence the cos URL
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" --job pages
{code}
Error Message
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner /opt/spark/work-dir/main.py --job-args cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}
was: When passing arguments to a bash script that sets up spark-submit using a Python file that creates a PySpark context, strings with spaces are processed as individual strings. This occurs even when the argument is encased in double quotes, escaped with backslashes, or written with unicode escape characters. Example Command entered: This uses an IBM-specific driver, hence the cos URL
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" --job pages
{code}
Error Message
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner /opt/spark/work-dir/main.py --job-args cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}
> Pyspark with kubernetes doesn't parse arguments with spaces as expected. > > > Key: SPARK-28095 > URL: https://issues.apache.org/jira/browse/SPARK-28095 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.3 > Environment: Python 2.7.13 > Spark 2.4.3 > Kubernetes > >Reporter: Emma Dickson >Priority: Minor > Labels: newbie, usability > > When passing arguments to a bash script that sets up spark-submit using a > Python file that creates a PySpark context, strings with spaces are processed > as individual strings. This occurs even when the argument is encased in > double quotes, escaped with backslashes, or written with unicode escape characters.
> > Example > Command entered: This uses an IBM-specific driver, > [stocator|https://github.com/CODAIT/stocator], hence the cos URL > {code:java} > ./scripts/spark-k8s.sh v0.0.32 --job-args > "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" > --job pages{code} > > Error Message > > {code:java} > + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner > /opt/spark/work-dir/main.py --job-args > cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer > --job pages > 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of > HTTPS. > 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > usage: main.py [-h] --job JOB --job-args JOB_ARGS > main.py: error: unrecognized arguments: Balancer > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27783) Add customizable hint error handler
[ https://issues.apache.org/jira/browse/SPARK-27783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27783. - Resolution: Fixed Assignee: Maryann Xue Fix Version/s: 3.0.0 > Add customizable hint error handler > --- > > Key: SPARK-27783 > URL: https://issues.apache.org/jira/browse/SPARK-27783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 3.0.0 > > > Add a customizable hint error handler, with default behavior as logging > warnings. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28095) Pyspark with kubernetes doesn't parse arguments with spaces as expected.
[ https://issues.apache.org/jira/browse/SPARK-28095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emma Dickson updated SPARK-28095: - Description: When passing in arguments to a bash script that sets up spark submit using a python file that sets up a pyspark context strings with spaces are processed as individual strings. This occurs even when the argument is encased in double quotes, using backslashes or unicode escape characters. Example Command entered: This uses and IBM specific driver hence the cos url {code:java} ./scripts/spark-k8s.sh v0.0.32 --job-args "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" --job pages{code} Error Message {code:java} + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner /opt/spark/work-dir/main.py --job-args cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job pages 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS. 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable usage: main.py [-h] --job JOB --job-args JOB_ARGS main.py: error: unrecognized arguments: Balancer {code} was: When passing in arguments to a bash script that sets up spark submit using a python file that sets up a pyspark context strings with spaces are processed as individual strings. This occurs even when the argument is encased in double quotes, using backslashes or unicode escape characters. Example Command entered {code:java} ./scripts/spark-k8s.sh v0.0.32 --job-args "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" --job pages{code} Error Message {code:java} + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner /opt/spark/work-dir/main.py --job-args cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job pages 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS. 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable usage: main.py [-h] --job JOB --job-args JOB_ARGS main.py: error: unrecognized arguments: Balancer {code} > Pyspark with kubernetes doesn't parse arguments with spaces as expected. > > > Key: SPARK-28095 > URL: https://issues.apache.org/jira/browse/SPARK-28095 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.3 > Environment: Python 2.7.13 > Spark 2.4.3 > Kubernetes > >Reporter: Emma Dickson >Priority: Minor > Labels: newbie, usability > > When passing in arguments to a bash script that sets up spark submit using a > python file that sets up a pyspark context strings with spaces are processed > as individual strings. This occurs even when the argument is encased in > double quotes, using backslashes or unicode escape characters. 
> > Example > Command entered: This uses and IBM specific driver hence the cos url > {code:java} > ./scripts/spark-k8s.sh v0.0.32 --job-args > "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" > --job pages{code} > > Error Message > > {code:java} > + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner > /opt/spark/work-dir/main.py --job-args > cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer > --job pages > 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of > HTTPS. > 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > usage: main.py [-h] --job JOB --job-args JOB_ARGS > main.py: error: unrecognized arguments: Balancer > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28095) Pyspark with kubernetes doesn't parse arguments with spaces as expected.
[ https://issues.apache.org/jira/browse/SPARK-28095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emma Dickson updated SPARK-28095: - Description: When passing in arguments to a bash script that sets up spark submit using a python file that sets up a pyspark context strings with spaces are processed as individual strings. This occurs even when the argument is encased in double quotes, using backslashes or unicode escape characters. Example Command entered {code:java} ./scripts/spark-k8s.sh v0.0.32 --job-args "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" --job pages{code} Error Message {code:java} + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner /opt/spark/work-dir/main.py --job-args cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job pages 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS. 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable usage: main.py [-h] --job JOB --job-args JOB_ARGS main.py: error: unrecognized arguments: Balancer {code} was: When passing in arguments to a bash script that calls a python file which sets up a pyspark context strings with spaces are processed as individual strings even when encased in double quotes, using backslashes or unicode escape characters. Example Command entered {code:java} ./scripts/spark-k8s.sh v0.0.32 --job-args "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" --job pages{code} Error Message {code:java} + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner /opt/spark/work-dir/main.py --job-args cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job pages 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS. 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable usage: main.py [-h] --job JOB --job-args JOB_ARGS main.py: error: unrecognized arguments: Balancer {code} > Pyspark with kubernetes doesn't parse arguments with spaces as expected. > > > Key: SPARK-28095 > URL: https://issues.apache.org/jira/browse/SPARK-28095 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.3 > Environment: Python 2.7.13 > Spark 2.4.3 > Kubernetes > >Reporter: Emma Dickson >Priority: Minor > Labels: newbie, usability > > When passing in arguments to a bash script that sets up spark submit using a > python file that sets up a pyspark context strings with spaces are processed > as individual strings. This occurs even when the argument is encased in > double quotes, using backslashes or unicode escape characters. 
> > Example > Command entered > {code:java} > ./scripts/spark-k8s.sh v0.0.32 --job-args > "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" > --job pages{code} > > Error Message > > {code:java} > + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner > /opt/spark/work-dir/main.py --job-args > cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer > --job pages > 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of > HTTPS. > 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > usage: main.py [-h] --job JOB --job-args JOB_ARGS > main.py: error: unrecognized arguments: Balancer > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28095) Pyspark with kubernetes doesn't parse arguments with spaces as expected.
[ https://issues.apache.org/jira/browse/SPARK-28095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emma Dickson updated SPARK-28095: - Summary: Pyspark with kubernetes doesn't parse arguments with spaces as expected. (was: Pyspark on kubernetes doesn't parse arguments with spaces as expected.) > Pyspark with kubernetes doesn't parse arguments with spaces as expected. > > > Key: SPARK-28095 > URL: https://issues.apache.org/jira/browse/SPARK-28095 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.3 > Environment: Python 2.7.13 > Spark 2.4.3 > Kubernetes > >Reporter: Emma Dickson >Priority: Minor > Labels: newbie, usability > > When passing in arguments to a bash script that calls a python file which > sets up a pyspark context strings with spaces are processed as individual > strings even when encased in double quotes, using backslashes or unicode > escape characters. > > Example > Command entered > {code:java} > ./scripts/spark-k8s.sh v0.0.32 --job-args > "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" > --job pages{code} > > Error Message > > {code:java} > + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner > /opt/spark/work-dir/main.py --job-args > cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer > --job pages > 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of > HTTPS. > 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > usage: main.py [-h] --job JOB --job-args JOB_ARGS > main.py: error: unrecognized arguments: Balancer > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28095) Pyspark on kubernetes doesn't parse arguments with spaces as expected.
Emma Dickson created SPARK-28095: Summary: Pyspark on kubernetes doesn't parse arguments with spaces as expected. Key: SPARK-28095 URL: https://issues.apache.org/jira/browse/SPARK-28095 Project: Spark Issue Type: Bug Components: Kubernetes, PySpark Affects Versions: 2.4.3 Environment: Python 2.7.13 Spark 2.4.3 Kubernetes Reporter: Emma Dickson When passing arguments to a bash script that calls a Python file which sets up a PySpark context, strings with spaces are processed as individual strings, even when encased in double quotes, escaped with backslashes, or written with unicode escape characters. Example Command entered
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" --job pages
{code}
Error Message
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner /opt/spark/work-dir/main.py --job-args cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
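[Editorial note] A plausible mechanism for the symptom above, sketched in Scala purely as an analogy (the real code path involves bash and spark-submit entrypoints, and the argument values here are placeholders): if any layer joins the argument list into one string and later re-splits it on whitespace, the original shell quoting is lost.
{code:scala}
// The argument list as the user quoted it on the command line.
val args = Seq("--job-args", "cos://bucket/prefix/Load Balancer", "--job", "pages")

// If an intermediate layer flattens the list to a single string...
val flattened = args.mkString(" ")

// ...and later re-splits on whitespace, the quoted value breaks in two,
// matching the "unrecognized arguments: Balancer" failure shown above.
val reparsed = flattened.split("\\s+").toSeq
// Seq(--job-args, cos://bucket/prefix/Load, Balancer, --job, pages)
{code}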
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866973#comment-16866973 ] M. Le Bihan edited comment on SPARK-18105 at 6/18/19 7:10 PM: -- My trick eventually didn't succeed, and I fell back into the bug again. I've attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala but received this kind of stacktrace: {code:log} 2019-06-18 20:43:54.747 INFO 1539 --- [er for task 547] o.a.s.s.ShuffleBlockFetcherIterator : Started 0 remote fetches in 0 ms 2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] org.apache.spark.executor.Executor : Exception in task 93.0 in stage 4.2 (TID 547) java.lang.NullPointerException: null at org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) ~[scala-library-2.12.8.jar!/:na] at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) ~[scala-library-2.12.8.jar!/:na] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) ~[scala-library-2.12.8.jar!/:na] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ~[scala-library-2.12.8.jar!/:na] at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at org.apache.spark.scheduler.Task.run(Task.scala:121) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) ~[spark-core_2.12-2.4.3.jar!/:2.4.3] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_212] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_212] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212] {code} This issue (SPARK-18105) prevents using Spark at all. {color:#654982}Spark +cannot+ handle large files (1 GB, 5 GB, 10 GB that I ask Spark to join for me): *LOL, LOL, LOL!*{color} When will this be corrected?! It should be raised to urgent. Spark, with 300 open and in-progress (but stalling) issues, is becoming less and less reliable each day. I'm about to send a message to the dev forum to ask if developers can stop implementing new features until they have corrected the issues in the features they have already written. Spark today cannot be used at all. At least offer a way to disable the LZ4 feature if it doesn't work!
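While this remains open, the block compression codec Spark uses for shuffle data is an ordinary configuration setting, so LZ4 can be swapped out without code changes. A minimal sketch (note that a later comment on this ticket reports a similar failure with Snappy, so this is a workaround to try, not a guaranteed fix):
{code:java}
import org.apache.spark.sql.SparkSession

// spark.io.compression.codec controls compression of internal data such as
// RDD partitions, broadcast variables and shuffle outputs; valid values
// include lz4, lzf, snappy and zstd (Spark 2.3+).
val spark = SparkSession.builder()
  .appName("lz4-corruption-workaround")
  .config("spark.io.compression.codec", "snappy")
  .getOrCreate()
{code}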
[jira] [Commented] (SPARK-28092) Spark cannot load files with COLON(:) char if not specified full path
[ https://issues.apache.org/jira/browse/SPARK-28092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866978#comment-16866978 ] Steve Loughran commented on SPARK-28092: BTW, the code snippets say ("s3://bucket/prefix/*", ...); I presume that's really s3a://, as s3:// URLs mean EMR and "Amazon's private fork gets to field the support call" > Spark cannot load files with COLON(:) char if not specified full path > - > > Key: SPARK-28092 > URL: https://issues.apache.org/jira/browse/SPARK-28092 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 > Environment: Cloudera 6.2 > Spark latest parcel (I think 2.4.3) >Reporter: Ladislav Jech >Priority: Major > > Scenario: > I have CSV files in S3 bucket like this: > s3a://bucket/prefix/myfile_2019:04:05.csv > s3a://bucket/prefix/myfile_2019:04:06.csv > Now when I try to load files with something like: > df = spark.read.load("s3://bucket/prefix/*", format="csv", sep=":", > inferSchema="true", header="true") > > It fails on error about URI (sorry don't have here exact exception), but when > I list all files from S3 and provide path like array: > df = > spark.read.load(path=["s3://bucket/prefix/myfile_2019:04:05.csv","s3://bucket/prefix/myfile_2019:04:05.csv"], > format="csv", sep=":", inferSchema="true", header="true") > > It works, the reason is COLON character in the name of files as per my > observations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28092) Spark cannot load files with COLON(:) char if not specified full path
[ https://issues.apache.org/jira/browse/SPARK-28092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866977#comment-16866977 ] Steve Loughran commented on SPARK-28092: You're going to have to see if you can replicate this on the real ASF artifacts; otherwise push it to your supplier of built Spark artifacts. FWIW, problems with ":" in filenames in paths are known: HADOOP-14829 and HADOOP-14217, for example. HDFS has never allowed ":" in names, so nobody even noticed that Filesystem.globFiles doesn't work. While nobody actually wants to stop this, it's not something which, AFAIK, anyone is actively working on. Patches welcome, *with tests* > Spark cannot load files with COLON(:) char if not specified full path > - > > Key: SPARK-28092 > URL: https://issues.apache.org/jira/browse/SPARK-28092 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 > Environment: Cloudera 6.2 > Spark latest parcel (I think 2.4.3) >Reporter: Ladislav Jech >Priority: Major > > Scenario: > I have CSV files in S3 bucket like this: > s3a://bucket/prefix/myfile_2019:04:05.csv > s3a://bucket/prefix/myfile_2019:04:06.csv > Now when I try to load files with something like: > df = spark.read.load("s3://bucket/prefix/*", format="csv", sep=":", > inferSchema="true", header="true") > > It fails on error about URI (sorry don't have here exact exception), but when > I list all files from S3 and provide path like array: > df = > spark.read.load(path=["s3://bucket/prefix/myfile_2019:04:05.csv","s3://bucket/prefix/myfile_2019:04:05.csv"], > format="csv", sep=":", inferSchema="true", header="true") > > It works, the reason is COLON character in the name of files as per my > observations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner
[ https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28084: Assignee: (was: Apache Spark) > LOAD DATA command resolving the partition column name considering case > sensitive manner > --- > > Key: SPARK-28084 > URL: https://issues.apache.org/jira/browse/SPARK-28084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Sujith Chacko >Priority: Major > Attachments: parition_casesensitive.PNG > > > The LOAD DATA command resolves the partition column name in a case-sensitive manner, whereas the INSERT command resolves it case-insensitively. Refer to the snapshot for more details. > !image-2019-06-18-00-04-22-475.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner
[ https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28084: Assignee: Apache Spark > LOAD DATA command resolving the partition column name considering case > sensitive manner > --- > > Key: SPARK-28084 > URL: https://issues.apache.org/jira/browse/SPARK-28084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Sujith Chacko >Assignee: Apache Spark >Priority: Major > Attachments: parition_casesensitive.PNG > > > The LOAD DATA command resolves the partition column name in a case-sensitive manner, whereas the INSERT command resolves it case-insensitively. Refer to the snapshot for more details. > !image-2019-06-18-00-04-22-475.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
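The reported asymmetry is easy to show side by side. A hypothetical spark-shell sketch (the table, column names and input path are invented, not from the ticket; LOAD DATA requires a Hive-format table and a Hive-enabled session):
{code:java}
// Hive-format partitioned table; requires a Hive-enabled SparkSession.
spark.sql("CREATE TABLE src (id INT) PARTITIONED BY (dt STRING) STORED AS PARQUET")

// INSERT resolves the partition column case-insensitively: DT matches dt.
spark.sql("INSERT INTO src PARTITION (DT='20190618') VALUES (1)")

// Per the report, LOAD DATA resolves it case-sensitively, so DT fails here.
spark.sql("LOAD DATA LOCAL INPATH '/tmp/part.data' INTO TABLE src PARTITION (DT='20190618')")
{code}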
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861803#comment-16861803 ] M. Le Bihan edited comment on SPARK-18105 at 6/18/19 4:37 PM: -- It _seems_ that exchanging org.lz4:lz4-java:1.4.0 for org.lz4:lz4-java:1.6.0 helps (excluding the transitive lz4-java and pinning 1.6.0 explicitly):
{code:java}
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.lz4</groupId>
            <artifactId>lz4-java</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.lz4</groupId>
    <artifactId>lz4-java</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>log4j-over-slf4j</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
{code}
If it's right, the cause _could be_ a bug they corrected in 1.4.1. Didn't succeed. > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Major > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at >
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at
[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
[ https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866777#comment-16866777 ] Liang-Chi Hsieh commented on SPARK-28079: - Isn't it the expected behavior as documented in {{PERMISSIVE}} mode of CSV? > CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is > manually added to the schema > - > > Key: SPARK-28079 > URL: https://issues.apache.org/jira/browse/SPARK-28079 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.3 >Reporter: F Jimenez >Priority: Major > > When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged > as such and read in. Only way to get them flagged is to manually set > "columnNameOfCorruptRecord" AND manually setting the schema including this > column. Example: > {code:java} > // Second row has a 4th column that is not declared in the header/schema > val csvText = s""" > | FieldA, FieldB, FieldC > | a1,b1,c1 > | a2,b2,c2,d*""".stripMargin > val csvFile = new File("/tmp/file.csv") > FileUtils.write(csvFile, csvText) > val reader = sqlContext.read > .format("csv") > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "corrupt") > .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING") > reader.load(csvFile.getAbsolutePath).show(truncate = false) > {code} > This produces the correct result: > {code:java} > ++--+--+--+ > |corrupt |fieldA|fieldB|fieldC| > ++--+--+--+ > |null | a1 |b1 |c1 | > | a2,b2,c2,d*| a2 |b2 |c2 | > ++--+--+--+ > {code} > However removing the "schema" option and going: > {code:java} > val reader = sqlContext.read > .format("csv") > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "corrupt") > reader.load(csvFile.getAbsolutePath).show(truncate = false) > {code} > Yields: > {code:java} > +---+---+---+ > | FieldA| FieldB| FieldC| > +---+---+---+ > | a1 |b1 |c1 | > | a2 |b2 |c2 | > +---+---+---+ > {code} > The fourth value "d*" in the second row has been removed and the row not > marked as corrupt > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
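A usage note on the documented {{PERMISSIVE}} behavior: once the corrupt-record column is declared in the schema, as in the ticket's first (working) snippet, malformed rows can be isolated with an ordinary filter. A minimal sketch reusing the {{reader}} and {{csvFile}} from that snippet:
{code:java}
// `reader` declares the "corrupt" column in its schema, so malformed rows
// carry the raw input line in it; clean rows have null there.
val parsed = reader.load(csvFile.getAbsolutePath)
val badRows = parsed.filter(parsed("corrupt").isNotNull)
badRows.show(truncate = false) // shows the a2,b2,c2,d* row
{code}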
[jira] [Created] (SPARK-28094) Multiple left joins or aggregations in one query produce incorrect results
Joe Ammann created SPARK-28094: -- Summary: Multiple left joins or aggregations in one query produce incorrect results Key: SPARK-28094 URL: https://issues.apache.org/jira/browse/SPARK-28094 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Joe Ammann Structured streaming queries which involve more than one non-map-like transformation (either left joins or aggregations) produce incorrect (and sometimes inconsistent) results. I have put up a sample here as a Gist https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17 The test case involves 4 topics, and I've implemented 6 different use cases, of which 3 fail to produce the expected results (a sketch of the failing join shape follows this list):
1) A inner join B: to topic A_B. Expected: 10 inner joined messages. Observation: OK
2) A inner join B outer join C: to topic A_B_outer_C. Expected: 9 inner joined messages, last one in watermark. Observation: OK
3) A inner join B outer join C outer join D: to topic A_B_outer_C_outer_D. Expected: 9 inner/outer joined messages, 3 match C, 1 match D, last one in watermark. Observation: NOK, only 3 messages are produced, where the C outer join matches
4) B outer join C: to topic B_outer_C. Expected: 9 outer joined messages, 3 match C, 1 match D, last one in watermark. Observation: NOK, same as 3
5) A inner join B, agg on field of A: to topic A_inner_B_agg. Expected: 3 groups with a total of 9 inner joined messages, last one in watermark. Observation: OK
6) A inner join B outer join C, agg on field of A: to topic A_inner_B_outer_C_agg. Expected: 3 groups with a total of 9 inner joined messages, last one in watermark. Observation: NOK, 3 groups with 1 message each, where the C outer join matches
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
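For orientation, a minimal sketch of the failing shape from use case 3 (A inner join B, then left outer join C). The sources, key and timestamp column names are placeholders rather than the gist's actual schema; the point is only the required structure, since stream-stream outer joins need watermarks on both sides plus an event-time range condition:
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// topicA/topicB/topicC stand in for streaming DataFrames already parsed into
// (aKey, aTime), (bKey, bTime) and (cKey, cTime) columns.
def useCase3(topicA: DataFrame, topicB: DataFrame, topicC: DataFrame): DataFrame = {
  val a = topicA.withWatermark("aTime", "30 seconds")
  val b = topicB.withWatermark("bTime", "30 seconds")
  val c = topicC.withWatermark("cTime", "30 seconds")

  a.join(b, expr("aKey = bKey AND bTime BETWEEN aTime AND aTime + interval 10 seconds"))
   .join(c, expr("aKey = cKey AND cTime BETWEEN aTime AND aTime + interval 10 seconds"), "left_outer")
}
{code}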
[jira] [Updated] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr
[ https://issues.apache.org/jira/browse/SPARK-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28093: Affects Version/s: (was: 2.4.3) 2.3.3 > Built-in function trim/ltrim/rtrim has bug when using trimStr > - > > Key: SPARK-28093 > URL: https://issues.apache.org/jira/browse/SPARK-28093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Priority: Major > > {noformat} > spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); > z > spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); > xyz > spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); > xy > spark-sql> > {noformat} > {noformat} > postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); > btrim | btrim > ---+--- > Tom | bar > (1 row) > postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); > ltrim |ltrim > ---+-- > test | XxyLAST WORD > (1 row) > postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); > rtrim | rtrim > ---+--- > test | TURNERyxX > (1 row) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr
[ https://issues.apache.org/jira/browse/SPARK-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28093: Affects Version/s: (was: 3.0.0) 2.4.3 > Built-in function trim/ltrim/rtrim has bug when using trimStr > - > > Key: SPARK-28093 > URL: https://issues.apache.org/jira/browse/SPARK-28093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yuming Wang >Priority: Major > > {noformat} > spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); > z > spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); > xyz > spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); > xy > spark-sql> > {noformat} > {noformat} > postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); > btrim | btrim > ---+--- > Tom | bar > (1 row) > postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); > ltrim |ltrim > ---+-- > test | XxyLAST WORD > (1 row) > postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); > rtrim | rtrim > ---+--- > test | TURNERyxX > (1 row) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr
[ https://issues.apache.org/jira/browse/SPARK-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28093: Assignee: Apache Spark > Built-in function trim/ltrim/rtrim has bug when using trimStr > - > > Key: SPARK-28093 > URL: https://issues.apache.org/jira/browse/SPARK-28093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {noformat} > spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); > z > spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); > xyz > spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); > xy > spark-sql> > {noformat} > {noformat} > postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); > btrim | btrim > ---+--- > Tom | bar > (1 row) > postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); > ltrim |ltrim > ---+-- > test | XxyLAST WORD > (1 row) > postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); > rtrim | rtrim > ---+--- > test | TURNERyxX > (1 row) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr
[ https://issues.apache.org/jira/browse/SPARK-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28093: Assignee: (was: Apache Spark) > Built-in function trim/ltrim/rtrim has bug when using trimStr > - > > Key: SPARK-28093 > URL: https://issues.apache.org/jira/browse/SPARK-28093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); > z > spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); > xyz > spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); > xy > spark-sql> > {noformat} > {noformat} > postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); > btrim | btrim > ---+--- > Tom | bar > (1 row) > postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); > ltrim |ltrim > ---+-- > test | XxyLAST WORD > (1 row) > postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); > rtrim | rtrim > ---+--- > test | TURNERyxX > (1 row) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr
Yuming Wang created SPARK-28093: --- Summary: Built-in function trim/ltrim/rtrim has bug when using trimStr Key: SPARK-28093 URL: https://issues.apache.org/jira/browse/SPARK-28093 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang {noformat} spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); z spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); xyz spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); xy spark-sql> {noformat} {noformat} postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x'); btrim | btrim ---+--- Tom | bar (1 row) postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy'); ltrim |ltrim ---+-- test | XxyLAST WORD (1 row) postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy'); rtrim | rtrim ---+--- test | TURNERyxX (1 row) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
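The expected semantics, matching the PostgreSQL btrim/ltrim/rtrim output above, treat the second argument as a set of characters to strip rather than a literal string. A plain-Scala reference model of that behavior (not Spark's implementation), checked against the ticket's examples:
{code:java}
def ltrim(s: String, trimChars: String): String =
  s.dropWhile(c => trimChars.contains(c))

def rtrim(s: String, trimChars: String): String =
  s.reverse.dropWhile(c => trimChars.contains(c)).reverse

def trim(s: String, trimChars: String): String =
  rtrim(ltrim(s, trimChars), trimChars)

assert(trim("yxTomxx", "xyz") == "Tom")                  // btrim
assert(ltrim("zzzytest", "xyz") == "test")
assert(ltrim("xyxXxyLAST WORD", "xy") == "XxyLAST WORD") // stops at capital X
assert(rtrim("testxxzx", "xyz") == "test")
assert(rtrim("TURNERyxXxy", "xy") == "TURNERyxX")
{code}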
[jira] [Commented] (SPARK-28077) ANSI SQL: OVERLAY function(T312)
[ https://issues.apache.org/jira/browse/SPARK-28077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866351#comment-16866351 ] jiaan.geng commented on SPARK-28077: I'm working on it. > ANSI SQL: OVERLAY function(T312) > > > Key: SPARK-28077 > URL: https://issues.apache.org/jira/browse/SPARK-28077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{overlay(_string_ placing _string_ from _int_ [for _int_])}}|{{text}}|Replace substring|{{overlay('Txxxxas' placing 'hom' from 2 for 4)}}|{{Thomas}}| > For example: > {code:sql} > SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f"; > SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba"; > SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo"; > SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba"; > {code} > https://www.postgresql.org/docs/11/functions-string.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
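A plain-Scala model of the T312 semantics implied by the examples above: the FROM position is 1-based and the optional FOR length defaults to the replacement's length. This is a reference sketch, not the implementation being proposed:
{code:java}
// Reference model only; pos is 1-based as in SQL.
def overlay(input: String, replacement: String, pos: Int, len: Option[Int] = None): String = {
  val removed = len.getOrElse(replacement.length)
  input.substring(0, pos - 1) +
    replacement +
    input.substring(math.min(input.length, pos - 1 + removed))
}

assert(overlay("abcdef", "45", 4) == "abc45f")
assert(overlay("yabadoo", "daba", 5) == "yabadaba")
assert(overlay("yabadoo", "daba", 5, Some(0)) == "yabadabadoo")
assert(overlay("babosa", "ubb", 2, Some(4)) == "bubba")
{code}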
[jira] [Commented] (SPARK-28073) ANSI SQL: Character literals
[ https://issues.apache.org/jira/browse/SPARK-28073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866344#comment-16866344 ] jiaan.geng commented on SPARK-28073: I'm working on it. > ANSI SQL: Character literals > > > Key: SPARK-28073 > URL: https://issues.apache.org/jira/browse/SPARK-28073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Feature ID||Feature Name||Feature Description|| > |E021-03|Character literals|— Subclause 5.3, “<character string literal>”: <quote> [ <character representation>... ] <quote>| > Example: > {code:sql} > SELECT 'first line' > ' - next line' > ' - third line' > AS "Three lines to one"; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28077) ANSI SQL: OVERLAY function(T312)
[ https://issues.apache.org/jira/browse/SPARK-28077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28077: Summary: ANSI SQL: OVERLAY function(T312) (was: String Functions: Add support OVERLAY) > ANSI SQL: OVERLAY function(T312) > > > Key: SPARK-28077 > URL: https://issues.apache.org/jira/browse/SPARK-28077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{overlay(_string_ placing _string_ from _int_ [for _int_])}}|{{text}}|Replace substring|{{overlay('Txxxxas' placing 'hom' from 2 for 4)}}|{{Thomas}}| > For example: > {code:sql} > SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f"; > SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba"; > SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo"; > SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba"; > {code} > https://www.postgresql.org/docs/11/functions-string.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866336#comment-16866336 ] Luca Canali edited comment on SPARK-28091 at 6/18/19 8:07 AM: -- [~irashid] given your work on SPARK-24918 you may be interested to comment on this? was (Author: lucacanali): @irashid given your work on SPARK-24918 you may be interested to comment on this? > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark > executor plugin metrics to the Spark metrics systems implemented with the > Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics, see > also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A > typical workflow is to sink the metrics to a storage system and build > dashboards on top of that. > Improvement: The original goal of this work was to add instrumentation for S3 > filesystem access metrics by Spark job. Currently, [[ExecutorSource]] > instruments HDFS and local filesystem metrics. Rather than extending the code > there, we proposes to add a metrics plugin system which is of more flexible > and general use. > Advantages: > * The metric plugin system makes it easy to implement instrumentation for S3 > access by Spark jobs. > * The metrics plugin system allows for easy extensions of how Spark collects > HDFS-related workload metrics. This is currently done using the Hadoop > Filesystem GetAllStatistics method, which is deprecated in recent versions of > Hadoop. Recent versions of Hadoop Filesystem recommend using method > GetGlobalStorageStatistics, which also provides several additional metrics. > GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been > introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an > easy way to “opt in” using such new API calls for those deploying suitable > Hadoop versions. > * We also have the use case of adding Hadoop filesystem monitoring for a > custom Hadoop compliant filesystem in use in our organization (EOS using the > XRootD protocol). The metrics plugin infrastructure makes this easy to do. > Others may have similar use cases. > * More generally, this method makes it straightforward to plug in Filesystem > and other metrics to the Spark monitoring system. Future work on plugin > implementation can address extending monitoring to measure usage of external > resources (OS, filesystem, network, accelerator cards, etc), that maybe would > not normally be considered general enough for inclusion in Apache Spark code, > but that can be nevertheless useful for specialized use cases, tests or > troubleshooting. > Implementation: > The proposed implementation is currently a WIP open for comments and > improvements. It is based on the work on Executor Plugin of SPARK-24918 and > builds on recent work on extending Spark executor metrics, such as SPARK-25228 > Tests and examples: > This has been so far manually tested running Spark on YARN and K8S clusters, > in particular for monitoring S3 and for extending HDFS instrumentation with > the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric > plugin example and code used for testing are available. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28092) Spark cannot load files with COLON(:) char if not specified full path
Ladislav Jech created SPARK-28092: - Summary: Spark cannot load files with COLON(:) char if not specified full path Key: SPARK-28092 URL: https://issues.apache.org/jira/browse/SPARK-28092 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3 Environment: Cloudera 6.2 Spark latest parcel (I think 2.4.3) Reporter: Ladislav Jech Scenario: I have CSV files in an S3 bucket like this: s3a://bucket/prefix/myfile_2019:04:05.csv s3a://bucket/prefix/myfile_2019:04:06.csv Now when I try to load the files with something like: df = spark.read.load("s3://bucket/prefix/*", format="csv", sep=":", inferSchema="true", header="true") It fails with an error about the URI (sorry, I don't have the exact exception here), but when I list all the files from S3 and provide the paths as an array: df = spark.read.load(path=["s3://bucket/prefix/myfile_2019:04:05.csv","s3://bucket/prefix/myfile_2019:04:05.csv"], format="csv", sep=":", inferSchema="true", header="true") It works; the reason is the COLON character in the file names, as per my observations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
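A sketch automating the reporter's workaround: enumerate the objects with the Hadoop FileSystem API instead of relying on the glob, then hand the fully qualified paths to the reader. The bucket and prefix are placeholders, and a spark-shell style {{spark}} session is assumed:
{code:java}
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// List the prefix ourselves so Spark never has to glob names containing ':'.
val fs = FileSystem.get(new URI("s3a://bucket/"), spark.sparkContext.hadoopConfiguration)
val csvPaths = fs.listStatus(new Path("s3a://bucket/prefix/"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".csv"))

val df = spark.read
  .format("csv")
  .option("sep", ":")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csvPaths: _*)
{code}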
[jira] [Assigned] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28091: Assignee: (was: Apache Spark) > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark > executor plugin metrics to the Spark metrics systems implemented with the > Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics, see > also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A > typical workflow is to sink the metrics to a storage system and build > dashboards on top of that. > Improvement: The original goal of this work was to add instrumentation for S3 > filesystem access metrics by Spark job. Currently, [[ExecutorSource]] > instruments HDFS and local filesystem metrics. Rather than extending the code > there, we proposes to add a metrics plugin system which is of more flexible > and general use. > Advantages: > * The metric plugin system makes it easy to implement instrumentation for S3 > access by Spark jobs. > * The metrics plugin system allows for easy extensions of how Spark collects > HDFS-related workload metrics. This is currently done using the Hadoop > Filesystem GetAllStatistics method, which is deprecated in recent versions of > Hadoop. Recent versions of Hadoop Filesystem recommend using method > GetGlobalStorageStatistics, which also provides several additional metrics. > GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been > introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an > easy way to “opt in” using such new API calls for those deploying suitable > Hadoop versions. > * We also have the use case of adding Hadoop filesystem monitoring for a > custom Hadoop compliant filesystem in use in our organization (EOS using the > XRootD protocol). The metrics plugin infrastructure makes this easy to do. > Others may have similar use cases. > * More generally, this method makes it straightforward to plug in Filesystem > and other metrics to the Spark monitoring system. Future work on plugin > implementation can address extending monitoring to measure usage of external > resources (OS, filesystem, network, accelerator cards, etc), that maybe would > not normally be considered general enough for inclusion in Apache Spark code, > but that can be nevertheless useful for specialized use cases, tests or > troubleshooting. > Implementation: > The proposed implementation is currently a WIP open for comments and > improvements. It is based on the work on Executor Plugin of SPARK-24918 and > builds on recent work on extending Spark executor metrics, such as SPARK-25228 > Tests and examples: > This has been so far manually tested running Spark on YARN and K8S clusters, > in particular for monitoring S3 and for extending HDFS instrumentation with > the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric > plugin example and code used for testing are available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28091: Assignee: Apache Spark > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Apache Spark >Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark > executor plugin metrics to the Spark metrics systems implemented with the > Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics, see > also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A > typical workflow is to sink the metrics to a storage system and build > dashboards on top of that. > Improvement: The original goal of this work was to add instrumentation for S3 > filesystem access metrics by Spark job. Currently, [[ExecutorSource]] > instruments HDFS and local filesystem metrics. Rather than extending the code > there, we proposes to add a metrics plugin system which is of more flexible > and general use. > Advantages: > * The metric plugin system makes it easy to implement instrumentation for S3 > access by Spark jobs. > * The metrics plugin system allows for easy extensions of how Spark collects > HDFS-related workload metrics. This is currently done using the Hadoop > Filesystem GetAllStatistics method, which is deprecated in recent versions of > Hadoop. Recent versions of Hadoop Filesystem recommend using method > GetGlobalStorageStatistics, which also provides several additional metrics. > GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been > introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an > easy way to “opt in” using such new API calls for those deploying suitable > Hadoop versions. > * We also have the use case of adding Hadoop filesystem monitoring for a > custom Hadoop compliant filesystem in use in our organization (EOS using the > XRootD protocol). The metrics plugin infrastructure makes this easy to do. > Others may have similar use cases. > * More generally, this method makes it straightforward to plug in Filesystem > and other metrics to the Spark monitoring system. Future work on plugin > implementation can address extending monitoring to measure usage of external > resources (OS, filesystem, network, accelerator cards, etc), that maybe would > not normally be considered general enough for inclusion in Apache Spark code, > but that can be nevertheless useful for specialized use cases, tests or > troubleshooting. > Implementation: > The proposed implementation is currently a WIP open for comments and > improvements. It is based on the work on Executor Plugin of SPARK-24918 and > builds on recent work on extending Spark executor metrics, such as SPARK-25228 > Tests and examples: > This has been so far manually tested running Spark on YARN and K8S clusters, > in particular for monitoring S3 and for extending HDFS instrumentation with > the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric > plugin example and code used for testing are available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856558#comment-16856558 ] Piotr Chowaniec edited comment on SPARK-18105 at 6/18/19 7:21 AM: -- I have a similar issue with Spark 2.3.2. Here is a stack trace: {code:java} org.apache.spark.scheduler.DAGScheduler : ShuffleMapStage 647 (count at Step.java:20) failed in 1.908 s due to org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithKeys_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithKeys_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Stream is corrupted at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:252) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:170) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:349) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1381) at org.apache.spark.util.Utils$.copyStream(Utils.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:436) ... 21 more Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 2010 of input buffer at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39) at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:247) ... 
29 more {code} It happens during an ETL process that has about 200 steps. It looks like it depends on the input data, because we get the exceptions only in the production environment (on test and dev machines the same process runs without problems on different data). Unfortunately there is no way to use the production data in other environments, so we cannot find the differences. Changing the compression codec to Snappy gives: {code:java} o.apache.spark.scheduler.TaskSetManager : Lost task 0.0 in stage 852.3 (TID 308 36, localhost, executor driver): FetchFailed(BlockManagerId(driver, DNS.domena, 33588, None), shuffleId=298, mapId=2, reduceId=3, message= org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62) at
[jira] [Created] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
Luca Canali created SPARK-28091: --- Summary: Extend Spark metrics system with executor plugin metrics Key: SPARK-28091 URL: https://issues.apache.org/jira/browse/SPARK-28091 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Luca Canali This proposes to improve Spark instrumentation by adding a hook for Spark executor plugin metrics to the Spark metrics system implemented with the Dropwizard/Codahale library. Context: The Spark metrics system provides a large variety of metrics, see also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A typical workflow is to sink the metrics to a storage system and build dashboards on top of that. Improvement: The original goal of this work was to add instrumentation for S3 filesystem access metrics by Spark jobs. Currently, [[ExecutorSource]] instruments HDFS and local filesystem metrics. Rather than extending the code there, we propose to add a metrics plugin system which is more flexible and of more general use. Advantages: * The metric plugin system makes it easy to implement instrumentation for S3 access by Spark jobs. * The metrics plugin system allows for easy extensions of how Spark collects HDFS-related workload metrics. This is currently done using the Hadoop Filesystem GetAllStatistics method, which is deprecated in recent versions of Hadoop. Recent versions of Hadoop Filesystem recommend using the method GetGlobalStorageStatistics, which also provides several additional metrics. GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an easy way to “opt in” to using such new API calls for those deploying suitable Hadoop versions. * We also have the use case of adding Hadoop filesystem monitoring for a custom Hadoop compliant filesystem in use in our organization (EOS using the XRootD protocol). The metrics plugin infrastructure makes this easy to do. Others may have similar use cases. * More generally, this method makes it straightforward to plug in Filesystem and other metrics to the Spark monitoring system. Future work on plugin implementation can address extending monitoring to measure usage of external resources (OS, filesystem, network, accelerator cards, etc.) that would perhaps not normally be considered general enough for inclusion in Apache Spark code, but that can nevertheless be useful for specialized use cases, tests or troubleshooting. Implementation: The proposed implementation is currently a WIP open for comments and improvements. It is based on the work on Executor Plugin of SPARK-24918 and builds on recent work on extending Spark executor metrics, such as SPARK-25228. Tests and examples: This has been so far manually tested running Spark on YARN and K8S clusters, in particular for monitoring S3 and for extending HDFS instrumentation with the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. An executor metric plugin example and the code used for testing are available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
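To make the proposal concrete, a hypothetical sketch of what an executor metric plugin could look like: a class that registers a Dropwizard gauge with a MetricRegistry handed to it by Spark. The class name, the registration hook and the metric names are illustrative assumptions, not the WIP's actual API:
{code:java}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Hypothetical plugin shape: Spark would pass in the executor's registry
// (hook name assumed) and the plugin registers whatever gauges it wants,
// which then flow through the existing Dropwizard sink pipeline.
class S3AMetricsPlugin {
  def registerMetrics(registry: MetricRegistry): Unit = {
    registry.register(
      MetricRegistry.name("filesystem", "s3a", "bytesRead"),
      new Gauge[Long] {
        override def getValue: Long = currentBytesRead()
      })
  }

  // Placeholder for a real lookup, e.g. via GetGlobalStorageStatistics.
  private def currentBytesRead(): Long = 0L
}
{code}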
[jira] [Resolved] (SPARK-28072) Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28072. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24889 > Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+ > --- > > Key: SPARK-28072 > URL: https://issues.apache.org/jira/browse/SPARK-28072 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun > Priority: Major > Fix For: 3.0.0 > > > With JDK9+, the generated bytecode of `FromUnixTime` raises > `java.lang.IncompatibleClassChangeError` due to > [JDK-8145148|https://bugs.openjdk.java.net/browse/JDK-8145148]. > This is a blocker in our JDK11 Jenkins job.
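The ticket does not include a reproducer; a minimal sketch that should exercise the `FromUnixTime` codegen path (a non-literal input column is used so that constant folding does not bypass code generation):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_unixtime

// Sketch: force FromUnixTime through codegen by applying it to a column
// rather than a literal. On affected JDK9+ builds the generated code
// raised java.lang.IncompatibleClassChangeError (JDK-8145148) before
// this fix.
val spark = SparkSession.builder().master("local[1]").appName("FromUnixTimeRepro").getOrCreate()
import spark.implicits._

Seq(0L, 1546300800L).toDF("ts")
  .select(from_unixtime($"ts"))
  .show()
{code}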
[jira] [Updated] (SPARK-28090) Spark hangs when an execution plan has many projections on nested structs
[ https://issues.apache.org/jira/browse/SPARK-28090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Yushchenko updated SPARK-28090: -- Description: This was already posted (#28016), but the provided example didn't always reproduce the error. This example reproduces the issue consistently. Spark applications freeze on the execution plan optimization stage (Catalyst) when a logical execution plan contains a lot of projections that operate on nested struct fields. The code listed below demonstrates the issue. To reproduce it, the Spark app does the following:
* A small dataframe is created from a JSON example.
* Several nested transformations (negation of a number) are applied on struct fields, and each time a new struct field is created.
* Once more than 9 such transformations are applied, the Catalyst optimizer freezes while optimizing the execution plan.
* You can control the freezing by choosing a different upper bound for the Range: e.g. it will work fine if the upper bound is 5, but will hang if the bound is 10.
{code:java}
package com.example

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructField, StructType}

import scala.collection.mutable.ListBuffer

object SparkApp1IssueSelfContained {

  // Sample data for a dataframe with nested structs
  val sample: List[String] =
    """ { "numerics": {"num1": 101, "num2": 102, "num3": 103, "num4": 104, "num5": 105, "num6": 106, "num7": 107, "num8": 108, "num9": 109, "num10": 110, "num11": 111, "num12": 112, "num13": 113, "num14": 114, "num15": 115} } """ ::
    """ { "numerics": {"num1": 201, "num2": 202, "num3": 203, "num4": 204, "num5": 205, "num6": 206, "num7": 207, "num8": 208, "num9": 209, "num10": 210, "num11": 211, "num12": 212, "num13": 213, "num14": 214, "num15": 215} } """ ::
    """ { "numerics": {"num1": 301, "num2": 302, "num3": 303, "num4": 304, "num5": 305, "num6": 306, "num7": 307, "num8": 308, "num9": 309, "num10": 310, "num11": 311, "num12": 312, "num13": 313, "num14": 314, "num15": 315} } """ :: Nil

  /**
    * Transforms a column inside a nested struct. The transformed value will be put into a new field of that nested struct.
    *
    * The output column name can omit the full path as the field will be created at the same level of nesting as the input column.
    *
    * @param inputColumnName  A column name for which to apply the transformation, e.g. `company.employee.firstName`.
    * @param outputColumnName The output column name. The path is optional, e.g. you can use `transformedName` instead of `company.employee.transformedName`.
    * @param expression       A function that applies a transformation to a column as a Spark expression.
    * @return A dataframe with a new field that contains transformed values.
    */
  def transformInsideNestedStruct(df: DataFrame,
                                  inputColumnName: String,
                                  outputColumnName: String,
                                  expression: Column => Column): DataFrame = {
    def mapStruct(schema: StructType, path: Seq[String], parentColumn: Option[Column] = None): Seq[Column] = {
      val mappedFields = new ListBuffer[Column]()

      def handleMatchedLeaf(field: StructField, curColumn: Column): Seq[Column] = {
        val newColumn = expression(curColumn).as(outputColumnName)
        mappedFields += newColumn
        Seq(curColumn)
      }

      def handleMatchedNonLeaf(field: StructField, curColumn: Column): Seq[Column] = {
        // Non-leaf columns need to be further processed recursively
        field.dataType match {
          case dt: StructType => Seq(struct(mapStruct(dt, path.tail, Some(curColumn)): _*).as(field.name))
          case _ => throw new IllegalArgumentException(s"Field '${field.name}' is not a struct type.")
        }
      }

      val fieldName = path.head
      val isLeaf = path.lengthCompare(2) < 0

      val newColumns = schema.fields.flatMap(field => {
        // This is the original column (struct field) we want to process
        val curColumn = parentColumn match {
          case None => new Column(field.name)
          case Some(col) => col.getField(field.name).as(field.name)
        }
        if (field.name.compareToIgnoreCase(fieldName) != 0) {
          // Copy unrelated fields as they were
          Seq(curColumn)
        } else {
          // We have found a match
          if (isLeaf) {
            handleMatchedLeaf(field, curColumn)
          } else {
            handleMatchedNonLeaf(field, curColumn)
          }
        }
      })

      newColumns ++ mappedFields
    }

    val schema = df.schema
    val path = inputColumnName.split('.')
    df.select(mapStruct(schema, path): _*)
  }

  /**
    * This Spark Job demonstrates an issue of execution plan freezing when there are a lot of projections
    * involving nested structs in an execution
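    ... */  // (the original comment and driver code are truncated here)

  // What follows is a hypothetical reconstruction (an assumption, not the
  // reporter's actual code) of the driver loop that the bullet points above
  // describe: each Range iteration adds one projection on a nested struct
  // field, and with an upper bound around 10 the Catalyst optimizer hangs.
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("NestedStructsHang").getOrCreate()
    import spark.implicits._

    var df = spark.read.json(sample.toDS)
    for (i <- Range(1, 10)) {
      // Negate numerics.num<i>, creating a new field next to it each time
      df = transformInsideNestedStruct(df, s"numerics.num$i", s"out$i", c => -c)
    }
    df.explain(true) // hangs in plan optimization when the bound is ~10
  }
}
{code}
The Range bound, field names, and session wiring in this reconstruction are guesses; only the "one nested transformation per iteration" shape comes from the description above.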
[jira] [Created] (SPARK-28090) Spark hangs when an execution plan has many projections on nested structs
Ruslan Yushchenko created SPARK-28090: - Summary: Spark hangs when an execution plan has many projections on nested structs Key: SPARK-28090 URL: https://issues.apache.org/jira/browse/SPARK-28090 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.4.3 Environment: Tried in
* Spark 2.2.1, Spark 2.4.3 in local mode on Linux, MacOS and Windows
* Spark 2.4.3 / Yarn on a Linux cluster
Reporter: Ruslan Yushchenko

This was already posted (#28016), but the provided example didn't always reproduce the error. Spark applications freeze on the execution plan optimization stage (Catalyst) when a logical execution plan contains a lot of projections that operate on nested struct fields. The reproduction steps and code are the same as in the updated description above.
[jira] [Updated] (SPARK-28088) String Functions: Enhance LPAD/RPAD function
[ https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28088: Description: Enhance the LPAD/RPAD functions to make the {{pad}} parameter optional. > String Functions: Enhance LPAD/RPAD function > > > Key: SPARK-28088 > URL: https://issues.apache.org/jira/browse/SPARK-28088 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Yuming Wang > Priority: Major > > Enhance the LPAD/RPAD functions to make the {{pad}} parameter optional.
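A short sketch of the intended usage. The default pad value is an assumption here: the ticket does not spell it out, but PostgreSQL and most SQL dialects default {{pad}} to a single space when it is omitted:
{code:java}
// Proposed optional-pad behaviour (sketch). The single-space default is
// an assumption carried over from PostgreSQL's LPAD/RPAD semantics.
spark.sql("SELECT lpad('hi', 5, 'xy')").show() // "xyxhi" (explicit pad, existing behaviour)
spark.sql("SELECT lpad('hi', 5)").show()       // "   hi" (pad omitted)
spark.sql("SELECT rpad('hi', 5)").show()       // "hi   " (pad omitted)
{code}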