[jira] [Updated] (SPARK-28103) Cannot infer filters from union table with empty local relation table properly

2019-06-18 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-28103:
-
Description: 
Basically, the constraints of a union table can become empty if any 
subtable is turned into an empty local relation. The side effect is that filters 
cannot be inferred correctly (by InferFiltersFromConstraints).

 
 We may reproduce the issue with the following setup:

1) Prepare two tables: 

 
{code:java}
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING 
PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING 
PARQUET");{code}
 

2) Create a union view on table1. 
{code:java}
spark.sql("""
     | CREATE VIEW partitioned_table_1 AS
     | SELECT * FROM table1 WHERE id = 'a'
     | UNION ALL
     | SELECT * FROM table1 WHERE id = 'b'
     | UNION ALL
     | SELECT * FROM table1 WHERE id = 'c'
     | UNION ALL
     | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
     | """.stripMargin){code}
 

 3) View the optimized plan of this SQL. The filter {{t2.id = 'a'}} cannot be 
inferred. We can see that the constraints of the left table are empty. 
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] = 'a'").queryExecution.optimizedPlan

res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter (isnotnull(id#0) && (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation , [id#0, val#1]
:  :- LocalRelation , [id#0, val#1]
:  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter isnotnull(id#4)
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] = 
'a'").queryExecution.optimizedPlan.children(0).constraints
res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
 
{code}
 

4) Modify the query to avoid the empty local relations. The filter 
{{t2.id IN ('a','b','c','d')}} is then inferred properly, and the constraints of 
the left table are no longer empty. 
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] IN 
('a','b','c','d')").queryExecution.optimizedPlan
res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) && isnotnull(id#0))
:     +- Relation[id#0,val#1] parquet
+- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = 
b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
   +- Relation[id#4,val#5] parquet
 
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] IN 
('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet = 
Set(isnotnull(id#0), id#0 IN (a,b,c,d), ((((id#0 = a) || (id#0 = b)) || (id#0 = 
c)) || NOT id#0 IN (a,b,c)))
{code}
 

One possible workaround is to create a rule that removes all empty local 
relations from a union. Alternatively, when we convert a relation into an empty 
local relation, we should preserve its constraints in the empty local 
relation as well. 
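For illustration only, a minimal sketch of the first idea as a Catalyst rule (the name {{RemoveEmptyUnionChildren}} is hypothetical, and a real rule would also need to handle output-attribute rewriting the way the existing {{PropagateEmptyRelation}} rule does):
{code:java}
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, Union}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: drop empty LocalRelation children from a Union so the
// surviving children keep contributing constraints. The Union node itself is
// kept so that its output attributes stay stable.
object RemoveEmptyUnionChildren extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case u @ Union(children)
        if children.exists(isEmptyLocal) && children.exists(c => !isEmptyLocal(c)) =>
      u.copy(children = children.filterNot(isEmptyLocal))
  }

  private def isEmptyLocal(plan: LogicalPlan): Boolean = plan match {
    case l: LocalRelation => l.data.isEmpty
    case _                => false
  }
}
{code}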

 

A side note: the expressions in the optimized plan are not fully simplified. For 
example, the expression 
{code:java}
((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || 
(id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4)){code}
could be further optimized into 
{code:java}
(isnotnull(id#4) && (id#4 = d)){code}
We may implement the following rules to simplify such expressions (a rough Scala 
sketch follows the list): 

1) convert every 'equal' operator into an 'in' operator, and then group all 'in' 
and 'not in' expressions by attribute reference 

    i) eq(a,val) => in(a,val::Nil)

2) merge those 'in' and 'not in' operators on the same attribute, e.g.

    i)  or(in(a,list1),in(a,list2)) => in(a, list1 ++ list2)

    ii) or(in(a,list1),not(in(a,list2))) => not(in(a, list2 -- list1)) 

    iii) and(in(a,list1),in(a,list2)) => in(a, list1 intersect list2) 

    iv) and(in(a,list1),not(in(a,list2))) => in(a, list1 -- list2) 

3) revert an 'in' operator back to 'equal' if there is only one element left in 
the list. 

    i) in(a,list) if list.size == 1 => eq(a,list.head) 
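A rough, self-contained Scala sketch of these three steps, written over a tiny hypothetical {{Pred}} model rather than real Catalyst expressions (a real rule would work on {{EqualTo}}/{{In}}/{{Not}} and do the grouping by attribute reference):
{code:java}
sealed trait Pred
case class InSet(attr: String, values: Set[String]) extends Pred      // attr IN (...)
case class NotInSet(attr: String, values: Set[String]) extends Pred   // attr NOT IN (...)

// 1) eq(a, v) => in(a, v :: Nil)
def fromEq(attr: String, value: String): Pred = InSet(attr, Set(value))

// 2) merge two predicates on the same attribute under OR / AND
def merge(op: String, p1: Pred, p2: Pred): Pred = (op, p1, p2) match {
  case ("or",  InSet(a, l1), InSet(_, l2))    => InSet(a, l1 ++ l2)          // i
  case ("or",  InSet(a, l1), NotInSet(_, l2)) => NotInSet(a, l2 -- l1)       // ii
  case ("and", InSet(a, l1), InSet(_, l2))    => InSet(a, l1 intersect l2)   // iii
  case ("and", InSet(a, l1), NotInSet(_, l2)) => InSet(a, l1 -- l2)          // iv
  case (o, p: NotInSet, q: InSet)             => merge(o, q, p)              // commute
  case _ => sys.error("combination not covered by this sketch")
}

// 3) in(a, list) with a single element goes back to an equality
def render(p: Pred): String = p match {
  case InSet(a, vs) if vs.size == 1 => s"$a = '${vs.head}'"
  case InSet(a, vs)                 => s"$a IN (${vs.mkString(",")})"
  case NotInSet(a, vs)              => s"$a NOT IN (${vs.mkString(",")})"
}

// Example from above: (id IN ('a','b','c','d')) AND (id NOT IN ('a','b','c')) => id = 'd'
println(render(merge("and", InSet("id", Set("a","b","c","d")), NotInSet("id", Set("a","b","c")))))
{code}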

 

 

  was:
Basically, the constraints of a union table could be turned empty if any 
subtable is turned into an empty local relation. The side effect is filter 
cannot be 

[jira] [Resolved] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens

2019-06-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27890.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Improve SQL parser error message when missing backquotes for identifiers with 
> hyphens
> -
>
> Key: SPARK-27890
> URL: https://issues.apache.org/jira/browse/SPARK-27890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
> Fix For: 3.0.0
>
>
> Current SQL parser's error message for hyphen-connected identifiers without 
> surrounding backquotes(e.g. {{hyphen-table}}) is confusing for end users. A 
> possible approach to tackle this is to explicitly capture these wrong usages 
> in the SQL parser. In this way, the end users can fix these errors more 
> quickly.
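A small example of the situation being described (assuming a table named {{hyphen-table}} exists; illustrative only):
{code:java}
// Identifiers containing hyphens must be quoted with backquotes; without them
// the statement fails to parse and the current error message is hard to act on.
spark.sql("SELECT * FROM hyphen-table")     // parse error with a confusing message
spark.sql("SELECT * FROM `hyphen-table`")   // works: the identifier is backquoted
{code}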



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens

2019-06-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-27890:
---

Assignee: Yesheng Ma

> Improve SQL parser error message when missing backquotes for identifiers with 
> hyphens
> -
>
> Key: SPARK-27890
> URL: https://issues.apache.org/jira/browse/SPARK-27890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
>
> Current SQL parser's error message for hyphen-connected identifiers without 
> surrounding backquotes(e.g. {{hyphen-table}}) is confusing for end users. A 
> possible approach to tackle this is to explicitly capture these wrong usages 
> in the SQL parser. In this way, the end users can fix these errors more 
> quickly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28103) Cannot infer filters from union table with empty local relation table properly

2019-06-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28103:

Target Version/s: 3.0.0

> Cannot infer filters from union table with empty local relation table properly
> --
>
> Key: SPARK-28103
> URL: https://issues.apache.org/jira/browse/SPARK-28103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.1
>Reporter: William Wong
>Priority: Major
>
> Basically, the constraints of a union table could be turned empty if any 
> subtable is turned into an empty local relation. The side effect is filter 
> cannot be inferred correctly (by InferFiltersFromConstrains) 
>  
>  We may reproduce the issue with the following setup:
> 1) Prepare two tables: 
>  
> {code:java}
> spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING 
> PARQUET");
> spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING 
> PARQUET");{code}
>  
> 2) Create a union view on table1. 
> {code:java}
> spark.sql("""
>      | CREATE VIEW partitioned_table_1 AS
>      | SELECT * FROM table1 WHERE id = 'a'
>      | UNION ALL
>      | SELECT * FROM table1 WHERE id = 'b'
>      | UNION ALL
>      | SELECT * FROM table1 WHERE id = 'c'
>      | UNION ALL
>      | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
>      | """.stripMargin){code}
>  
>  3) View the optimized plan of this SQL. The filter '[t2.id = 'a']' cannot be 
> inferred. We can see that the constraints of the left table are empty. 
> {code:java}
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
> [t1.id] = t2.id [t2.id] AND t1.id [t1.id] = 'a'").queryExecution.optimizedPlan
> res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Join Inner, (id#0 = id#4)
> :- Union
> :  :- Filter (isnotnull(id#0) && (id#0 = a))
> :  :  +- Relation[id#0,val#1] parquet
> :  :- LocalRelation , [id#0, val#1]
> :  :- LocalRelation , [id#0, val#1]
> :  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
> :     +- Relation[id#0,val#1] parquet
> +- Filter isnotnull(id#4)
>    +- Relation[id#4,val#5] parquet
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
> [t1.id] = t2.id [t2.id] AND t1.id [t1.id] = 
> 'a'").queryExecution.optimizedPlan.children(0).constraints
> res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
>  
> {code}
>  
> 4) Modified the query to avoid empty local relation. The filter '[td.id in 
> ('a','b','c','d')' is then inferred properly. The constraints of the left 
> table are not empty as well. 
> {code:java}
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
> [t1.id] = t2.id [t2.id] AND t1.id [t1.id] IN 
> ('a','b','c','d')").queryExecution.optimizedPlan
> res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Join Inner, (id#0 = id#4)
> :- Union
> :  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
> :  :  +- Relation[id#0,val#1] parquet
> :  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
> :  :  +- Relation[id#0,val#1] parquet
> :  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
> :  :  +- Relation[id#0,val#1] parquet
> :  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) && isnotnull(id#0))
> :     +- Relation[id#0,val#1] parquet
> +- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = 
> b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
>    +- Relation[id#4,val#5] parquet
>  
> scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
> [t1.id] = t2.id [t2.id] AND t1.id [t1.id] IN 
> ('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
> res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet = 
> Set(isnotnull(id#0), id#0 IN (a,b,c,d), id#0 = a) || (id#0 = b)) || (id#0 
> = c)) || NOT id#0 IN (a,b,c)))
> {code}
>  
> One of the possible workaround is create a rule to remove all empty local 
> relation from a union table. Or, when we convert a relation to into an empty 
> local relation, we should preserve those constraints in the empty local 
> relation as well. 
>  
> A side node. Expression in optimized plan is not well optimized. For example, 
> the expression 
> {code:java}
> ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || 
> (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4)){code}
> could be further optimized into 
> {code:java}
> (isnotnull(id#4) && (id = d)){code}
> We may implement another rule to 
> 1) convert all 'equal' operators into 'in' operator, and then group all 
> expressions by 'attribute reference' 
> 3) merge all those 'in' (or not in) operators 
> 4) revert in operator into 

[jira] [Resolved] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans

2019-06-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28096.
-
   Resolution: Fixed
 Assignee: Yesheng Ma
Fix Version/s: 3.0.0

> Lazy val performance pitfall in Spark SQL LogicalPlans
> --
>
> Key: SPARK-28096
> URL: https://issues.apache.org/jira/browse/SPARK-28096
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
> Fix For: 3.0.0
>
>
> The original {{references}} and {{validConstraints}} implementations in a few 
> QueryPlan and Expression classes are methods, which means unnecessary 
> re-computation can happen at times. This PR resolves this problem by making 
> these methods lazy vals.
> We benchmarked this optimization on TPC-DS queries whose planning time is 
> longer than 1s. In the benchmark, we warmed up the queries 5 iterations and 
> then took the average of 10 runs. Results showed that this micro-optimization 
> can improve the end-to-end planning time by 25%.
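As a schematic illustration of the change (not the actual Spark code; the names below are made up):
{code:java}
trait PlanLike {
  protected def computeConstraints(): Set[String]   // potentially expensive

  // before: a def is re-evaluated on every access
  def constraintsAsDef: Set[String] = computeConstraints()

  // after: a lazy val is computed at most once per plan node, then cached
  lazy val constraintsAsLazyVal: Set[String] = computeConstraints()
}
{code}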



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28103) Cannot infer filters from union table with empty local relation table properly

2019-06-18 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-28103:
-
Description: 
Basically, the constraints of a union table could be turned empty if any 
subtable is turned into an empty local relation. The side effect is filter 
cannot be inferred correctly (by InferFiltersFromConstrains) 

 
 We may reproduce the issue with the following setup:

1) Prepare two tables: 

 
{code:java}
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING 
PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING 
PARQUET");{code}
 

2) Create a union view on table1. 
{code:java}
spark.sql("""
     | CREATE VIEW partitioned_table_1 AS
     | SELECT * FROM table1 WHERE id = 'a'
     | UNION ALL
     | SELECT * FROM table1 WHERE id = 'b'
     | UNION ALL
     | SELECT * FROM table1 WHERE id = 'c'
     | UNION ALL
     | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
     | """.stripMargin){code}
 

 3) View the optimized plan of this SQL. The filter '[t2.id = 'a']' cannot be 
inferred. We can see that the constraints of the left table are empty. 
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] = 'a'").queryExecution.optimizedPlan

res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter (isnotnull(id#0) && (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation , [id#0, val#1]
:  :- LocalRelation , [id#0, val#1]
:  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter isnotnull(id#4)
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] = 
'a'").queryExecution.optimizedPlan.children(0).constraints
res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
 
{code}
 

4) Modified the query to avoid empty local relation. The filter '[td.id in 
('a','b','c','d')' is then inferred properly. The constraints of the left table 
are not empty as well. 
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] IN 
('a','b','c','d')").queryExecution.optimizedPlan
res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) && isnotnull(id#0))
:     +- Relation[id#0,val#1] parquet
+- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = 
b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
   +- Relation[id#4,val#5] parquet
 
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] IN 
('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet = 
Set(isnotnull(id#0), id#0 IN (a,b,c,d), id#0 = a) || (id#0 = b)) || (id#0 = 
c)) || NOT id#0 IN (a,b,c)))
{code}
 

One of the possible workaround is create a rule to remove all empty local 
relation from a union table. Or, when we convert a relation to into an empty 
local relation, we should preserve those constraints in the empty local 
relation as well. 

 

A side node. Expression in optimized plan is not well optimized. For example, 
the expression 
{code:java}
((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || 
(id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4)){code}
could be further optimized into 
{code:java}
(isnotnull(id#4) && (id = d)){code}
We may implement another rule to 

1) convert all 'equal' operators into 'in' operator, and then group all 
expressions by 'attribute reference' 

3) merge all those 'in' (or not in) operators 

4) revert in operator into 'equal' if there is only one element in the set. 

 

  was:
Basically, the constraints of a union table could be turned empty if any 
subtable is turned into an empty local relation. The side effect is filter 
cannot be inferred correctly (by InferFiltersFromConstrains) 

 
We may reproduce the issue with the following setup:

1) Prepare two tables: 

 
{code:java}
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING 
PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING 
PARQUET");{code}
 

2) Create a union view on table1. 
{code:java}
spark.sql("""
     | CREATE VIEW 

[jira] [Created] (SPARK-28103) Cannot infer filters from union table with empty local relation table properly

2019-06-18 Thread William Wong (JIRA)
William Wong created SPARK-28103:


 Summary: Cannot infer filters from union table with empty local 
relation table properly
 Key: SPARK-28103
 URL: https://issues.apache.org/jira/browse/SPARK-28103
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.1, 2.3.2
Reporter: William Wong


Basically, the constraints of a union table could be turned empty if any 
subtable is turned into an empty local relation. The side effect is filter 
cannot be inferred correctly (by InferFiltersFromConstrains) 

 
We may reproduce the issue with the following setup:

1) Prepare two tables: 

 
{code:java}
spark.sql("CREATE TABLE IF NOT EXISTS table1(id string, val string) USING 
PARQUET");
spark.sql("CREATE TABLE IF NOT EXISTS table2(id string, val string) USING 
PARQUET");{code}
 

2) Create a union view on table1. 
{code:java}
spark.sql("""
     | CREATE VIEW partitioned_table_1 AS
     | SELECT * FROM table1 WHERE id = 'a'
     | UNION ALL
     | SELECT * FROM table1 WHERE id = 'b'
     | UNION ALL
     | SELECT * FROM table1 WHERE id = 'c'
     | UNION ALL
     | SELECT * FROM table1 WHERE id NOT IN ('a','b','c')
     | """.stripMargin){code}
 

 3) View the optimized plan of this SQL. The filter {{t2.id = 'a'}} cannot be 
inferred. We can see that the constraints of the left table are empty.

 
{code:java}
 
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] = 'a'").queryExecution.optimizedPlan

res39: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter (isnotnull(id#0) && (id#0 = a))
:  :  +- Relation[id#0,val#1] parquet
:  :- LocalRelation , [id#0, val#1]
:  :- LocalRelation , [id#0, val#1]
:  +- Filter ((isnotnull(id#0) && NOT id#0 IN (a,b,c)) && (id#0 = a))
:     +- Relation[id#0,val#1] parquet
+- Filter isnotnull(id#4)
   +- Relation[id#4,val#5] parquet

scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] = 
'a'").queryExecution.optimizedPlan.children(0).constraints
res40: org.apache.spark.sql.catalyst.expressions.ExpressionSet = Set()
 
{code}
 

4) Modify the query to avoid the empty local relations. The filter 
{{t2.id IN ('a','b','c','d')}} is then inferred properly, and the constraints of 
the left table are no longer empty.

 

 
{code:java}
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] IN 
('a','b','c','d')").queryExecution.optimizedPlan
res42: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Join Inner, (id#0 = id#4)
:- Union
:  :- Filter ((isnotnull(id#0) && (id#0 = a)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = b)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  :- Filter ((isnotnull(id#0) && (id#0 = c)) && id#0 IN (a,b,c,d))
:  :  +- Relation[id#0,val#1] parquet
:  +- Filter ((NOT id#0 IN (a,b,c) && id#0 IN (a,b,c,d)) && isnotnull(id#0))
:     +- Relation[id#0,val#1] parquet
+- Filter ((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = 
b)) || (id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4))
   +- Relation[id#4,val#5] parquet
 
scala> spark.sql("SELECT * FROM partitioned_table_1 t1, table2 t2 WHERE t1.id 
[t1.id] = t2.id [t2.id] AND t1.id [t1.id] IN 
('a','b','c','d')").queryExecution.optimizedPlan.children(0).constraints
res44: org.apache.spark.sql.catalyst.expressions.ExpressionSet = 
Set(isnotnull(id#0), id#0 IN (a,b,c,d), id#0 = a) || (id#0 = b)) || (id#0 = 
c)) || NOT id#0 IN (a,b,c)))
{code}
 

 

One of the possible workaround is create a rule to remove all empty local 
relation from a union table. Or, when we convert a relation to into an empty 
local relation, we should preserve those constraints in the empty local 
relation as well. 

 

A side node. Expression in optimized plan is not well optimized. For example, 
the expression 
{code:java}
((id#4 IN (a,b,c,d) && ((isnotnull(id#4) && (((id#4 = a) || (id#4 = b)) || 
(id#4 = c))) || NOT id#4 IN (a,b,c))) && isnotnull(id#4)){code}
could be further optimized into 
{code:java}
(isnotnull(id#4) && (id = d)){code}
We may implement another rule to 

1) convert all 'equal' operators into 'in' operator, and then group all 
expressions by 'attribute reference' 

3) merge all those 'in' (or 

[jira] [Resolved] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`

2019-06-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27105.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24068
[https://github.com/apache/spark/pull/24068]

> Prevent exponential complexity in ORC `createFilter`  
> --
>
> Key: SPARK-27105
> URL: https://issues.apache.org/jira/browse/SPARK-27105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ivan Vergiliev
>Assignee: Ivan Vergiliev
>Priority: Major
>  Labels: performance
> Fix For: 3.0.0
>
>
> `OrcFilters.createFilters` currently has complexity that's exponential in the 
> height of the filter tree. There are multiple places in Spark that try to 
> prevent the generation of skewed trees so as to not trigger this behaviour, 
> for example:
> - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` 
> combines a number of binary logical expressions into a balanced tree.
> - https://github.com/apache/spark/pull/22313 introduced a change to 
> `OrcFilters` to create a balanced tree instead of a skewed tree.
> However, the underlying exponential behaviour can still be triggered by code 
> paths that don't go through any of the tree balancing methods. For example, 
> if one generates a tree of `Column`s directly in user code, there's nothing 
> in Spark that automatically balances that tree and, hence, skewed trees hit 
> the exponential behaviour. We have hit this in production with jobs 
> mysteriously taking hours on the Spark driver with no worker activity, with 
> as few as ~30 OR filters.
> I have a fix locally that makes the underlying logic have linear complexity 
> instead of exponential complexity. With this fix, the code can handle 
> thousands of filters in milliseconds. I'll send a PR with the fix soon.
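As a user-side illustration of the balancing idea mentioned in the description ({{balancedOr}} is a hypothetical helper, not a Spark API):
{code:java}
import org.apache.spark.sql.Column

// Combine predicates pairwise into a roughly balanced OR tree, so the filter
// tree's height is O(log n) instead of O(n) for a left-deep reduce.
def balancedOr(preds: Seq[Column]): Column = {
  require(preds.nonEmpty, "need at least one predicate")
  if (preds.size == 1) preds.head
  else {
    val (left, right) = preds.splitAt(preds.size / 2)
    balancedOr(left) || balancedOr(right)
  }
}

// e.g. balancedOr(keys.map(k => df("id") === k))
// instead of keys.map(k => df("id") === k).reduce(_ || _), which builds a skewed tree.
{code}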



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`

2019-06-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27105:
---

Assignee: Ivan Vergiliev

> Prevent exponential complexity in ORC `createFilter`  
> --
>
> Key: SPARK-27105
> URL: https://issues.apache.org/jira/browse/SPARK-27105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ivan Vergiliev
>Assignee: Ivan Vergiliev
>Priority: Major
>  Labels: performance
>
> `OrcFilters.createFilters` currently has complexity that's exponential in the 
> height of the filter tree. There are multiple places in Spark that try to 
> prevent the generation of skewed trees so as to not trigger this behaviour, 
> for example:
> - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` 
> combines a number of binary logical expressions into a balanced tree.
> - https://github.com/apache/spark/pull/22313 introduced a change to 
> `OrcFilters` to create a balanced tree instead of a skewed tree.
> However, the underlying exponential behaviour can still be triggered by code 
> paths that don't go through any of the tree balancing methods. For example, 
> if one generates a tree of `Column`s directly in user code, there's nothing 
> in Spark that automatically balances that tree and, hence, skewed trees hit 
> the exponential behaviour. We have hit this in production with jobs 
> mysteriously taking hours on the Spark driver with no worker activity, with 
> as few as ~30 OR filters.
> I have a fix locally that makes the underlying logic have linear complexity 
> instead of exponential complexity. With this fix, the code can handle 
> thousands of filters in milliseconds. I'll send a PR with the fix soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28102) Add configuration for selecting LZ4 implementation (safe, unsafe, JNI)

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28102:


Assignee: Josh Rosen  (was: Apache Spark)

> Add configuration for selecting LZ4 implementation (safe, unsafe, JNI)
> --
>
> Key: SPARK-28102
> URL: https://issues.apache.org/jira/browse/SPARK-28102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark's use of {{lz4-java}} ends up calling {{Lz4Factory.fastestInstance}}, 
> which attempts to load JNI libraries and falls back on Java implementations 
> in case the JNI library cannot be loaded or initialized.
> I run Spark in a configuration where the JNI libraries don't work, so I'd 
> like to configure LZ4 to not even attempt to use JNI code: if the JNI library 
> loads but cannot be initialized then the fallback code path involves catching 
> an exception and this is slow because the exception is thrown under a static 
> initializer lock (leading to significant lock contention because the filling 
> of stacktraces is done while holding this lock). 
> I propose to introduce a {{spark.io.compression.lz4.factory}} configuration 
> for selecting the LZ4 implementation, allowing users to disable the use of 
> the JNI library without having to recompile Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28102) Add configuration for selecting LZ4 implementation (safe, unsafe, JNI)

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28102:


Assignee: Apache Spark  (was: Josh Rosen)

> Add configuration for selecting LZ4 implementation (safe, unsafe, JNI)
> --
>
> Key: SPARK-28102
> URL: https://issues.apache.org/jira/browse/SPARK-28102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Major
>
> Spark's use of {{lz4-java}} ends up calling {{Lz4Factory.fastestInstance}}, 
> which attempts to load JNI libraries and falls back on Java implementations 
> in case the JNI library cannot be loaded or initialized.
> I run Spark in a configuration where the JNI libraries don't work, so I'd 
> like to configure LZ4 to not even attempt to use JNI code: if the JNI library 
> loads but cannot be initialized then the fallback code path involves catching 
> an exception and this is slow because the exception is thrown under a static 
> initializer lock (leading to significant lock contention because the filling 
> of stacktraces is done while holding this lock). 
> I propose to introduce a {{spark.io.compression.lz4.factory}} configuration 
> for selecting the LZ4 implementation, allowing users to disable the use of 
> the JNI library without having to recompile Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28102) Add configuration for selecting LZ4 implementation (safe, unsafe, JNI)

2019-06-18 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-28102:
--

 Summary: Add configuration for selecting LZ4 implementation (safe, 
unsafe, JNI)
 Key: SPARK-28102
 URL: https://issues.apache.org/jira/browse/SPARK-28102
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark's use of {{lz4-java}} ends up calling {{Lz4Factory.fastestInstance}}, 
which attempts to load JNI libraries and falls back on Java implementations in 
case the JNI library cannot be loaded or initialized.

I run Spark in a configuration where the JNI libraries don't work, so I'd like 
to configure LZ4 to not even attempt to use JNI code: if the JNI library loads 
but cannot be initialized then the fallback code path involves catching an 
exception and this is slow because the exception is thrown under a static 
initializer lock (leading to significant lock contention because the filling of 
stacktraces is done while holding this lock). 

I propose to introduce a {{spark.io.compression.lz4.factory}} configuration for 
selecting the LZ4 implementation, allowing users to disable the use of the JNI 
library without having to recompile Spark.
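A rough sketch of how the proposed configuration could map onto lz4-java factories (the accepted values follow the title's safe/unsafe/JNI split; the exact configuration semantics in Spark may differ):
{code:java}
import net.jpountz.lz4.LZ4Factory

// Hypothetical helper: pick an lz4-java factory from the proposed config value,
// so "safe"/"unsafe" never touch the JNI code path at all.
def lz4Factory(value: String): LZ4Factory = value.toLowerCase match {
  case "jni"    => LZ4Factory.nativeInstance()   // JNI-backed implementation
  case "unsafe" => LZ4Factory.unsafeInstance()   // pure Java, uses sun.misc.Unsafe
  case "safe"   => LZ4Factory.safeInstance()     // pure Java, no Unsafe
  case other    => sys.error(s"unknown spark.io.compression.lz4.factory value: $other")
}

// e.g. lz4Factory(sparkConf.get("spark.io.compression.lz4.factory"))
{code}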



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28075) String Functions: Enhance TRIM function

2019-06-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28075:
---

Assignee: Yuming Wang

> String Functions: Enhance TRIM function
> ---
>
> Key: SPARK-28075
> URL: https://issues.apache.org/jira/browse/SPARK-28075
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Add support {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.
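For example (illustrative queries using the syntax being added):
{code:java}
// Standard-SQL trim forms: strip whitespace from both ends, the start, or the end.
spark.sql("SELECT TRIM(BOTH FROM '  Spark  ')").show()      // 'Spark'
spark.sql("SELECT TRIM(LEADING FROM '  Spark  ')").show()   // 'Spark  '
spark.sql("SELECT TRIM(TRAILING FROM '  Spark  ')").show()  // '  Spark'
{code}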



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28075) String Functions: Enhance TRIM function

2019-06-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28075.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24891
[https://github.com/apache/spark/pull/24891]

> String Functions: Enhance TRIM function
> ---
>
> Key: SPARK-28075
> URL: https://issues.apache.org/jira/browse/SPARK-28075
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Add support {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27716) Complete the transactions support for part of jdbc datasource operations.

2019-06-18 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27716.
---
Resolution: Won't Fix

> Complete the transactions support for part of jdbc datasource operations.
> -
>
> Key: SPARK-27716
> URL: https://issues.apache.org/jira/browse/SPARK-27716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: feiwang
>Priority: Major
>  Labels: pull-request-available
>
> With the jdbc datasource, we can save a rdd to the database.
> The comments for the function saveTable is that.
> {code:java}
>   /**
>* Saves the RDD to the database in a single transaction.
>*/
>   def saveTable(
>   df: DataFrame,
>   tableSchema: Option[StructType],
>   isCaseSensitive: Boolean,
>   options: JdbcOptionsInWrite)
> {code}
> In fact, it is not true.
> The savePartition operation is in a single transaction but the saveTable 
> operation is not in a single transaction.
> There are several cases of data transmission:
> case1: Append data to origin existed gptable.
> case2: Overwrite origin gptable, but the table is a cascadingTruncateTable, 
> so we can not drop the gptable, we have to truncate it and append data.
> case3: Overwrite origin existed table and the table is not a 
> cascadingTruncateTable, so we can drop it first.
> case4: For an unexisted table, create and transmit data.
> In this PR, I add a transactions support for case3 and case4.
> For case3 and case4, we can transmit the rdd to a temp table at first.
> We use an accumulator to record the suceessful savePartition operations.
> At last, we compare the value of accumulator with dataFrame's partitionNum.
> If all the savePartition operations are successful, we drop the origin table 
> if it exists, then we alter the temp table rename to origin table.
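A very rough sketch of the accumulator idea described above (hypothetical helper, not the actual patch; among other things, task retries could over-count the accumulator, which a real implementation would have to guard against):
{code:java}
import org.apache.spark.sql.{DataFrame, Row}

// Write each partition to a temp table, count successful partitions with an
// accumulator, and report whether every partition succeeded. Only then would
// the caller drop the original table and rename the temp table to it.
def writeAllOrNothing(df: DataFrame)(savePartition: Iterator[Row] => Unit): Boolean = {
  val succeeded = df.sparkSession.sparkContext.longAccumulator("succeededPartitions")
  val expected  = df.rdd.getNumPartitions
  val writeOnePartition: Iterator[Row] => Unit = { rows =>
    savePartition(rows)   // e.g. a JDBC batch insert into the temp table
    succeeded.add(1L)
  }
  df.foreachPartition(writeOnePartition)
  succeeded.sum == expected
}
{code}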



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome

2019-06-18 Thread Andrew Leverentz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Leverentz updated SPARK-28085:
-
Description: 
In Chrome version 75, URLs in the Scala API documentation are not working 
properly, which makes them difficult to bookmark.

For example, URLs like the following get redirected to a generic "root" package 
page:

[https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]

[https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]

Here's the URL that I get redirected to:

[https://spark.apache.org/docs/latest/api/scala/index.html#package]

This issue seems to have appeared between versions 74 and 75 of Chrome, but the 
documentation URLs still work in Safari.  I suspect that this has something to 
do with security-related changes to how Chrome 75 handles frames and/or 
redirects.  I've reported this issue to the Chrome team via the in-browser help 
menu, but I don't have any visibility into their response, so it's not clear 
whether they'll consider this a bug or "working as intended".

  was:
In Chrome version 75, URLs in the Scala API documentation are not working 
properly, which makes them difficult to bookmark.

For example, URLs like the following get redirected to a generic "root" package 
page:

[https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]

[https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]

Here's the URL that I get :

[https://spark.apache.org/docs/latest/api/scala/index.html#package]

This issue seems to have appeared between versions 74 and 75 of Chrome, but the 
documentation URLs still work in Safari.  I suspect that this has something to 
do with security-related changes to how Chrome 75 handles frames and/or 
redirects.  I've reported this issue to the Chrome team via the in-browser help 
menu, but I don't have any visibility into their response, so it's not clear 
whether they'll consider this a bug or "working as intended".


> Spark Scala API documentation URLs not working properly in Chrome
> -
>
> Key: SPARK-28085
> URL: https://issues.apache.org/jira/browse/SPARK-28085
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: Andrew Leverentz
>Priority: Minor
>
> In Chrome version 75, URLs in the Scala API documentation are not working 
> properly, which makes them difficult to bookmark.
> For example, URLs like the following get redirected to a generic "root" 
> package page:
> [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]
> [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]
> Here's the URL that I get redirected to:
> [https://spark.apache.org/docs/latest/api/scala/index.html#package]
> This issue seems to have appeared between versions 74 and 75 of Chrome, but 
> the documentation URLs still work in Safari.  I suspect that this has 
> something to do with security-related changes to how Chrome 75 handles frames 
> and/or redirects.  I've reported this issue to the Chrome team via the 
> in-browser help menu, but I don't have any visibility into their response, so 
> it's not clear whether they'll consider this a bug or "working as intended".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28081) word2vec 'large' count value too low for very large corpora

2019-06-18 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28081.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   2.3.4
   3.0.0

Issue resolved by pull request 24893
[https://github.com/apache/spark/pull/24893]

> word2vec 'large' count value too low for very large corpora
> ---
>
> Key: SPARK-28081
> URL: https://issues.apache.org/jira/browse/SPARK-28081
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> The word2vec implementation operates on word counts, and uses a hard-coded 
> value of 1e9 to mean "a very large count, larger than any actual count". 
> However this causes the logic to fail if, in fact, a large corpora has some 
> words that really do occur more than this many times. We can probably improve 
> the implementation to better handle very large counts in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs

2019-06-18 Thread Xingbo Jiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-27823.
--
Resolution: Fixed

> Add an abstraction layer for accelerator resource handling to avoid 
> manipulating raw confs
> --
>
> Key: SPARK-27823
> URL: https://issues.apache.org/jira/browse/SPARK-27823
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> In SPARK-27488, we extract resource requests and allocation by parsing raw 
> Spark confs. This hurts readability because we didn't have the abstraction at 
> resource level. After we merge the core changes, we should do a refactoring 
> and make the code more readable.
> See https://github.com/apache/spark/pull/24615#issuecomment-494580663.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuite.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28101:
--
Summary: Fix Flaky Test: `InputStreamsSuite.Modified files are correctly 
detected` in JDK9+  (was: Fix Flaky Test: `InputStreamsSuit.Modified files are 
correctly detected` in JDK9+)

> Fix Flaky Test: `InputStreamsSuite.Modified files are correctly detected` in 
> JDK9+
> --
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Attachments: error.png
>
>
>  !error.png|width=100%! 
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28101:


Assignee: (was: Apache Spark)

> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Attachments: error.png
>
>
>  !error.png|width=100%! 
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28101:


Assignee: Apache Spark

> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
> Attachments: error.png
>
>
>  !error.png|width=100%! 
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27815) do not leak SaveMode to file source v2

2019-06-18 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867136#comment-16867136
 ] 

Tony Zhang commented on SPARK-27815:


[~cloud_fan] Hi Wenchen, do you think the solution mentioned below is viable? 

Create a new V2WriteCommand case class and its Exec named maybe 
_OverwriteByQueryId_ to replace WriteToDataSourceV2, which accepts a QueryId so 
that tests can pass.

Or should we keep WriteToDataSourceV2?

> do not leak SaveMode to file source v2
> --
>
> Key: SPARK-27815
> URL: https://issues.apache.org/jira/browse/SPARK-27815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> Currently there is a hack in `DataFrameWriter`, which passes `SaveMode` to 
> file source v2. This should be removed and file source v2 should not accept 
> SaveMode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28101:
--
Description: 
 !error.png|height=250,width=250! 

{code}
$ build/sbt "streaming/testOnly *.InputStreamsSuite"
[info] - Modified files are correctly detected. *** FAILED *** (134 
milliseconds)
[info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
[info]   org.scalatest.exceptions.TestFailedException:
{code}

{code}
Getting new files for time 1560896662000, ignoring files older than 
1560896659679
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
 not selected as mod time 1560896662679 > current time 1560896662000
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
 ignored as mod time 1560896657679 <= ignore time 1560896659679
Finding new files took 0 ms
New files at time 1560896662000 ms:
{code}

  was:
 !error.png! 

{code}
$ build/sbt "streaming/testOnly *.InputStreamsSuite"
[info] - Modified files are correctly detected. *** FAILED *** (134 
milliseconds)
[info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
[info]   org.scalatest.exceptions.TestFailedException:
{code}

{code}
Getting new files for time 1560896662000, ignoring files older than 
1560896659679
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
 not selected as mod time 1560896662679 > current time 1560896662000
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
 ignored as mod time 1560896657679 <= ignore time 1560896659679
Finding new files took 0 ms
New files at time 1560896662000 ms:
{code}


> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Attachments: error.png
>
>
>  !error.png|height=250,width=250! 
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28101:
--
Description: 
 !error.png|width=100%! 

{code}
$ build/sbt "streaming/testOnly *.InputStreamsSuite"
[info] - Modified files are correctly detected. *** FAILED *** (134 
milliseconds)
[info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
[info]   org.scalatest.exceptions.TestFailedException:
{code}

{code}
Getting new files for time 1560896662000, ignoring files older than 
1560896659679
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
 not selected as mod time 1560896662679 > current time 1560896662000
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
 ignored as mod time 1560896657679 <= ignore time 1560896659679
Finding new files took 0 ms
New files at time 1560896662000 ms:
{code}

  was:
 !error.png|height=250,width=250! 

{code}
$ build/sbt "streaming/testOnly *.InputStreamsSuite"
[info] - Modified files are correctly detected. *** FAILED *** (134 
milliseconds)
[info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
[info]   org.scalatest.exceptions.TestFailedException:
{code}

{code}
Getting new files for time 1560896662000, ignoring files older than 
1560896659679
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
 not selected as mod time 1560896662679 > current time 1560896662000
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
 ignored as mod time 1560896657679 <= ignore time 1560896659679
Finding new files took 0 ms
New files at time 1560896662000 ms:
{code}


> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Attachments: error.png
>
>
>  !error.png|width=100%! 
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28101:
--
Attachment: error.png

> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Attachments: error.png
>
>
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}






[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28101:
--
Description: 
 !error.png! 

{code}
$ build/sbt "streaming/testOnly *.InputStreamsSuite"
[info] - Modified files are correctly detected. *** FAILED *** (134 
milliseconds)
[info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
[info]   org.scalatest.exceptions.TestFailedException:
{code}

{code}
Getting new files for time 1560896662000, ignoring files older than 
1560896659679
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
 not selected as mod time 1560896662679 > current time 1560896662000
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
 ignored as mod time 1560896657679 <= ignore time 1560896659679
Finding new files took 0 ms
New files at time 1560896662000 ms:
{code}

  was:
{code}
$ build/sbt "streaming/testOnly *.InputStreamsSuite"
[info] - Modified files are correctly detected. *** FAILED *** (134 
milliseconds)
[info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
[info]   org.scalatest.exceptions.TestFailedException:
{code}

{code}
Getting new files for time 1560896662000, ignoring files older than 
1560896659679
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
 not selected as mod time 1560896662679 > current time 1560896662000
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
 ignored as mod time 1560896657679 <= ignore time 1560896659679
Finding new files took 0 ms
New files at time 1560896662000 ms:
{code}


> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Attachments: error.png
>
>
>  !error.png! 
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}






[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28101:
--
Summary: Fix Flaky Test: `InputStreamsSuit.Modified files are correctly 
detected` in JDK9+  (was: Fix Flaky Test: `InputStreamsSuit.Modified files are 
correctly detected`)

> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}






[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28101:
--
Attachment: (was: Screen Shot 2019-06-18 at 4.41.11 PM.png)

> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}






[jira] [Created] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected`

2019-06-18 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28101:
-

 Summary: Fix Flaky Test: `InputStreamsSuit.Modified files are 
correctly detected`
 Key: SPARK-28101
 URL: https://issues.apache.org/jira/browse/SPARK-28101
 Project: Spark
  Issue Type: Sub-task
  Components: DStreams, Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


{code}
$ build/sbt "streaming/testOnly *.InputStreamsSuite"
[info] - Modified files are correctly detected. *** FAILED *** (134 
milliseconds)
[info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
[info]   org.scalatest.exceptions.TestFailedException:
{code}

{code}
Getting new files for time 1560896662000, ignoring files older than 
1560896659679
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
 not selected as mod time 1560896662679 > current time 1560896662000
file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
 ignored as mod time 1560896657679 <= ignore time 1560896659679
Finding new files took 0 ms
New files at time 1560896662000 ms:
{code}
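
The log above already shows the timing involved: the renamed file's modification 
time (1560896662679) lands after the batch time (1560896662000), so it is not 
picked up in that batch. Below is a minimal sketch of the selection rule the log 
suggests; the names {{isSelected}}, {{modTime}}, {{batchTime}} and {{ignoreTime}} 
are illustrative assumptions, not Spark's actual internals.

{code:java}
// Illustrative predicate only: a file is selected when its modification time is
// newer than the ignore threshold but not newer than the current batch time.
def isSelected(modTime: Long, batchTime: Long, ignoreTime: Long): Boolean =
  modTime > ignoreTime && modTime <= batchTime

// renamed.txt: 1560896662679 > 1560896662000, so it is not selected in this batch
isSelected(1560896662679L, 1560896662000L, 1560896659679L)  // false
// existing: 1560896657679 <= 1560896659679, so it is ignored as too old
isSelected(1560896657679L, 1560896662000L, 1560896659679L)  // false
{code}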






[jira] [Updated] (SPARK-28101) Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28101:
--
Attachment: Screen Shot 2019-06-18 at 4.41.11 PM.png

> Fix Flaky Test: `InputStreamsSuit.Modified files are correctly detected` in 
> JDK9+
> -
>
> Key: SPARK-28101
> URL: https://issues.apache.org/jira/browse/SPARK-28101
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> {code}
> $ build/sbt "streaming/testOnly *.InputStreamsSuite"
> [info] - Modified files are correctly detected. *** FAILED *** (134 
> milliseconds)
> [info]   Set("renamed") did not equal Set() (InputStreamsSuite.scala:312)
> [info]   org.scalatest.exceptions.TestFailedException:
> {code}
> {code}
> Getting new files for time 1560896662000, ignoring files older than 
> 1560896659679
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/renamed.txt
>  not selected as mod time 1560896662679 > current time 1560896662000
> file:/Users/dongjoon/PRS/SPARK-STEAM-TEST/target/tmp/spark-f34c23f6-9fb4-4ded-87f3-8cdfb57d85a6/streaming/subdir/existing
>  ignored as mod time 1560896657679 <= ignore time 1560896659679
> Finding new files took 0 ms
> New files at time 1560896662000 ms:
> {code}






[jira] [Resolved] (SPARK-28039) Add float4.sql

2019-06-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28039.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

> Add float4.sql
> --
>
> Key: SPARK-28039
> URL: https://issues.apache.org/jira/browse/SPARK-28039
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql.
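
To make the scope concrete, the following is a rough, hypothetical illustration of 
the kind of statements such a port exercises through Spark SQL; the values and the 
{{float4_tbl}} view name are made up for this example and are not copied from the 
PostgreSQL file.

{code:java}
// Illustrative only; the actual test cases follow float4.sql.
spark.sql(
  "CREATE OR REPLACE TEMPORARY VIEW float4_tbl AS " +
  "SELECT CAST(v AS FLOAT) AS f1 FROM VALUES ('0.0'), ('1004.30'), ('-34.84') AS t(v)")
spark.sql("SELECT f1, f1 * 2 AS doubled FROM float4_tbl WHERE f1 > 0").show()
{code}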






[jira] [Assigned] (SPARK-28088) String Functions: Enhance LPAD/RPAD function

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28088:
-

Assignee: Yuming Wang

> String Functions: Enhance LPAD/RPAD function
> 
>
> Key: SPARK-28088
> URL: https://issues.apache.org/jira/browse/SPARK-28088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Enhance LPAD/RPAD function to make {{pad}} parameter optional.
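
A rough illustration of what the enhancement enables, assuming the omitted {{pad}} 
argument defaults to a single space as it does in PostgreSQL (a sketch of the 
intent, not the merged behaviour verbatim):

{code:java}
// Before the change, both the length and the pad string were required:
spark.sql("SELECT lpad('hi', 5, ' '), rpad('hi', 5, '*')").show()

// With the enhancement, the pad string can be omitted and a space is assumed:
spark.sql("SELECT lpad('hi', 5), rpad('hi', 5)").show()
{code}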






[jira] [Resolved] (SPARK-28088) String Functions: Enhance LPAD/RPAD function

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28088.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24899
[https://github.com/apache/spark/pull/24899]

> String Functions: Enhance LPAD/RPAD function
> 
>
> Key: SPARK-28088
> URL: https://issues.apache.org/jira/browse/SPARK-28088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Enhance LPAD/RPAD function to make {{pad}} parameter optional.






[jira] [Updated] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB

2019-06-18 Thread Eric Maynard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Maynard updated SPARK-24936:
-
Description: 
*strong text*After SPARK-24297, spark will try to fetch shuffle blocks to disk 
if they're over 2GB.  However, this will fail with an external shuffle service 
running < spark 2.2, with an unhelpful error message like:

{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
(TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
, xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.UnsupportedOperationException
at 
org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}

We can't do anything to make the shuffle succeed, in this situation, but we 
should fail with a better error message.

  was:
After SPARK-24297, spark will try to fetch shuffle blocks to disk if their over 
2GB.  However, this will fail with an external shuffle service running < spark 
2.2, with an unhelpful error message like:

{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
(TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
, xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.UnsupportedOperationException
at 
org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}

We can't do anything to make the shuffle succeed, in this situation, but we 
should fail with a better error message.


> Better error message when trying a shuffle fetch over 2 GB
> --
>
> Key: SPARK-24936
> URL: https://issues.apache.org/jira/browse/SPARK-24936
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> *strong text*After SPARK-24297, spark will try to fetch shuffle blocks to 
> disk if they're over 2GB.  However, this will fail with an external shuffle 
> service running < spark 2.2, with an unhelpful error message like:
> {noformat}
> 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
> (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
> , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
> org.apache.spark.shuffle.FetchFailedException: 
> java.lang.UnsupportedOperationException
> at 
> org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
> ...
> {noformat}
> We can't do anything to make the shuffle succeed, in this situation, but we 
> should fail with a better error message.






[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2019-06-18 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866973#comment-16866973
 ] 

M. Le Bihan edited comment on SPARK-18105 at 6/18/19 8:45 PM:
--

My trick eventually didn't succeed, and I fell back into the bug again.

I've attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala but 
received this kind of stacktrace:

{code:log}
2019-06-18 20:43:54.747  INFO 1539 --- [er for task 547] 
o.a.s.s.ShuffleBlockFetcherIterator  : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] 
org.apache.spark.executor.Executor   : Exception in task 93.0 in stage 4.2 
(TID 547)

java.lang.NullPointerException: null
at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.Task.run(Task.scala:121) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[na:1.8.0_212]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[na:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]

{code}

This issue 18105 prevents using Spark at all. 

Spark, with 300 open and in-progress (but stalling) issues, is becoming less and 
less reliable each day.
I'm about to send a message on the dev forum to ask if developers can stop 
implementing new features until they have corrected the issues in the features 
they once wrote.


was (Author: mlebihan):
My trick eventually didn't succeed. And I fall back into the bug again.

I've apttemted to upgrade from spark-xxx_2.11 to spark_xxx.2.12 for scala but 
received this kind of stacktrace :

{code:log}
2019-06-18 20:43:54.747  INFO 1539 --- [er for task 547] 
o.a.s.s.ShuffleBlockFetcherIterator  : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] 
org.apache.spark.executor.Executor   : Exception in task 93.0 in stage 4.2 
(TID 547)

java.lang.NullPointerException: null
at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.Task.run(Task.scala:121) 

[jira] [Updated] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB

2019-06-18 Thread Eric Maynard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Maynard updated SPARK-24936:
-
Description: 
After SPARK-24297, spark will try to fetch shuffle blocks to disk if they're over 
2GB.  However, this will fail with an external shuffle service running < spark 
2.2, with an unhelpful error message like:

{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
(TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
, xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.UnsupportedOperationException
at 
org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}

We can't do anything to make the shuffle succeed, in this situation, but we 
should fail with a better error message.

  was:
*strong text*After SPARK-24297, spark will try to fetch shuffle blocks to disk 
if their over 2GB.  However, this will fail with an external shuffle service 
running < spark 2.2, with an unhelpful error message like:

{noformat}
18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
(TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
, xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.UnsupportedOperationException
at 
org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
...
{noformat}

We can't do anything to make the shuffle succeed, in this situation, but we 
should fail with a better error message.


> Better error message when trying a shuffle fetch over 2 GB
> --
>
> Key: SPARK-24936
> URL: https://issues.apache.org/jira/browse/SPARK-24936
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> After SPARK-24297, spark will try to fetch shuffle blocks to disk if they're 
> over 2GB.  However, this will fail with an external shuffle service running < 
> spark 2.2, with an unhelpful error message like:
> {noformat}
> 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
> (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
> , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
> org.apache.spark.shuffle.FetchFailedException: 
> java.lang.UnsupportedOperationException
> at 
> org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
> ...
> {noformat}
> We can't do anything to make the shuffle succeed, in this situation, but we 
> should fail with a better error message.






[jira] [Created] (SPARK-28100) Unable to override JDBC types for ByteType when providing a custom JdbcDialect for Postgres

2019-06-18 Thread Seth Fitzsimmons (JIRA)
Seth Fitzsimmons created SPARK-28100:


 Summary: Unable to override JDBC types for ByteType when providing 
a custom JdbcDialect for Postgres
 Key: SPARK-28100
 URL: https://issues.apache.org/jira/browse/SPARK-28100
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.3
Reporter: Seth Fitzsimmons


{{AggregatedDialect}} defines [{{getJDBCType}} 
|https://github.com/apache/spark/blob/1217996f1574f758d81c4e3846452d24b35b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala#L41-L43]
 as:

{{dialects.flatMap(_.getJDBCType(dt)).headOption}}

However, when attempting to write a {{ByteType}}, {{PostgresDialect}} currently 
throws: 
https://github.com/apache/spark/blob/1217996f1574f758d81c4e3846452d24b35b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L83

This prevents any other registered dialects from providing a JDBC type for 
{{ByteType}} as the last one (Spark's default {{PostgresDialect}}) will raise 
an uncaught exception.

https://github.com/apache/spark/pull/24845 addresses this by providing a 
mapping, but the general problem holds if any {{JdbcDialect}} implementations 
ever throw in {{getJDBCType}} for any reason.
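
For context, the snippet below sketches the kind of custom dialect the reporter 
describes registering; the object name is made up, and, as the description notes, 
the built-in {{PostgresDialect}} inside the resulting {{AggregatedDialect}} can 
still throw while the {{flatMap}} quoted above evaluates every dialect's 
{{getJDBCType}}.

{code:java}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{ByteType, DataType}

// Hypothetical custom dialect mapping Spark's ByteType to a Postgres SMALLINT.
object ByteAwarePostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("SMALLINT", Types.SMALLINT))
    case _        => None
  }
}

// Registration wraps this dialect together with the built-in PostgresDialect in an
// AggregatedDialect, which is where the flatMap quoted above runs.
JdbcDialects.registerDialect(ByteAwarePostgresDialect)
{code}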






[jira] [Created] (SPARK-28099) Assertion when querying unpartitioned Hive table with partition-like naming

2019-06-18 Thread Douglas Drinka (JIRA)
Douglas Drinka created SPARK-28099:
--

 Summary: Assertion when querying unpartitioned Hive table with 
partition-like naming
 Key: SPARK-28099
 URL: https://issues.apache.org/jira/browse/SPARK-28099
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Douglas Drinka


{code:java}
val testData = List(1,2,3,4,5)
val dataFrame = testData.toDF()
dataFrame
.coalesce(1)
.write
.mode(SaveMode.Overwrite)
.format("orc")
.option("compression", "zlib")
.save("s3://ddrinka.sparkbug/testFail/dir1=1/dir2=2/")

spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.testFail")
spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.testFail (val INT) STORED AS 
ORC LOCATION 's3://ddrinka.sparkbug/testFail/'")

val queryResponse = spark.sql("SELECT * FROM ddrinka_sparkbug.testFail")
//Throws AssertionError
//at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:214){code}
It looks like the native ORC reader is creating virtual columns named dir1 and 
dir2, which don't exist in the Hive table. [The 
assertion|https://github.com/apache/spark/blob/c0297dedd829a92cca920ab8983dab399f8f32d5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L257]
 is checking that the number of columns matches, which fails because of the virtual 
partition columns.

Actually getting data back from this query will be dependent on SPARK-28098, 
supporting subdirectories for Hive queries at all.
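
One possible (untested) way to sidestep the assertion while this is open is to skip 
the native conversion path entirely, as in the related SPARK-28098 snippet; 
{{convertToLogicalRelation}} is only involved when the metastore ORC table is 
converted to Spark's native reader, though subdirectory reads would then still 
depend on SPARK-28098.

{code:java}
// Untested workaround sketch: fall back to the Hive SerDe reader so the
// convertToLogicalRelation assertion is not reached.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("SELECT * FROM ddrinka_sparkbug.testFail").show()
{code}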






[jira] [Updated] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28096:
--
Issue Type: Improvement  (was: Bug)

> Lazy val performance pitfall in Spark SQL LogicalPlans
> --
>
> Key: SPARK-28096
> URL: https://issues.apache.org/jira/browse/SPARK-28096
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The original {{references}} and {{validConstraints}} implementations in a few 
> QueryPlan and Expression classes are methods, which means unnecessary 
> re-computation can happen at times. This PR resolves this problem by making 
> these methods lazy vals.
> We benchmarked this optimization on TPC-DS queries whose planning time is 
> longer than 1s. In the benchmark, we warmed up the queries 5 iterations and 
> then took the average of 10 runs. Results showed that this micro-optimization 
> can improve the end-to-end planning time by 25%.






[jira] [Created] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables

2019-06-18 Thread Douglas Drinka (JIRA)
Douglas Drinka created SPARK-28098:
--

 Summary: Native ORC reader doesn't support subdirectories with 
Hive tables
 Key: SPARK-28098
 URL: https://issues.apache.org/jira/browse/SPARK-28098
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Douglas Drinka


The Hive ORC reader supports recursive directory reads from S3.  Spark's native 
ORC reader supports recursive directory reads, but not when used with Hive.

 
{code:java}
val testData = List(1,2,3,4,5)
val dataFrame = testData.toDF()
dataFrame
.coalesce(1)
.write
.mode(SaveMode.Overwrite)
.format("orc")
.option("compression", "zlib")
.save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")

spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS 
ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")

spark.conf.set("hive.mapred.supports.subdirectories","true")
spark.conf.set("mapred.input.dir.recursive","true")
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")

spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
//0

spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
//5{code}






[jira] [Assigned] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28097:


Assignee: Apache Spark

> Map ByteType to SMALLINT when using JDBC with PostgreSQL
> 
>
> Key: SPARK-28097
> URL: https://issues.apache.org/jira/browse/SPARK-28097
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Seth Fitzsimmons
>Assignee: Apache Spark
>Priority: Minor
>
> PostgreSQL doesn't have {{TINYINT}}, which would map directly, but 
> {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when 
> writing).
> This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}.






[jira] [Assigned] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28097:


Assignee: (was: Apache Spark)

> Map ByteType to SMALLINT when using JDBC with PostgreSQL
> 
>
> Key: SPARK-28097
> URL: https://issues.apache.org/jira/browse/SPARK-28097
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Seth Fitzsimmons
>Priority: Minor
>
> PostgreSQL doesn't have {{TINYINT}}, which would map directly, but 
> {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when 
> writing).
> This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}.






[jira] [Commented] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL

2019-06-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867004#comment-16867004
 ] 

Apache Spark commented on SPARK-28097:
--

User 'mojodna' has created a pull request for this issue:
https://github.com/apache/spark/pull/24845

> Map ByteType to SMALLINT when using JDBC with PostgreSQL
> 
>
> Key: SPARK-28097
> URL: https://issues.apache.org/jira/browse/SPARK-28097
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Seth Fitzsimmons
>Priority: Minor
>
> PostgreSQL doesn't have {{TINYINT}}, which would map directly, but 
> {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when 
> writing).
> This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}.






[jira] [Created] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL

2019-06-18 Thread Seth Fitzsimmons (JIRA)
Seth Fitzsimmons created SPARK-28097:


 Summary: Map ByteType to SMALLINT when using JDBC with PostgreSQL
 Key: SPARK-28097
 URL: https://issues.apache.org/jira/browse/SPARK-28097
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.4.3, 2.3.3
Reporter: Seth Fitzsimmons


PostgreSQL doesn't have {{TINYINT}}, which would map directly, but 
{{SMALLINT}}s are sufficient for uni-directional translation (i.e. when 
writing).

This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}.
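
A small sketch of the equivalent user-side cast mentioned above; the column name, 
table name and JDBC options are placeholders for illustration.

{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.ShortType
import spark.implicits._  // shell-style; `spark` is the active SparkSession

// Toy frame with a ByteType column.
val df = Seq((1.toByte, "a"), (2.toByte, "b")).toDF("byteColumn", "label")

// Until the mapping lands, the byte column can be widened by hand before the write,
// which is the cast the description refers to.
df.withColumn("byteColumn", col("byteColumn").cast(ShortType))
  .write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb") // placeholder URL
  .option("dbtable", "target_table")                   // placeholder table
  .save()
{code}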






[jira] [Commented] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans

2019-06-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866995#comment-16866995
 ] 

Apache Spark commented on SPARK-28096:
--

User 'yeshengm' has created a pull request for this issue:
https://github.com/apache/spark/pull/24866

> Lazy val performance pitfall in Spark SQL LogicalPlans
> --
>
> Key: SPARK-28096
> URL: https://issues.apache.org/jira/browse/SPARK-28096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The original {{references}} and {{validConstraints}} implementations in a few 
> QueryPlan and Expression classes are methods, which means unnecessary 
> re-computation can happen at times. This PR resolves this problem by making 
> these methods lazy vals.
> We benchmarked this optimization on TPC-DS queries whose planning time is 
> longer than 1s. In the benchmark, we warmed up the queries 5 iterations and 
> then took the average of 10 runs. Results showed that this micro-optimization 
> can improve the end-to-end planning time by 25%.






[jira] [Assigned] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28096:


Assignee: (was: Apache Spark)

> Lazy val performance pitfall in Spark SQL LogicalPlans
> --
>
> Key: SPARK-28096
> URL: https://issues.apache.org/jira/browse/SPARK-28096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The original {{references}} and {{validConstraints}} implementations in a few 
> QueryPlan and Expression classes are methods, which means unnecessary 
> re-computation can happen at times. This PR resolves this problem by making 
> these methods lazy vals.
> We benchmarked this optimization on TPC-DS queries whose planning time is 
> longer than 1s. In the benchmark, we warmed up the queries 5 iterations and 
> then took the average of 10 runs. Results showed that this micro-optimization 
> can improve the end-to-end planning time by 25%.






[jira] [Assigned] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28096:


Assignee: Apache Spark

> Lazy val performance pitfall in Spark SQL LogicalPlans
> --
>
> Key: SPARK-28096
> URL: https://issues.apache.org/jira/browse/SPARK-28096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Apache Spark
>Priority: Major
>
> The original {{references}} and {{validConstraints}} implementations in a few 
> QueryPlan and Expression classes are methods, which means unnecessary 
> re-computation can happen at times. This PR resolves this problem by making 
> these methods lazy vals.
> We benchmarked this optimization on TPC-DS queries whose planning time is 
> longer than 1s. In the benchmark, we warmed up the queries 5 iterations and 
> then took the average of 10 runs. Results showed that this micro-optimization 
> can improve the end-to-end planning time by 25%.






[jira] [Updated] (SPARK-28096) Lazy val performance pitfall in Spark SQL LogicalPlans

2019-06-18 Thread Yesheng Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma updated SPARK-28096:
---
Summary: Lazy val performance pitfall in Spark SQL LogicalPlans  (was: 
Performance pitfall in Spark SQL LogicalPlans)

> Lazy val performance pitfall in Spark SQL LogicalPlans
> --
>
> Key: SPARK-28096
> URL: https://issues.apache.org/jira/browse/SPARK-28096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The original {{references}} and {{validConstraints}} implementations in a few 
> QueryPlan and Expression classes are methods, which means unnecessary 
> re-computation can happen at times. This PR resolves this problem by making 
> these methods lazy vals.
> We benchmarked this optimization on TPC-DS queries whose planning time is 
> longer than 1s. In the benchmark, we warmed up the queries 5 iterations and 
> then took the average of 10 runs. Results showed that this micro-optimization 
> can improve the end-to-end planning time by 25%.






[jira] [Created] (SPARK-28096) Performance pitfall in Spark SQL LogicalPlans

2019-06-18 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-28096:
--

 Summary: Performance pitfall in Spark SQL LogicalPlans
 Key: SPARK-28096
 URL: https://issues.apache.org/jira/browse/SPARK-28096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


The original {{references}} and {{validConstraints}} implementations in a few 
QueryPlan and Expression classes are methods, which means unnecessary 
re-computation can happen at times. This PR resolves this problem by making 
these methods lazy vals.

We benchmarked this optimization on TPC-DS queries whose planning time is 
longer than 1s. In the benchmark, we warmed up the queries 5 iterations and 
then took the average of 10 runs. Results showed that this micro-optimization 
can improve the end-to-end planning time by 25%.
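
A toy illustration of the {{def}}-versus-{{lazy val}} difference being described; 
the class below is made up for the example and is not Spark's {{QueryPlan}} or 
{{Expression}}.

{code:java}
// Made-up class to show the caching difference; not Spark code.
final class Node(val name: String, val children: Seq[Node]) {
  // A def re-walks the whole subtree on every call.
  def referencesAsDef: Set[String] =
    children.flatMap(_.referencesAsDef).toSet + name

  // A lazy val walks the subtree once per instance, then returns the cached result.
  lazy val referencesAsLazyVal: Set[String] =
    children.flatMap(_.referencesAsLazyVal).toSet + name
}

val leaf = new Node("leaf", Nil)
val root = new Node("root", Seq(leaf, leaf))

root.referencesAsDef      // recomputes the traversal on each call
root.referencesAsLazyVal  // traverses once; later calls hit the cached value
{code}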






[jira] [Updated] (SPARK-28095) Pyspark with kubernetes doesn't parse arguments with spaces as expected.

2019-06-18 Thread Emma Dickson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emma Dickson updated SPARK-28095:
-
Description: 
When passing in arguments to a bash script that sets up spark submit using a 
python file that sets up a pyspark context strings with spaces are processed as 
individual strings. This occurs even when the argument is encased in double 
quotes, using backslashes or unicode escape characters.

 

Example

Command entered: This uses an IBM-specific driver, 
[stocator|https://github.com/CODAIT/stocator], hence the cos url
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args 
"cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
--job pages{code}
 

Error Message

 
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
/opt/spark/work-dir/main.py --job-args 
cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job 
pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}

  was:
When passing in arguments to a bash script that sets up spark submit using a 
python file that sets up a pyspark context strings with spaces are processed as 
individual strings. This occurs even when the argument is encased in double 
quotes, using backslashes or unicode escape characters.

 

Example

Command entered: This uses and IBM specific driver hence the cos url
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args 
"cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
--job pages{code}
 

Error Message

 
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
/opt/spark/work-dir/main.py --job-args 
cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job 
pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}


> Pyspark with kubernetes doesn't parse arguments with spaces as expected.
> 
>
> Key: SPARK-28095
> URL: https://issues.apache.org/jira/browse/SPARK-28095
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.3
> Environment: Python 2.7.13
> Spark 2.4.3
> Kubernetes
>  
>Reporter: Emma Dickson
>Priority: Minor
>  Labels: newbie, usability
>
> When passing in arguments to a bash script that sets up spark submit using a 
> python file that sets up a pyspark context, strings with spaces are processed 
> as individual strings. This occurs even when the argument is encased in 
> double quotes, using backslashes or unicode escape characters.
>  
> Example
> Command entered: This uses an IBM-specific driver, 
> [stocator|https://github.com/CODAIT/stocator], hence the cos url
> {code:java}
> ./scripts/spark-k8s.sh v0.0.32 --job-args 
> "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
> --job pages{code}
>  
> Error Message
>  
> {code:java}
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
> /opt/spark/work-dir/main.py --job-args 
> cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer 
> --job pages
> 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of 
> HTTPS.
> 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> usage: main.py [-h] --job JOB --job-args JOB_ARGS
> main.py: error: unrecognized arguments: Balancer
> {code}






[jira] [Resolved] (SPARK-27783) Add customizable hint error handler

2019-06-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27783.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 3.0.0

> Add customizable hint error handler
> ---
>
> Key: SPARK-27783
> URL: https://issues.apache.org/jira/browse/SPARK-27783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add a customizable hint error handler, with default behavior as logging 
> warnings.






[jira] [Updated] (SPARK-28095) Pyspark with kubernetes doesn't parse arguments with spaces as expected.

2019-06-18 Thread Emma Dickson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emma Dickson updated SPARK-28095:
-
Description: 
When passing in arguments to a bash script that sets up spark submit using a 
python file that sets up a pyspark context, strings with spaces are processed as 
individual strings. This occurs even when the argument is encased in double 
quotes, using backslashes or unicode escape characters.

 

Example

Command entered: This uses an IBM-specific driver, hence the cos url
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args 
"cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
--job pages{code}
 

Error Message

 
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
/opt/spark/work-dir/main.py --job-args 
cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job 
pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}

  was:
When passing in arguments to a bash script that sets up spark submit using a 
python file that sets up a pyspark context strings with spaces are processed as 
individual strings. This occurs even when the argument is encased in double 
quotes, using backslashes or unicode escape characters.

 

Example

Command entered
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args 
"cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
--job pages{code}
 

Error Message

 
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
/opt/spark/work-dir/main.py --job-args 
cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job 
pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}


> Pyspark with kubernetes doesn't parse arguments with spaces as expected.
> 
>
> Key: SPARK-28095
> URL: https://issues.apache.org/jira/browse/SPARK-28095
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.3
> Environment: Python 2.7.13
> Spark 2.4.3
> Kubernetes
>  
>Reporter: Emma Dickson
>Priority: Minor
>  Labels: newbie, usability
>
> When passing in arguments to a bash script that sets up spark submit using a 
> python file that sets up a pyspark context, strings with spaces are processed 
> as individual strings. This occurs even when the argument is encased in 
> double quotes, using backslashes or unicode escape characters.
>  
> Example
> Command entered: This uses an IBM-specific driver, hence the cos url
> {code:java}
> ./scripts/spark-k8s.sh v0.0.32 --job-args 
> "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
> --job pages{code}
>  
> Error Message
>  
> {code:java}
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
> /opt/spark/work-dir/main.py --job-args 
> cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer 
> --job pages
> 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of 
> HTTPS.
> 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> usage: main.py [-h] --job JOB --job-args JOB_ARGS
> main.py: error: unrecognized arguments: Balancer
> {code}






[jira] [Updated] (SPARK-28095) Pyspark with kubernetes doesn't parse arguments with spaces as expected.

2019-06-18 Thread Emma Dickson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emma Dickson updated SPARK-28095:
-
Description: 
When passing in arguments to a bash script that sets up spark submit using a 
python file that sets up a pyspark context, strings with spaces are processed as 
individual strings. This occurs even when the argument is encased in double 
quotes, using backslashes or unicode escape characters.

 

Example

Command entered
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args 
"cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
--job pages{code}
 

Error Message

 
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
/opt/spark/work-dir/main.py --job-args 
cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job 
pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}

  was:
When passing in arguments to a bash script that calls a python file which sets 
up a pyspark context strings with spaces are processed as individual strings 
even when encased in double quotes, using backslashes or unicode escape 
characters.

 

Example

Command entered
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args 
"cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
--job pages{code}
 

Error Message

 
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
/opt/spark/work-dir/main.py --job-args 
cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job 
pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}


> Pyspark with kubernetes doesn't parse arguments with spaces as expected.
> 
>
> Key: SPARK-28095
> URL: https://issues.apache.org/jira/browse/SPARK-28095
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.3
> Environment: Python 2.7.13
> Spark 2.4.3
> Kubernetes
>  
>Reporter: Emma Dickson
>Priority: Minor
>  Labels: newbie, usability
>
> When passing in arguments to a bash script that sets up spark submit using a 
> python file that sets up a pyspark context, strings with spaces are processed 
> as individual strings. This occurs even when the argument is encased in 
> double quotes, using backslashes or unicode escape characters.
>  
> Example
> Command entered
> {code:java}
> ./scripts/spark-k8s.sh v0.0.32 --job-args 
> "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
> --job pages{code}
>  
> Error Message
>  
> {code:java}
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
> /opt/spark/work-dir/main.py --job-args 
> cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer 
> --job pages
> 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of 
> HTTPS.
> 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> usage: main.py [-h] --job JOB --job-args JOB_ARGS
> main.py: error: unrecognized arguments: Balancer
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28095) Pyspark with kubernetes doesn't parse arguments with spaces as expected.

2019-06-18 Thread Emma Dickson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emma Dickson updated SPARK-28095:
-
Summary: Pyspark with kubernetes doesn't parse arguments with spaces as 
expected.  (was: Pyspark on kubernetes doesn't parse arguments with spaces as 
expected.)

> Pyspark with kubernetes doesn't parse arguments with spaces as expected.
> 
>
> Key: SPARK-28095
> URL: https://issues.apache.org/jira/browse/SPARK-28095
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.3
> Environment: Python 2.7.13
> Spark 2.4.3
> Kubernetes
>  
>Reporter: Emma Dickson
>Priority: Minor
>  Labels: newbie, usability
>
> When arguments are passed to a bash script that calls a python file which 
> sets up a pyspark context, strings containing spaces are split into separate 
> arguments, even when enclosed in double quotes or escaped with backslashes or 
> unicode escape characters.
>  
> Example
> Command entered
> {code:java}
> ./scripts/spark-k8s.sh v0.0.32 --job-args 
> "cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
> --job pages{code}
>  
> Error Message
>  
> {code:java}
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
> /opt/spark/work-dir/main.py --job-args 
> cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer 
> --job pages
> 19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of 
> HTTPS.
> 19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> usage: main.py [-h] --job JOB --job-args JOB_ARGS
> main.py: error: unrecognized arguments: Balancer
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28095) Pyspark on kubernetes doesn't parse arguments with spaces as expected.

2019-06-18 Thread Emma Dickson (JIRA)
Emma Dickson created SPARK-28095:


 Summary: Pyspark on kubernetes doesn't parse arguments with spaces 
as expected.
 Key: SPARK-28095
 URL: https://issues.apache.org/jira/browse/SPARK-28095
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, PySpark
Affects Versions: 2.4.3
 Environment: Python 2.7.13

Spark 2.4.3

Kubernetes

 
Reporter: Emma Dickson


When arguments are passed to a bash script that calls a python file which sets 
up a pyspark context, strings containing spaces are split into separate 
arguments, even when enclosed in double quotes or escaped with backslashes or 
unicode escape characters.

 

Example

Command entered
{code:java}
./scripts/spark-k8s.sh v0.0.32 --job-args 
"cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer" 
--job pages{code}
 

Error Message

 
{code:java}
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.30.83.253 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
/opt/spark/work-dir/main.py --job-args 
cos://waas-logentries.mycos/Logentries/IBM-b634032e/Github/Load Balancer --job 
pages
19/06/18 19:28:35 WARN Utils: Kubernetes master URL uses HTTP instead of HTTPS.
19/06/18 19:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
usage: main.py [-h] --job JOB --job-args JOB_ARGS
main.py: error: unrecognized arguments: Balancer
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2019-06-18 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866973#comment-16866973
 ] 

M. Le Bihan edited comment on SPARK-18105 at 6/18/19 7:10 PM:
--

My trick eventually didn't succeed, and I fell back into the bug again.

I attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala but 
received this kind of stacktrace:

{code:log}
2019-06-18 20:43:54.747  INFO 1539 --- [er for task 547] 
o.a.s.s.ShuffleBlockFetcherIterator  : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] 
org.apache.spark.executor.Executor   : Exception in task 93.0 in stage 4.2 
(TID 547)

java.lang.NullPointerException: null
at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.Task.run(Task.scala:121) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[na:1.8.0_212]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[na:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]

{code}

This issue, SPARK-18105, prevents using Spark at all.
{color:#654982}Spark +cannot+ handle the large files (1 GB, 5 GB, 10 GB) that I 
ask it to join for me.{color} When will this be corrected?! It should be raised 
to urgent.

With 300 open and in-progress (but stalling) issues, Spark is becoming less and 
less reliable every day.
I'm about to send a message on the dev forum to ask whether developers can stop 
implementing new features until they have fixed the issues in the features they 
have already written.

Today, Spark cannot be used at all.
At least offer a way to disable the LZ4 feature if it doesn't work!
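
For what it's worth, the codec used for shuffle and stream compression is 
configurable, so LZ4 can be taken out of the picture while this is investigated. 
A sketch only, assuming the job is free to set its own configuration:
{code:python}
# spark.io.compression.codec selects the codec used for internal data such as
# shuffle outputs; switching it avoids the LZ4 block stream entirely.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lz4-workaround-check")
    .config("spark.io.compression.codec", "snappy")  # or "lzf" / "zstd"
    .getOrCreate()
)
{code}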


was (Author: mlebihan):
My trick eventually didn't succeed, and I fell back into the bug again.

I attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala but 
received this kind of stacktrace:

{code:log}
2019-06-18 20:43:54.747  INFO 1539 --- [er for task 547] 
o.a.s.s.ShuffleBlockFetcherIterator  : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] 
org.apache.spark.executor.Executor   : Exception in task 93.0 in stage 4.2 
(TID 547)

java.lang.NullPointerException: null
at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 

[jira] [Commented] (SPARK-28092) Spark cannot load files with COLON(:) char if not specified full path

2019-06-18 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866978#comment-16866978
 ] 

Steve Loughran commented on SPARK-28092:


BTW, the code snippets say ("s3://bucket/prefix/*", ...); I presume that's 
really s3a://, as s3:// URLs mean EMR and "Amazon's private fork gets to field 
the support call".

> Spark cannot load files with COLON(:) char if not specified full path
> -
>
> Key: SPARK-28092
> URL: https://issues.apache.org/jira/browse/SPARK-28092
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: Cloudera 6.2
> Spark latest parcel (I think 2.4.3)
>Reporter: Ladislav Jech
>Priority: Major
>
> Scenario:
> I have CSV files in an S3 bucket like this:
> s3a://bucket/prefix/myfile_2019:04:05.csv
> s3a://bucket/prefix/myfile_2019:04:06.csv
> Now when I try to load files with something like:
> df = spark.read.load("s3://bucket/prefix/*", format="csv", sep=":", 
> inferSchema="true", header="true")
>  
> It fails with an error about the URI (sorry, I don't have the exact exception 
> here), but when I list all the files from S3 and provide the paths as an array:
> df = 
> spark.read.load(path=["s3://bucket/prefix/myfile_2019:04:05.csv","s3://bucket/prefix/myfile_2019:04:05.csv"],
>  format="csv", sep=":", inferSchema="true", header="true")
>  
> It works. The reason is the COLON character in the file names, as per my 
> observations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28092) Spark cannot load files with COLON(:) char if not specified full path

2019-06-18 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866977#comment-16866977
 ] 

Steve Loughran commented on SPARK-28092:


You're going to have to see if you can replicate this on the real ASF artifacts; 
otherwise push it to your supplier of built Spark artifacts.

FWIW, problems with ":" in filenames in paths are known: HADOOP-14829 and 
HADOOP-14217, for example. HDFS has never allowed ":" in names, so nobody even 
noticed that Filesystem.globFiles doesn't work.

While nobody actually wants to block this, it's not something which, AFAIK, 
anyone is actively working on. Patches welcome, *with tests*.

> Spark cannot load files with COLON(:) char if not specified full path
> -
>
> Key: SPARK-28092
> URL: https://issues.apache.org/jira/browse/SPARK-28092
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: Cloudera 6.2
> Spark latest parcel (I think 2.4.3)
>Reporter: Ladislav Jech
>Priority: Major
>
> Scenario:
> I have CSV files in an S3 bucket like this:
> s3a://bucket/prefix/myfile_2019:04:05.csv
> s3a://bucket/prefix/myfile_2019:04:06.csv
> Now when I try to load files with something like:
> df = spark.read.load("s3://bucket/prefix/*", format="csv", sep=":", 
> inferSchema="true", header="true")
>  
> It fails with an error about the URI (sorry, I don't have the exact exception 
> here), but when I list all the files from S3 and provide the paths as an array:
> df = 
> spark.read.load(path=["s3://bucket/prefix/myfile_2019:04:05.csv","s3://bucket/prefix/myfile_2019:04:05.csv"],
>  format="csv", sep=":", inferSchema="true", header="true")
>  
> It works. The reason is the COLON character in the file names, as per my 
> observations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2019-06-18 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866973#comment-16866973
 ] 

M. Le Bihan edited comment on SPARK-18105 at 6/18/19 7:06 PM:
--

My trick eventually didn't succeed, and I fell back into the bug again.

I attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala but 
received this kind of stacktrace:

{code:log}
2019-06-18 20:43:54.747  INFO 1539 --- [er for task 547] 
o.a.s.s.ShuffleBlockFetcherIterator  : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] 
org.apache.spark.executor.Executor   : Exception in task 93.0 in stage 4.2 
(TID 547)

java.lang.NullPointerException: null
at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.Task.run(Task.scala:121) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[na:1.8.0_212]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[na:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]

{code}

This issue, SPARK-18105, prevents using Spark at all.
{color:#654982}Spark +cannot+ handle the large files (1 GB, 5 GB, 10 GB) that I 
ask it to join for me.{color} When will this be corrected?! It should be raised 
to urgent.

With 300 open and in-progress (but stalling) issues, Spark is becoming less and 
less reliable every day.
I'm about to send a message on the dev forum to ask whether developers can stop 
implementing new features until they have fixed the issues in the features they 
have already written.

Today, Spark cannot be used at all.
At least find a way to disable the LZ4 feature if you can't handle it properly.


was (Author: mlebihan):
My trick eventually didn't succeed, and I fell back into the bug again.

I attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala but 
received this kind of stacktrace:

{code:log}
2019-06-18 20:43:54.747  INFO 1539 --- [er for task 547] 
o.a.s.s.ShuffleBlockFetcherIterator  : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] 
org.apache.spark.executor.Executor   : Exception in task 93.0 in stage 4.2 
(TID 547)

java.lang.NullPointerException: null
at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 

[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2019-06-18 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866973#comment-16866973
 ] 

M. Le Bihan commented on SPARK-18105:
-

My trick eventually didn't succeed, and I fell back into the bug again.

I attempted to upgrade from spark-xxx_2.11 to spark-xxx_2.12 for Scala but 
received this kind of stacktrace:

{code:log}
2019-06-18 20:43:54.747  INFO 1539 --- [er for task 547] 
o.a.s.s.ShuffleBlockFetcherIterator  : Started 0 remote fetches in 0 ms
2019-06-18 20:43:59.015 ERROR 1539 --- [er for task 547] 
org.apache.spark.executor.Executor   : Exception in task 93.0 in stage 4.2 
(TID 547)

java.lang.NullPointerException: null
at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:757)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 
~[scala-library-2.12.8.jar!/:na]
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.scheduler.Task.run(Task.scala:121) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
 ~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) 
~[spark-core_2.12-2.4.3.jar!/:2.4.3]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[na:1.8.0_212]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[na:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]

{code}

This issue makes Spark unusable. It should be raised to urgent.
With 300 open and in-progress (but stalling) issues, Spark is becoming less and 
less reliable every day.
I'm about to send a message on the dev forum to ask whether developers can stop 
implementing new features until they have fixed the issues in the features they 
have already written.

Today, Spark cannot be used at all.
At least find a way to disable the LZ4 feature if you can't handle it properly.

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Major
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>

[jira] [Assigned] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28084:


Assignee: (was: Apache Spark)

> LOAD DATA command resolving the partition column name considering case 
> sensitive manner 
> ---
>
> Key: SPARK-28084
> URL: https://issues.apache.org/jira/browse/SPARK-28084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Sujith Chacko
>Priority: Major
> Attachments: parition_casesensitive.PNG
>
>
> The LOAD DATA command resolves the partition column name in a case-sensitive 
> manner, whereas the INSERT command resolves it in a case-insensitive manner.
> Refer to the attached snapshot for more details.
> !image-2019-06-18-00-04-22-475.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28084:


Assignee: Apache Spark

> LOAD DATA command resolving the partition column name considering case 
> sensitive manner 
> ---
>
> Key: SPARK-28084
> URL: https://issues.apache.org/jira/browse/SPARK-28084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Sujith Chacko
>Assignee: Apache Spark
>Priority: Major
> Attachments: parition_casesensitive.PNG
>
>
> The LOAD DATA command resolves the partition column name in a case-sensitive 
> manner, whereas the INSERT command resolves it in a case-insensitive manner.
> Refer to the attached snapshot for more details.
> !image-2019-06-18-00-04-22-475.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2019-06-18 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861803#comment-16861803
 ] 

M. Le Bihan edited comment on SPARK-18105 at 6/18/19 4:37 PM:
--

It _seems_ that switching from org.lz4:lz4-java:1.4.0 to 
org.lz4:lz4-java:1.6.0 helps.

 
{code:xml}
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>

        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>

            <exclusion>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
            </exclusion>

            <exclusion>
                <groupId>org.lz4</groupId>
                <artifactId>lz4-java</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>org.lz4</groupId>
        <artifactId>lz4-java</artifactId>
        <version>1.6.0</version>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>log4j-over-slf4j</artifactId>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
{code}
If that's right, the cause _could be_ a bug they corrected in 1.4.1.

Didn't succeed.


was (Author: mlebihan):
It _seems_ that switching from org.lz4:lz4-java:1.4.0 to 
org.lz4:lz4-java:1.6.0 helps.

 
{code:xml}
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>

        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>

            <exclusion>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
            </exclusion>

            <exclusion>
                <groupId>org.lz4</groupId>
                <artifactId>lz4-java</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>org.lz4</groupId>
        <artifactId>lz4-java</artifactId>
        <version>1.6.0</version>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>log4j-over-slf4j</artifactId>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
{code}
If that's right, the cause _could be_ a bug they corrected in 1.4.1.

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Major
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 

[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

2019-06-18 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866777#comment-16866777
 ] 

Liang-Chi Hsieh commented on SPARK-28079:
-

Isn't it the expected behavior as documented in {{PERMISSIVE}} mode of CSV?

> CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is 
> manually added to the schema
> -
>
> Key: SPARK-28079
> URL: https://issues.apache.org/jira/browse/SPARK-28079
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.3
>Reporter: F Jimenez
>Priority: Major
>
> When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged 
> as such and are read in. The only way to get them flagged is to manually set 
> "columnNameOfCorruptRecord" AND manually set the schema, including this 
> column. Example:
> {code:java}
> // Second row has a 4th column that is not declared in the header/schema
> val csvText = s"""
>  | FieldA, FieldB, FieldC
>  | a1,b1,c1
>  | a2,b2,c2,d*""".stripMargin
> val csvFile = new File("/tmp/file.csv")
> FileUtils.write(csvFile, csvText)
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
>   .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> This produces the correct result:
> {code:java}
> ++--+--+--+
> |corrupt |fieldA|fieldB|fieldC|
> ++--+--+--+
> |null    | a1   |b1    |c1    |
> | a2,b2,c2,d*| a2   |b2    |c2    |
> ++--+--+--+
> {code}
> However removing the "schema" option and going:
> {code:java}
> val reader = sqlContext.read
>   .format("csv")
>   .option("header", "true")
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "corrupt")
> reader.load(csvFile.getAbsolutePath).show(truncate = false)
> {code}
> Yields:
> {code:java}
> +---+---+---+
> | FieldA| FieldB| FieldC|
> +---+---+---+
> | a1    |b1 |c1 |
> | a2    |b2 |c2 |
> +---+---+---+
> {code}
> The fourth value "d*" in the second row has been removed and the row is not 
> marked as corrupt.
>  
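
For reference, a PySpark sketch of the same two cases (assuming the /tmp/file.csv 
produced by the snippet in the description); as documented, PERMISSIVE mode only 
surfaces the corrupt record when the corrupt-record column is part of the schema:
{code:python}
# Sketch only: mirrors the Scala repro above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

with_schema = (
    spark.read.format("csv")
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "corrupt")
    .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
    .load("/tmp/file.csv")   # the bad row lands in the "corrupt" column
)

without_schema = (
    spark.read.format("csv")
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "corrupt")
    .load("/tmp/file.csv")   # inferred schema has no "corrupt" column,
)                            # so the extra value is silently dropped

with_schema.show(truncate=False)
without_schema.show(truncate=False)
{code}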



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28094) Multiple left joins or aggregations in one query produce incorrect results

2019-06-18 Thread Joe Ammann (JIRA)
Joe Ammann created SPARK-28094:
--

 Summary: Multiple left joins or aggregations in one query produce 
incorrect results
 Key: SPARK-28094
 URL: https://issues.apache.org/jira/browse/SPARK-28094
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Joe Ammann


Structured Streaming queries which involve more than one non-map-like 
transformation (either left joins or aggregations) produce incorrect (and 
sometimes inconsistent) results.

I have put up a sample here as a Gist: 
https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17

The test case involves 4 topics, and I've implemented 6 different use cases, of 
which 3 fail to produce the expected results:

1) A innerjoin B: to topic A_B
expected: 10 inner joined messages
observation: OK

2) A innerjoin B outerjoin C: to topic A_B_outer_C 
expected: 9 inner joined messages, last one in watermark
observation: OK

3) A innerjoin B outerjoin C outerjoin D: to topic A_B_outer_C_outer_D
expected: 9 inner/outer joined messages, 3 match C, 1 match D, last one 
in watermark
observation: NOK, only 3 messages are produced, where the C outer join 
matches

4) B outerjoin C: to topic B_outer_C 
expected: 9 outer joined messages, 3 match C, 1 match D, last one in 
watermark
observation: NOK, same as 3

5) A innerjoin B agg on field of A: to topic A_inner_B_agg
expected: 3 groups with a total of 9 inner joined messages, last one in 
watermark
observation: OK

6) A innerjoin B outerjoin C agg on field of A: to topic A_inner_B_outer_C_agg
expected: 3 groups with a total of 9 inner joined messages, last one in 
watermark
observation: NOK, 3 groups with 1 message each, where C outer join 
matches




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr

2019-06-18 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28093:

Affects Version/s: (was: 2.4.3)
   2.3.3

> Built-in function trim/ltrim/rtrim has bug when using trimStr
> -
>
> Key: SPARK-28093
> URL: https://issues.apache.org/jira/browse/SPARK-28093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
> z
> spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
> xyz
> spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
> xy
> spark-sql>
> {noformat}
> {noformat}
> postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
>  btrim | btrim
> ---+---
>  Tom   | bar
> (1 row)
> postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
>  ltrim |ltrim
> ---+--
>  test  | XxyLAST WORD
> (1 row)
> postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
>  rtrim |   rtrim
> ---+---
>  test  | TURNERyxX
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr

2019-06-18 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28093:

Affects Version/s: (was: 3.0.0)
   2.4.3

> Built-in function trim/ltrim/rtrim has bug when using trimStr
> -
>
> Key: SPARK-28093
> URL: https://issues.apache.org/jira/browse/SPARK-28093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
> z
> spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
> xyz
> spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
> xy
> spark-sql>
> {noformat}
> {noformat}
> postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
>  btrim | btrim
> ---+---
>  Tom   | bar
> (1 row)
> postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
>  ltrim |ltrim
> ---+--
>  test  | XxyLAST WORD
> (1 row)
> postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
>  rtrim |   rtrim
> ---+---
>  test  | TURNERyxX
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28093:


Assignee: Apache Spark

> Built-in function trim/ltrim/rtrim has bug when using trimStr
> -
>
> Key: SPARK-28093
> URL: https://issues.apache.org/jira/browse/SPARK-28093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {noformat}
> spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
> z
> spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
> xyz
> spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
> xy
> spark-sql>
> {noformat}
> {noformat}
> postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
>  btrim | btrim
> ---+---
>  Tom   | bar
> (1 row)
> postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
>  ltrim |ltrim
> ---+--
>  test  | XxyLAST WORD
> (1 row)
> postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
>  rtrim |   rtrim
> ---+---
>  test  | TURNERyxX
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28093:


Assignee: (was: Apache Spark)

> Built-in function trim/ltrim/rtrim has bug when using trimStr
> -
>
> Key: SPARK-28093
> URL: https://issues.apache.org/jira/browse/SPARK-28093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
> z
> spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
> xyz
> spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
> xy
> spark-sql>
> {noformat}
> {noformat}
> postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
>  btrim | btrim
> ---+---
>  Tom   | bar
> (1 row)
> postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
>  ltrim |ltrim
> ---+--
>  test  | XxyLAST WORD
> (1 row)
> postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
>  rtrim |   rtrim
> ---+---
>  test  | TURNERyxX
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28093) Built-in function trim/ltrim/rtrim has bug when using trimStr

2019-06-18 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28093:
---

 Summary: Built-in function trim/ltrim/rtrim has bug when using 
trimStr
 Key: SPARK-28093
 URL: https://issues.apache.org/jira/browse/SPARK-28093
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


{noformat}
spark-sql> SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
z
spark-sql> SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
xyz
spark-sql> SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
xy
spark-sql>
{noformat}

{noformat}
postgres=# SELECT trim('yxTomxx', 'xyz'), trim('xxxbarxxx', 'x');
 btrim | btrim
---+---
 Tom   | bar
(1 row)

postgres=# SELECT ltrim('zzzytest', 'xyz'), ltrim('xyxXxyLAST WORD', 'xy');
 ltrim |ltrim
---+--
 test  | XxyLAST WORD
(1 row)

postgres=# SELECT rtrim('testxxzx', 'xyz'), rtrim('TURNERyxXxy', 'xy');
 rtrim |   rtrim
---+---
 test  | TURNERyxX
(1 row)
{noformat}
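
For comparison, the PostgreSQL results above follow trim-set semantics (the 
second argument is a set of characters to strip), which can be sketched with 
plain Python string methods:
{code:python}
# Plain-Python illustration of the expected behaviour; str.strip/lstrip/rstrip
# also take a set of characters, like btrim/ltrim/rtrim in PostgreSQL.
assert "yxTomxx".strip("xyz") == "Tom"
assert "xxxbarxxx".strip("x") == "bar"
assert "zzzytest".lstrip("xyz") == "test"
assert "testxxzx".rstrip("xyz") == "test"
{code}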






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28077) ANSI SQL: OVERLAY function(T312)

2019-06-18 Thread jiaan.geng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866351#comment-16866351
 ] 

jiaan.geng commented on SPARK-28077:


I'm working on it.

> ANSI SQL: OVERLAY function(T312)
> 
>
> Key: SPARK-28077
> URL: https://issues.apache.org/jira/browse/SPARK-28077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{overlay(_string_ placing _string_ from }}{{int}}{{[for 
> {{int}}])}}|{{text}}|Replace substring|{{overlay('Tas' placing 'hom' from 
> 2 for 4)}}|{{Thomas}}|
> For example:
> {code:sql}
> SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";
> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";
> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";
> SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html
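
A plain-Python sketch of the requested T312 OVERLAY semantics (1-based start 
position; the replaced length defaults to the length of the replacement string), 
matching the examples above:
{code:python}
def overlay(s, replacement, start, length=None):
    # Replace `length` characters of `s`, beginning at 1-based position `start`,
    # with `replacement`; `length` defaults to len(replacement).
    if length is None:
        length = len(replacement)
    return s[:start - 1] + replacement + s[start - 1 + length:]

assert overlay("abcdef", "45", 4) == "abc45f"
assert overlay("yabadoo", "daba", 5) == "yabadaba"
assert overlay("yabadoo", "daba", 5, 0) == "yabadabadoo"
assert overlay("babosa", "ubb", 2, 4) == "bubba"
{code}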



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28073) ANSI SQL: Character literals

2019-06-18 Thread jiaan.geng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866344#comment-16866344
 ] 

jiaan.geng commented on SPARK-28073:


I'm working on it.

> ANSI SQL: Character literals
> 
>
> Key: SPARK-28073
> URL: https://issues.apache.org/jira/browse/SPARK-28073
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Feature ID||Feature Name||Feature Description||
> |E021-03|Character literals|— Subclause 5.3, “”:  [ 
> ... ] |
> Example:
> {code:sql}
> SELECT 'first line'
> ' - next line'
>   ' - third line'
>   AS "Three lines to one";
> {code}
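
The requested E021-03 behaviour (adjacent character literals concatenated into a 
single literal) is analogous to implicit literal concatenation in Python; a 
small illustration, not Spark code:
{code:python}
# Adjacent string literals are joined at parse time, which is the result the
# SQL example above expects for its three lines.
three_lines = (
    "first line"
    " - next line"
    " - third line"
)
assert three_lines == "first line - next line - third line"
{code}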



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28077) ANSI SQL: OVERLAY function(T312)

2019-06-18 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28077:

Summary: ANSI SQL: OVERLAY function(T312)  (was: String Functions: Add 
support OVERLAY)

> ANSI SQL: OVERLAY function(T312)
> 
>
> Key: SPARK-28077
> URL: https://issues.apache.org/jira/browse/SPARK-28077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{overlay(_string_ placing _string_ from }}{{int}}{{[for 
> {{int}}])}}|{{text}}|Replace substring|{{overlay('Tas' placing 'hom' from 
> 2 for 4)}}|{{Thomas}}|
> For example:
> {code:sql}
> SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";
> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";
> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";
> SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-18 Thread Luca Canali (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866336#comment-16866336
 ] 

Luca Canali edited comment on SPARK-28091 at 6/18/19 8:07 AM:
--

[~irashid] given your work on SPARK-24918 you may be interested to comment on 
this?


was (Author: lucacanali):
@irashid given your work on SPARK-24918 you may be interested to comment on 
this?

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics systems implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to  monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system which is more flexible and 
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an 
> easy way to “opt in” using such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in Filesystem 
> and other metrics to the Spark monitoring system. Future work on plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc), that maybe would 
> not normally be considered general enough for inclusion in Apache Spark code, 
> but that can be nevertheless useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228
> Tests and examples:
> This has been so far manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-18 Thread Luca Canali (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866336#comment-16866336
 ] 

Luca Canali commented on SPARK-28091:
-

@irashid given your work on SPARK-24918 you may be interested to comment on 
this?

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics systems implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to  monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system which is more flexible and 
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an 
> easy way to “opt in” using such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in Filesystem 
> and other metrics to the Spark monitoring system. Future work on plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc), that maybe would 
> not normally be considered general enough for inclusion in Apache Spark code, 
> but that can be nevertheless useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228
> Tests and examples:
> This has been so far manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28092) Spark cannot load files with COLON(:) char if not specified full path

2019-06-18 Thread Ladislav Jech (JIRA)
Ladislav Jech created SPARK-28092:
-

 Summary: Spark cannot load files with COLON(:) char if not 
specified full path
 Key: SPARK-28092
 URL: https://issues.apache.org/jira/browse/SPARK-28092
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3
 Environment: Cloudera 6.2

Spark latest parcel (I think 2.4.3)
Reporter: Ladislav Jech


Scenario:

I have CSV files in an S3 bucket like this:

s3a://bucket/prefix/myfile_2019:04:05.csv

s3a://bucket/prefix/myfile_2019:04:06.csv

Now when I try to load files with something like:
df = spark.read.load("s3://bucket/prefix/*", format="csv", sep=":", 
inferSchema="true", header="true")
 
It fails with an error about the URI (sorry, I don't have the exact exception 
here), but when I list all the files from S3 and provide the paths as an array:
df = 
spark.read.load(path=["s3://bucket/prefix/myfile_2019:04:05.csv","s3://bucket/prefix/myfile_2019:04:05.csv"],
 format="csv", sep=":", inferSchema="true", header="true")
 
It works. The reason is the COLON character in the file names, as per my 
observations.
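
A hedged workaround sketch (bucket and prefix names are placeholders; assumes 
boto3 is available with the same credentials the job uses): enumerate the keys 
yourself and pass fully qualified paths, which sidesteps the glob handling of 
names containing ':':
{code:python}
# Sketch only: list the objects explicitly instead of globbing, then hand the
# full paths to spark.read.load as in the array-based call above.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucket", Prefix="prefix/")
paths = ["s3a://bucket/" + obj["Key"]
         for obj in resp.get("Contents", [])
         if obj["Key"].endswith(".csv")]

df = spark.read.load(path=paths, format="csv", sep=":",
                     inferSchema="true", header="true")
{code}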



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28091:


Assignee: (was: Apache Spark)

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics systems implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to  monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system which is more flexible and 
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was introduced 
> in Hadoop 2.8). Using a metric plugin for Spark would give those deploying 
> suitable Hadoop versions an easy way to “opt in” to such new API calls.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in Filesystem 
> and other metrics to the Spark monitoring system. Future work on plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might not 
> normally be considered general enough for inclusion in Apache Spark code, but 
> that can nevertheless be useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228.
> Tests and examples:
> So far this has been manually tested by running Spark on YARN and K8S 
> clusters, in particular for monitoring S3 and for extending HDFS 
> instrumentation with the Hadoop Filesystem “GetGlobalStorageStatistics” 
> metrics. An executor metric plugin example and the code used for testing are 
> available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28091:


Assignee: Apache Spark

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics systems implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics (see 
> also SPARK-26890) that are useful for monitoring and troubleshooting Spark 
> workloads. A typical workflow is to sink the metrics to a storage system and 
> build dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system, which is more flexible and 
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was introduced 
> in Hadoop 2.8). Using a metric plugin for Spark would give those deploying 
> suitable Hadoop versions an easy way to “opt in” to such new API calls.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in Filesystem 
> and other metrics to the Spark monitoring system. Future work on plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might not 
> normally be considered general enough for inclusion in Apache Spark code, but 
> that can nevertheless be useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228.
> Tests and examples:
> So far this has been manually tested by running Spark on YARN and K8S 
> clusters, in particular for monitoring S3 and for extending HDFS 
> instrumentation with the Hadoop Filesystem “GetGlobalStorageStatistics” 
> metrics. An executor metric plugin example and the code used for testing are 
> available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2019-06-18 Thread Piotr Chowaniec (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856558#comment-16856558
 ] 

Piotr Chowaniec edited comment on SPARK-18105 at 6/18/19 7:21 AM:
--

I have a similar issue with Spark 2.3.2.

Here is a stack trace:
{code:java}
org.apache.spark.scheduler.DAGScheduler  : ShuffleMapStage 647 (count at 
Step.java:20) failed in 1.908 s due to 
org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithKeys_1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:252)
at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:170)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:349)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1381)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:436)
... 21 more
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 2010 of input 
buffer
at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:247)
... 29 more
{code}
It happens during an ETL process that has about 200 steps. It looks like it 
depends on the input data, because we see the exceptions only in the production 
environment (on test and dev machines the same process runs without problems, 
but with different data). Unfortunately there is no way to use the production 
data in another environment, so we cannot track down the differences.
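
For reference, the codec for the test below was switched via the standard 
spark.io.compression.codec setting ("lz4" is the default); a minimal sketch, 
with the application name being illustrative:
{code:java}
import org.apache.spark.sql.SparkSession

// Switch the shuffle/IO compression codec from the default "lz4" to "snappy".
val spark = SparkSession.builder()
  .appName("codec-comparison")
  .config("spark.io.compression.codec", "snappy")
  .getOrCreate()
{code}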

Changing compression codec to Snappy gives:
{code:java}
o.apache.spark.scheduler.TaskSetManager  : Lost task 0.0 in stage 852.3 (TID 308
36, localhost, executor driver): FetchFailed(BlockManagerId(driver, DNS.domena, 
33588, None), shuffleId=298, mapId=2, reduceId=3, message=
org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62)
at 

[jira] [Created] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-18 Thread Luca Canali (JIRA)
Luca Canali created SPARK-28091:
---

 Summary: Extend Spark metrics system with executor plugin metrics
 Key: SPARK-28091
 URL: https://issues.apache.org/jira/browse/SPARK-28091
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Luca Canali


This proposes to improve Spark instrumentation by adding a hook for Spark 
executor plugin metrics to the Spark metrics systems implemented with the 
Dropwizard/Codahale library.

Context: The Spark metrics system provides a large variety of metrics (see also 
SPARK-26890) that are useful for monitoring and troubleshooting Spark 
workloads. A typical workflow is to sink the metrics to a storage system and 
build dashboards on top of that.

Improvement: The original goal of this work was to add instrumentation for S3 
filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
instruments HDFS and local filesystem metrics. Rather than extending the code 
there, we propose to add a metrics plugin system, which is more flexible and of 
more general use.

Advantages:
 * The metric plugin system makes it easy to implement instrumentation for S3 
access by Spark jobs.
 * The metrics plugin system allows for easy extensions of how Spark collects 
HDFS-related workload metrics. This is currently done using the Hadoop 
Filesystem GetAllStatistics method, which is deprecated in recent versions of 
Hadoop. Recent versions of Hadoop Filesystem recommend using method 
GetGlobalStorageStatistics, which also provides several additional metrics. 
GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was introduced in 
Hadoop 2.8). Using a metric plugin for Spark would give those deploying 
suitable Hadoop versions an easy way to “opt in” to such new API calls.
 * We also have the use case of adding Hadoop filesystem monitoring for a 
custom Hadoop compliant filesystem in use in our organization (EOS using the 
XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
Others may have similar use cases.
 * More generally, this method makes it straightforward to plug in Filesystem 
and other metrics to the Spark monitoring system. Future work on plugin 
implementation can address extending monitoring to measure usage of external 
resources (OS, filesystem, network, accelerator cards, etc.) that might not 
normally be considered general enough for inclusion in Apache Spark code, but 
that can nevertheless be useful for specialized use cases, tests or 
troubleshooting.

Implementation:

The proposed implementation is currently a WIP open for comments and 
improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
builds on recent work on extending Spark executor metrics, such as SPARK-25228.

Tests and examples:

So far this has been manually tested by running Spark on YARN and K8S clusters, 
in particular for monitoring S3 and for extending HDFS instrumentation with the 
Hadoop Filesystem “GetGlobalStorageStatistics” metrics. An executor metric 
plugin example and the code used for testing are available.
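
For illustration only, here is a rough sketch of what a Dropwizard-based 
executor metrics plugin could look like. The ExecutorMetricsPlugin trait below 
is hypothetical and is not the interface proposed in this ticket (which is 
still WIP); only the com.codahale.metrics types are real.
{code:java}
import java.util.concurrent.atomic.AtomicLong

import com.codahale.metrics.{Gauge, MetricRegistry}

// Hypothetical extension point: an executor plugin that is handed the
// executor's Dropwizard MetricRegistry at startup. Illustrative only.
trait ExecutorMetricsPlugin {
  def init(registry: MetricRegistry): Unit
  def shutdown(): Unit = ()
}

// Example plugin exposing a single gauge, e.g. bytes read through a custom
// filesystem. The value source is a stub for the sketch.
class S3BytesReadPlugin extends ExecutorMetricsPlugin {
  private val bytesRead = new AtomicLong(0L)

  override def init(registry: MetricRegistry): Unit = {
    registry.register("plugin.s3a.bytesRead", new Gauge[Long] {
      override def getValue: Long = bytesRead.get()
    })
  }

  // Called by whatever code observes the filesystem statistics.
  def record(bytes: Long): Unit = bytesRead.addAndGet(bytes)
}
{code}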



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28072) Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+

2019-06-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28072.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24889

> Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+
> ---
>
> Key: SPARK-28072
> URL: https://issues.apache.org/jira/browse/SPARK-28072
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> With JDK9+, the generated bytecode of `FromUnixTime` raises 
> `java.lang.IncompatibleClassChangeError` due to 
> [JDK-8145148|https://bugs.openjdk.java.net/browse/JDK-8145148].
> This is a blocker in our JDK11 Jenkins job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28090) Spark hangs when an execution plan has many projections on nested structs

2019-06-18 Thread Ruslan Yushchenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Yushchenko updated SPARK-28090:
--
Description: 
This was already posted (#28016), but the provided example didn't always 
reproduce the error. This example consistently reproduces the issue.

Spark applications freeze at the execution plan optimization stage (Catalyst) 
when a logical execution plan contains a lot of projections that operate on 
nested struct fields.

The code listed below demonstrates the issue.

To reproduce the issue, the Spark app does the following:
 * A small dataframe is created from a JSON example.
 * Several nested transformations (negation of a number) are applied to struct 
fields, and each time a new struct field is created.
 * Once more than 9 such transformations are applied, the Catalyst optimizer 
freezes while optimizing the execution plan.
 * You can control the freezing by choosing a different upper bound for the 
Range: e.g. it will work fine if the upper bound is 5, but will hang if the 
bound is 10 (the shape of that loop is sketched right after this list).
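
For orientation only, a hypothetical sketch of the driver loop described above. 
It reuses `sample` and `transformInsideNestedStruct` from the full listing 
below, which remains the authoritative repro; the master URL, field names and 
output names here are illustrative.
{code:java}
import org.apache.spark.sql.SparkSession

import com.example.SparkApp1IssueSelfContained.{sample, transformInsideNestedStruct}

object SparkApp1DriverSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-projections-repro")
      .getOrCreate()
    import spark.implicits._

    val df = spark.read.json(sample.toDS)

    // Each step negates one numeric field and adds the result as a new field
    // inside the "numerics" struct; raising the Range upper bound past ~9 is
    // what reportedly hangs the optimizer.
    val result = Range(1, 10).foldLeft(df) { (acc, i) =>
      transformInsideNestedStruct(acc, s"numerics.num$i", s"num${i}_neg", c => -c)
    }

    result.explain(true)
  }
}
{code}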

{code:java}
package com.example

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructField, StructType}
import scala.collection.mutable.ListBuffer

object SparkApp1IssueSelfContained {

  // A sample data for a dataframe with nested structs
  val sample: List[String] =
""" { "numerics": {"num1": 101, "num2": 102, "num3": 103, "num4": 104, 
"num5": 105, "num6": 106, "num7": 107, "num8": 108, "num9": 109, "num10": 110, 
"num11": 111, "num12": 112, "num13": 113, "num14": 114, "num15": 115} } """ ::
""" { "numerics": {"num1": 201, "num2": 202, "num3": 203, "num4": 204, 
"num5": 205, "num6": 206, "num7": 207, "num8": 208, "num9": 209, "num10": 210, 
"num11": 211, "num12": 212, "num13": 213, "num14": 214, "num15": 215} } """ ::
""" { "numerics": {"num1": 301, "num2": 302, "num3": 303, "num4": 304, 
"num5": 305, "num6": 306, "num7": 307, "num8": 308, "num9": 309, "num10": 310, 
"num11": 311, "num12": 312, "num13": 313, "num14": 314, "num15": 315} } """ ::
Nil

  /**
* Transforms a column inside a nested struct. The transformed value will be 
put into a new field of that nested struct
*
* The output column name can omit the full path as the field will be 
created at the same level of nesting as the input column.
*
* @param inputColumnName  A column name for which to apply the 
transformation, e.g. `company.employee.firstName`.
* @param outputColumnName The output column name. The path is optional, 
e.g. you can use `transformedName` instead of 
`company.employee.transformedName`.
* @param expression   A function that applies a transformation to a 
column as a Spark expression.
* @return A dataframe with a new field that contains transformed values.
*/
  def transformInsideNestedStruct(df: DataFrame,
  inputColumnName: String,
  outputColumnName: String,
  expression: Column => Column): DataFrame = {
def mapStruct(schema: StructType, path: Seq[String], parentColumn: 
Option[Column] = None): Seq[Column] = {
  val mappedFields = new ListBuffer[Column]()

  def handleMatchedLeaf(field: StructField, curColumn: Column): Seq[Column] 
= {
val newColumn = expression(curColumn).as(outputColumnName)
mappedFields += newColumn
Seq(curColumn)
  }

  def handleMatchedNonLeaf(field: StructField, curColumn: Column): 
Seq[Column] = {
// Non-leaf columns need to be further processed recursively
field.dataType match {
  case dt: StructType => Seq(struct(mapStruct(dt, path.tail, 
Some(curColumn)): _*).as(field.name))
  case _ => throw new IllegalArgumentException(s"Field '${field.name}' 
is not a struct type.")
}
  }

  val fieldName = path.head
  val isLeaf = path.lengthCompare(2) < 0

  val newColumns = schema.fields.flatMap(field => {
// This is the original column (struct field) we want to process
val curColumn = parentColumn match {
  case None => new Column(field.name)
  case Some(col) => col.getField(field.name).as(field.name)
}

if (field.name.compareToIgnoreCase(fieldName) != 0) {
  // Copy unrelated fields as they were
  Seq(curColumn)
} else {
  // We have found a match
  if (isLeaf) {
handleMatchedLeaf(field, curColumn)
  } else {
handleMatchedNonLeaf(field, curColumn)
  }
}
  })

  newColumns ++ mappedFields
}

val schema = df.schema
val path = inputColumnName.split('.')
df.select(mapStruct(schema, path): _*)
  }

  /**
* This Spark Job demonstrates an issue of execution plan freezing when 
there are a lot of projections
* involving nested structs in an execution 

[jira] [Created] (SPARK-28090) Spark hangs when an execution plan has many projections on nested structs

2019-06-18 Thread Ruslan Yushchenko (JIRA)
Ruslan Yushchenko created SPARK-28090:
-

 Summary: Spark hangs when an execution plan has many projections 
on nested structs
 Key: SPARK-28090
 URL: https://issues.apache.org/jira/browse/SPARK-28090
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.4.3
 Environment: Tried in
 * Spark 2.2.1, Spark 2.4.3 in local mode on Linux, macOS and Windows
 * Spark 2.4.3 / Yarn on a Linux cluster
Reporter: Ruslan Yushchenko


This was already posted (#28016), but the provided example didn't always 
reproduce the error.

Spark applications freeze at the execution plan optimization stage (Catalyst) 
when a logical execution plan contains a lot of projections that operate on 
nested struct fields.

The code listed below demonstrates the issue.

To reproduce the issue, the Spark app does the following:
 * A small dataframe is created from a JSON example.
 * Several nested transformations (negation of a number) are applied to struct 
fields, and each time a new struct field is created.
 * Once more than 9 such transformations are applied, the Catalyst optimizer 
freezes while optimizing the execution plan.
 * You can control the freezing by choosing a different upper bound for the 
Range: e.g. it will work fine if the upper bound is 5, but will hang if the 
bound is 10.

{code}
package com.example

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructField, StructType}
import scala.collection.mutable.ListBuffer

object SparkApp1IssueSelfContained {

  // A sample data for a dataframe with nested structs
  val sample: List[String] =
""" { "numerics": {"num1": 101, "num2": 102, "num3": 103, "num4": 104, 
"num5": 105, "num6": 106, "num7": 107, "num8": 108, "num9": 109, "num10": 110, 
"num11": 111, "num12": 112, "num13": 113, "num14": 114, "num15": 115} } """ ::
""" { "numerics": {"num1": 201, "num2": 202, "num3": 203, "num4": 204, 
"num5": 205, "num6": 206, "num7": 207, "num8": 208, "num9": 209, "num10": 210, 
"num11": 211, "num12": 212, "num13": 213, "num14": 214, "num15": 215} } """ ::
""" { "numerics": {"num1": 301, "num2": 302, "num3": 303, "num4": 304, 
"num5": 305, "num6": 306, "num7": 307, "num8": 308, "num9": 309, "num10": 310, 
"num11": 311, "num12": 312, "num13": 313, "num14": 314, "num15": 315} } """ ::
Nil

  /**
* Transforms a column inside a nested struct. The transformed value will be 
put into a new field of that nested struct
*
* The output column name can omit the full path as the field will be 
created at the same level of nesting as the input column.
*
* @param inputColumnName  A column name for which to apply the 
transformation, e.g. `company.employee.firstName`.
* @param outputColumnName The output column name. The path is optional, 
e.g. you can use `transformedName` instead of 
`company.employee.transformedName`.
* @param expression   A function that applies a transformation to a 
column as a Spark expression.
* @return A dataframe with a new field that contains transformed values.
*/
  def transformInsideNestedStruct(df: DataFrame,
  inputColumnName: String,
  outputColumnName: String,
  expression: Column => Column): DataFrame = {
def mapStruct(schema: StructType, path: Seq[String], parentColumn: 
Option[Column] = None): Seq[Column] = {
  val mappedFields = new ListBuffer[Column]()

  def handleMatchedLeaf(field: StructField, curColumn: Column): Seq[Column] 
= {
val newColumn = expression(curColumn).as(outputColumnName)
mappedFields += newColumn
Seq(curColumn)
  }

  def handleMatchedNonLeaf(field: StructField, curColumn: Column): 
Seq[Column] = {
// Non-leaf columns need to be further processed recursively
field.dataType match {
  case dt: StructType => Seq(struct(mapStruct(dt, path.tail, 
Some(curColumn)): _*).as(field.name))
  case _ => throw new IllegalArgumentException(s"Field '${field.name}' 
is not a struct type.")
}
  }

  val fieldName = path.head
  val isLeaf = path.lengthCompare(2) < 0

  val newColumns = schema.fields.flatMap(field => {
// This is the original column (struct field) we want to process
val curColumn = parentColumn match {
  case None => new Column(field.name)
  case Some(col) => col.getField(field.name).as(field.name)
}

if (field.name.compareToIgnoreCase(fieldName) != 0) {
  // Copy unrelated fields as they were
  Seq(curColumn)
} else {
  // We have found a match
  if (isLeaf) {
handleMatchedLeaf(field, curColumn)
  } else {
handleMatchedNonLeaf(field, curColumn)
  }
}
  })

  

[jira] [Updated] (SPARK-28088) String Functions: Enhance LPAD/RPAD function

2019-06-18 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28088:

Description: Enhance the LPAD/RPAD functions to make the {{pad}} parameter optional.

> String Functions: Enhance LPAD/RPAD function
> 
>
> Key: SPARK-28088
> URL: https://issues.apache.org/jira/browse/SPARK-28088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Enhance the LPAD/RPAD functions to make the {{pad}} parameter optional.
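
For illustration, a hypothetical sketch of what the enhanced call could look 
like, assuming the pad string defaults to a single space when omitted (as in 
PostgreSQL); today Spark requires all three arguments:
{code:java}
import org.apache.spark.sql.SparkSession

object LpadOptionalPadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lpad-sketch").getOrCreate()
    // Expected once the third argument is optional:
    //   lpad('hi', 5) -> "   hi"    rpad('hi', 5) -> "hi   "
    spark.sql("SELECT lpad('hi', 5) AS l, rpad('hi', 5) AS r").show(false)
    spark.stop()
  }
}
{code}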



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org