[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414642#comment-17414642
 ] 

Apache Spark commented on SPARK-33152:
--

User 'ahshahid' has created a pull request for this issue:
https://github.com/apache/spark/pull/33983

> Constraint Propagation code causes OOM issues or increasing compilation time 
> to hours
> -
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Asif
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We encountered this issue at Workday.
> The issue is that the current constraint-propagation code pessimistically
> generates all possible permutations of a base constraint for the aliases in
> the Project node.
> This blows up the number of constraints generated, causing OOM issues at
> SQL-query compile time, or queries taking from 18 minutes to 2 hours to
> compile.
> The problematic piece of code is in LogicalPlan.getAliasedConstraints:
> projectList.foreach {
>   case a @ Alias(l: Literal, _) =>
>     allConstraints += EqualNullSafe(a.toAttribute, l)
>   case a @ Alias(e, _) =>
>     // For every alias in `projectList`, replace the reference in
>     // constraints by its attribute.
>     allConstraints ++= allConstraints.map(_ transform {
>       case expr: Expression if expr.semanticEquals(e) =>
>         a.toAttribute
>     })
>     allConstraints += EqualNullSafe(e, a.toAttribute)
>   case _ => // Don't change.
> }
> So consider a hypothetical plan:
>
> Project (a, a as a1, a as a2, a as a3, b, b as b1, b as b2, c, c as c1,
>          c as c2, c as c3)
>    |
> Filter f(a, b, c)
>    |
> Base Relation (a, b, c)
> and so we have the projection as
> a, a1, a2, a3
> b, b1, b2
> c, c1, c2, c3
> Let's say hypothetically f(a, b, c) has a occurring 1 time, b occurring 2
> times, and c occurring 3 times.
> So at the Project node the number of constraints for the single base
> constraint f(a, b, c) will be
> 4C1 * 3C2 * 4C3 = 48
> In our case, we have seen the number of constraints going up to > 3 or more,
> as there are complex case statements in the projection.
> Spark generates all these constraints pessimistically, for pruning filters or
> pushing down join predicates that it may encounter as the optimizer traverses
> up the tree.
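The growth caused by getAliasedConstraints can be reproduced with a minimal standalone sketch (a hypothetical, simplified model: a constraint is reduced to the set of attribute names it references, and plain Scala collections stand in for Catalyst expressions and `transform`). For the plan above it yields exactly the 48 constraints derived from the single base constraint f(a, b, c):

```scala
import scala.collection.mutable

// Model a constraint as the set of attributes it references. For each alias,
// every constraint mentioning the aliased attribute also gains a copy with
// that attribute replaced, mirroring the `transform` in the snippet above.
val allConstraints = mutable.Set(Set("a", "b", "c")) // base constraint f(a, b, c)
val projectList = Seq("a" -> "a1", "a" -> "a2", "a" -> "a3",
                      "b" -> "b1", "b" -> "b2",
                      "c" -> "c1", "c" -> "c2", "c" -> "c3")

projectList.foreach { case (e, alias) =>
  allConstraints ++= allConstraints.map(c =>
    if (c.contains(e)) c - e + alias else c)
}

println(allConstraints.size)  // 48
```

In this simplified model the count is the product of the per-attribute variant counts (4 * 3 * 4 = 48), matching the figure above; the real code operates on expression trees, so complex projections make the blow-up far worse.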
>  
> This issue is solved at our end by modifying the Spark code to use different
> logic.
> The idea is simple.
> Instead of pessimistically generating all possible combinations of a base
> constraint, just store the original base constraints & track the aliases at
> each level.
> The principle followed is this:
> 1) Store the base constraint and keep track of the aliases for the
> underlying attribute.
> 2) If the base attribute composing the constraint is not in the output set,
> see if the constraint survives by substituting the attribute being removed
> with the next available alias's attribute.
>  
> For checking whether a filter can be pruned, just canonicalize the filter
> using the attribute at the 0th position of each tracking list & compare it
> with the underlying base constraint.
> To elaborate using the plan above:
> At the Project node we have the constraint f(a, b, c).
> We keep track of the aliases:
> List 1: a, a1.attribute, a2.attribute, a3.attribute
> List 2: b, b1.attribute, b2.attribute
> List 3: c, c1.attribute, c2.attribute, c3.attribute
> Let's say above the Project node we encounter a filter
> f(a1, b2, c3)
> So canonicalize the filter by using the above list data to convert it to
> f(a, b, c) & compare it with the stored base constraints.
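The canonicalization step can be sketched as follows (a hypothetical, simplified model: constraints and filters are just sequences of attribute names, and the tracking lists are plain Seqs; the actual change would operate on Catalyst attributes):

```scala
// Each tracking list holds a base attribute at position 0, followed by its
// aliases at this level of the plan.
val trackingLists = Seq(
  Seq("a", "a1", "a2", "a3"),
  Seq("b", "b1", "b2"),
  Seq("c", "c1", "c2", "c3"))

// Map every alias (and the base attribute itself) back to the attribute at
// the 0th position of its tracking list.
val toBase: Map[String, String] =
  trackingLists.flatMap(list => list.map(_ -> list.head)).toMap

def canonicalize(filter: Seq[String]): Seq[String] =
  filter.map(attr => toBase.getOrElse(attr, attr))

val baseConstraint = Seq("a", "b", "c")     // the stored f(a, b, c)
val incomingFilter = Seq("a1", "b2", "c3")  // the filter f(a1, b2, c3)

// The filter is prunable iff its canonical form matches a stored constraint.
println(canonicalize(incomingFilter) == baseConstraint)  // true
```

Only one constraint per base constraint is ever stored; the aliases are folded away at lookup time instead of being multiplied out at propagation time.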
>  
> For predicate push down, instead of generating all the redundant
> combinations of constraints, just generate one constraint per element of the
> alias list.
> In the current Spark code, in any case, filter push down happens only for 1
> variable at a time.
> So just expanding the filter f(a, b, c) to
> f(a, b, c), f(a1, b, c), f(a2, b, c), f(a3, b, c), f(a, b1, c), f(a, b2, c),
> f(a, b, c1), f(a, b, c2), f(a, b, c3)
> would suffice, rather than generating all the redundant combinations.
> In fact, the code can easily be modified to generate only those constraints
> which involve the variables forming the join condition, so the number of
> constraints generated on expand is further reduced.
> We already have code to generate compound filters for push down (join on
> multiple conditions), which can be used for single-variable-condition push
> down too.
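The one-variable-at-a-time expansion above can be sketched like this (again a simplified model on attribute names; the names and structure are illustrative only):

```scala
// Aliases available for each base attribute of the constraint.
val aliases = Map(
  "a" -> Seq("a1", "a2", "a3"),
  "b" -> Seq("b1", "b2"),
  "c" -> Seq("c1", "c2", "c3"))

val base = Seq("a", "b", "c")  // the base constraint f(a, b, c)

// Substitute one alias at a single position at a time, keeping every other
// attribute at its base name; prepend the base constraint itself.
val expanded: Seq[Seq[String]] =
  base +: base.indices.flatMap { i =>
    aliases(base(i)).map(alias => base.updated(i, alias))
  }

expanded.foreach(c => println(c.mkString("f(", ", ", ")")))
```

This yields 1 + 3 + 2 + 3 = 9 constraints, the same nine listed above, instead of the 48 produced by the all-combinations approach.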
> Just to elaborate the logic further, consider the above hypothetical plan
> (assume the collapse-project rule is not there):
>
> Project (a1, a1 as a4, b, c1, c1 as c4)
>    |
> Project (a, a as a1, a as a2, a as a3, b, b as b1, b as b2, c, c as c1,
>          c as c2, c as c3)
>    |
> Filter f(a, b, c)
>    |

[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-12-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253732#comment-17253732
 ] 

Apache Spark commented on SPARK-33152:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30894


[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-12-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253731#comment-17253731
 ] 

Apache Spark commented on SPARK-33152:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30894


[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-10-29 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223387#comment-17223387
 ] 

Asif commented on SPARK-33152:
--

PR opened:

[pr-33152|https://github.com/apache/spark/pull/30185]


[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-10-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222946#comment-17222946
 ] 

Apache Spark commented on SPARK-33152:
--

User 'ahshahid' has created a pull request for this issue:
https://github.com/apache/spark/pull/30185


[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-10-16 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215645#comment-17215645
 ] 

Asif commented on SPARK-33152:
--

[~yumwang], Hi. I need some help.

I started the test run via dev/run-tests.

The run exited after running for 6+ hrs, with the following tail-end logging (ANSI escape sequences decoded for readability):

...ble `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
[info] Test org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.saveExternalTableWithSchemaAndQueryIt started
13:16:18.288 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
13:16:18.705 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`externaltable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
[info] Test run finished: 0 failed, 0 ignored, 3 total, 6.17s
[info] ScalaTest
[info] Run completed in 5 hours, 59 minutes, 29 seconds.
[info] Total number of tests run: 3250
[info] Suites: completed 97, aborted 1
[info] Tests: succeeded 3250, failed 0, canceled 0, ignored 596, pending 0
[info] *** 1 SUITE ABORTED ***
[error] Error: Total 3256, Failed 0, Errors 1, Passed 3255, Ignored 596
[error] Error during tests:
[error]   org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
[error] (hive-thriftserver/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] (sql-kafka-0-10/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] (yarn/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] (core/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] (hive/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] (streaming/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] (sql/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 21853 s, completed Oct 16, 2020 1:17:17 PM
[error] running /Users/asif.shahid/workspace/code/stock-spark/spark/build/sbt -Phadoop-2.6 -Pkafka-0-8 -Phive -Pkinesis-asl -Phive-thriftserver -Pmesos -Pspark-ganglia-lgpl -Pyarn -Pkubernetes -Pflume test ; received return code 1

I am not sure whether the failure has anything to do with my changes.

Any pointers on how to proceed further, how to debug the issue, or which logs to check?

 


[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-10-15 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214927#comment-17214927
 ] 

Asif commented on SPARK-33152:
--

sure. working on the PR

> Constraint Propagation code causes OOM issues or increasing compilation time 
> to hours
> -
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Asif
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We encountered this issue at Workday. 
> The issue is that current Constraints Propagation code pessimistically 
> generates all the possible permutations of base constraint for the aliases in 
> the project node.
> This causes blow up of the number of constraints generated causing OOM issues 
> at compile time of sql query, or queries taking 18 min to 2 hrs to compile.
> The problematic piece of code is in LogicalPlan.getAliasedConstraints
> projectList.foreach {
>  case a @ Alias(l: Literal, _) =>
>  allConstraints += EqualNullSafe(a.toAttribute, l)
>  case a @ Alias(e, _) =>
>  // For every alias in `projectList`,replace the reference in
>  // constraints by its attribute.
>  allConstraints ++= allConstraints.map(_ transform {
>  case expr: Expression if expr.semanticEquals(e) =>
>  a.toAttribute
>  })
>  allConstraints += EqualNullSafe(e, a.toAttribute)
>  case _ => // Don't change.
>  }
> so consider a hypothetical plan
>  
> Project (a, a as a1, a. as a2, a as a3, b, b as b1, b as b2, c, c as c1, c as 
> c2 , c as c3)
>    |
> Filter f(a, b, c)
> |
> Base Relation (a, b, c)
> and so we have projection as
> a, a1, a2, a3
> b, b1, b2
> c, c1, c2, c3
> Let's say hypothetically f(a, b, c) has a occurring 1 time, b occurring 2
> times, and c occurring 3 times.
> So at the project node the number of constraints for a single base
> constraint f(a, b, c) will be
> 4C1 * 3C2 * 4C3 = 48
> In our case, we have seen the number of constraints go up to > 3 or more,
> as there are complex case statements in the projection.
> Spark generates all these constraints pessimistically for pruning filters
> or pushing down predicates for joins that the optimizer may encounter as it
> traverses up the tree.
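As a rough illustration of the blow-up, here is a hypothetical back-of-the-envelope helper (not Spark code; the name `variantCount` is made up): choosing one equivalent column per referenced attribute yields the product of the alias-list sizes.

```scala
// Hypothetical helper (not Spark code): for one base constraint, the
// current approach yields one variant per choice of equivalent column for
// each referenced attribute, i.e. the product of the alias-list sizes.
def variantCount(equivalentColumns: Seq[Int]): Long =
  equivalentColumns.map(_.toLong).product

// Plan above: a has 4 equivalent columns, b has 3, c has 4.
println(variantCount(Seq(4, 3, 4))) // prints 48
```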
>  
> This issue is solved at our end by modifying the Spark code to use
> different logic.
> The idea is simple.
> Instead of pessimistically generating all possible combinations of a base
> constraint, just store the original base constraints & track the aliases at
> each level.
> The principle followed is this:
> 1) Store the base constraint and keep track of the aliases for the
> underlying attribute.
> 2) If the base attribute composing the constraint is not in the output set,
> see if the constraint survives by substituting the removed attribute with
> the next available alias's attribute.
>  
> For checking whether a filter can be pruned, just canonicalize the filter
> using the attribute at the 0th position of the tracking list & compare it
> with the underlying base constraint.
> To elaborate using the plan above:
> At the project node we have the constraint f(a, b, c) and we keep track of
> the aliases:
> List 1: a, a1.attribute, a2.attribute, a3.attribute
> List 2: b, b1.attribute, b2.attribute
> List 3: c, c1.attribute, c2.attribute, c3.attribute
> Let's say above the project node we encounter a filter
> f(a1, b2, c3)
> So canonicalize the filter using the above lists, converting it to
> f(a, b, c), & compare it with the stored base constraints.
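A minimal, Spark-agnostic sketch of this pruning check (all names hypothetical; constraints modeled as plain strings rather than Catalyst expressions): rewrite every alias to the attribute at position 0 of its tracking list, then compare against the stored base constraint.

```scala
// Sketch only (hypothetical names): each tracking list starts with the
// base attribute, followed by its aliases.
val tracking: Seq[List[String]] = Seq(
  List("a", "a1", "a2", "a3"),
  List("b", "b1", "b2"),
  List("c", "c1", "c2", "c3"))

// Map every alias back to the attribute at the 0th position of its list.
val toBase: Map[String, String] =
  tracking.flatMap(names => names.map(_ -> names.head)).toMap

// Canonicalize a filter by rewriting each identifier, then compare the
// result with the stored base constraint.
def canonicalize(filter: String): String =
  raw"[A-Za-z]\w*".r.replaceAllIn(filter,
    m => toBase.getOrElse(m.matched, m.matched))

val baseConstraint = "f(a,b,c)"
println(canonicalize("f(a1,b2,c3)") == baseConstraint) // prints true
```

Note the check is a single rewrite-and-compare per stored base constraint, instead of a membership test against the full expanded constraint set.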
>  
> For predicate push down, instead of generating all the redundant
> combinations of constraints, just generate one constraint per element of
> the alias lists.
> In the current Spark code, in any case, filter push down happens only for
> 1 variable at a time.
> So just expanding the filter f(a, b, c) to
> f(a, b, c), f(a1, b, c), f(a2, b, c), f(a3, b, c), f(a, b1, c), f(a, b2, c),
> f(a, b, c1), f(a, b, c2), f(a, b, c3)
> would suffice, rather than generating all the redundant combinations.
> In fact the code can easily be modified to generate only those constraints
> which involve the variables forming the join condition, so the number of
> constraints generated on expansion is further reduced.
> We already have code to generate compound filters for push down (join on
> multiple conditions), which can be used for single-variable-condition push
> down too.
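Sketched with the same hypothetical string model (names are made up, not Spark APIs), single-variable expansion substitutes one alias at a time while the other attributes stay at their base names:

```scala
// Sketch (hypothetical names): expand a base constraint one variable at a
// time -- for each tracking list, substitute each alias for the base
// attribute, leaving the other attributes untouched.
def expandForPushdown(base: String, tracking: Seq[List[String]]): Seq[String] =
  base +: tracking.flatMap { names =>
    names.tail.map(alias =>
      base.replaceAll(raw"\b" + names.head + raw"\b", alias))
  }

val expanded = expandForPushdown(
  "f(a, b, c)",
  Seq(List("a", "a1", "a2", "a3"), List("b", "b1", "b2"),
      List("c", "c1", "c2", "c3")))
// Base constraint plus 3 + 2 + 3 single-alias variants.
println(expanded.size) // prints 9
```

Nine constraints versus the 48 of the combinatorial expansion, and the count grows additively rather than multiplicatively with the number of aliases.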
> Just to elaborate the logic further, consider the above hypothetical plan
> (assuming the collapse-project rule is not present):
>  
> Project (a1, a1 as a4, b, c1, c1 as c4)
>   |
> Project (a, a as a1, a as a2, a as a3, b, b as b1, b as b2, c, c as c1,
> c as c2, c as c3)
>    |
> Filter f(a, b, c)
> |
> Base Relation (a, b, c)
>  
> So at project 

[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-10-15 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214761#comment-17214761
 ] 

Yuming Wang commented on SPARK-33152:
-

OK, you can still submit a PR if you have a better solution.


[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-10-15 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214738#comment-17214738
 ] 

Asif commented on SPARK-33152:
--

Possible, as the EliminateOuterJoin rule does reference the constraints & the
query in question in that bug has a high number of projections with aliases.

So one of the tickets can be closed.

If SPARK-29606 is to be kept open, then the bug analysis & summary should be
changed (maybe taken from this ticket), as the issue in SPARK-29606 is not
tied just to outer join elimination but can crop up even in queries without
joins, etc.


[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-10-15 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214706#comment-17214706
 ] 

Yuming Wang commented on SPARK-33152:
-

Is it a duplicate of SPARK-29606?


[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2020-10-14 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214298#comment-17214298
 ] 

Asif commented on SPARK-33152:
--

I will be generating a PR for the same.
