[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/1428

[SPARK-2509][SQL] Add optimization for Substring.

`Substring` including `null` literal cases could be added to 
`NullPropagation`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-2509

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1428


commit d9eb85f902c5498f9b7b6c6035471f622dd7d629
Author: Takuya UESHIN 
Date:   2014-07-16T03:31:45Z

Add Substring cases to NullPropagation.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49120808
  
QA tests have started for PR 1428. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16705/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1428#discussion_r14980614
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,6 +171,9 @@ object NullPropagation extends Rule[LogicalPlan] {
   case Literal(candidate, _) if candidate == v => true
   case _ => false
 })) => Literal(true, BooleanType)
+  case e @ Substring(Literal(null, _), _, _) => Literal(null, 
e.dataType)
--- End diff --

Is this not doable with constant folding? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1428#discussion_r14980864
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,6 +171,9 @@ object NullPropagation extends Rule[LogicalPlan] {
   case Literal(candidate, _) if candidate == v => true
   case _ => false
 })) => Literal(true, BooleanType)
+  case e @ Substring(Literal(null, _), _, _) => Literal(null, 
e.dataType)
--- End diff --

Now I think about it more, this is probably ok to leave it here since in 
constant folding we don't really special case for null literals. However, I 
looked into other rules in NullPropagation and a lot of them don't really 
belong there... (has nothing to do with your PR though)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49125284
  
QA results for PR 1428:- This patch PASSES unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16705/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49125317
  
Thanks. Merging this in master & branch-1.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1428#discussion_r14981973
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -171,6 +171,9 @@ object NullPropagation extends Rule[LogicalPlan] {
   case Literal(candidate, _) if candidate == v => true
   case _ => false
 })) => Literal(true, BooleanType)
+  case e @ Substring(Literal(null, _), _, _) => Literal(null, 
e.dataType)
--- End diff --

@rxin, thank you for your comment.
The rule `ConstantFolding` can be applied only if the expression is 
foldable, i.e. ALL of the child expressions are foldable, and the 
`NullPropagation` covers the case there exists at least one `null` literal 
regardless of foldabilities of the others.
So I think the cases are needed.

BTW, `Substring` can be foldable, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49125413
  
@rxin Ah, wait for a moment.
What do you think about foldability of the `Substring` I mentioned above?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1428


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49125499
  
I'll open another issue about foldability of `Substring`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49125623
  
I've already merged this. I think we should re-evaluate in the future 
whether we should push some stuff into foldable itself, because as I see it a 
lot of the pattern matching rules in NullPropagation could potentially be 
implemented by each expression itself to be foldable. I think that might also 
lead to higher planner performance (although that's probably less important). 
More importantly, it does feel more natural to me that this stuff goes into 
each expression, rather than having a rule for each NullPropagation. 

cc @marmbrus


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-15 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49125889
  
@rxin I agree that we need the better way for `NullPropagation ` instead of 
too much pattern matching.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49188774
  
You are right that this rule does more than null propagation now.  I'm not 
sure what a better name would be.  `DegenerateExpressionSimplification`?

Regarding moving null propagation into the expressions, you could do it... 
but what would it look like?  You specify which of the children make the entire 
expression null if they are null?  Seems like a lot of refactoring for little 
benefit...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49205341
  
It doesn't bring much benefit right now, but what we are doing here is 
creating patterns in NullPropagation to specify the semantics of each 
individual expression ... not very scalable in maintaining this code base. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49210382
  
I'm not sure I agree with that.  This is a pretty niche optimization not 
something fundamental about the expressions that is required for correct 
evaluation (and the first version of this code had a bunch of mistakes).  
Having all of these rules in one place made it easier for me to realize that 
when doing the review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49210870
  
Well with that argument we shouldn't have any of the constant folding logic 
in expressions themselves either.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49211681
  
That is exactly the argument I made when the folding logic was added. :) I 
suggested that we add `deterministic` instead and then have a rule that folds 
things that are `deterministic` and have no `references`.  `deterministic` I 
would argue is something fundamental about the expression itself, unlike these 
optimizations we are making.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49211732
  
I would be supportive of changing it to match my original proposal...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---