[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-10 Thread mgaido91
Github user mgaido91 closed the pull request at:

https://github.com/apache/spark/pull/20407


---




[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167392901
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -156,6 +156,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val ALLOW_NESTEDJOIN_FALLBACK = buildConf("spark.sql.join.broadcastJoinFallback.enabled")
+    .internal()
+    .doc("When true (default), if the other options are not available, fallback to try and use " +
+      "BroadcastNestedLoopJoin as join strategy. This can cause OOM which can be a problem " +
+      "in some scenarios, eg. when running the thriftserver. Turn to false to disable it: an " +
+      "AnalysisException will be thrown.")
--- End diff --

`AUTO_BROADCASTJOIN_THRESHOLD` is not a threshold for avoiding OOM; it is normally used for performance tuning.

`BroadcastNestedLoopJoin` is used in most of the cases where we do not have equi-join keys, so disabling this join algorithm would make Spark SQL unable to handle many join cases. I do not think it makes sense to add this conf.
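
To illustrate, here is a minimal, self-contained sketch (the object name, local master, and sample data are mine, not from this PR): a join with only a non-equality predicate has no equi-join keys, so the hash and sort-merge strategies do not apply and the physical plan typically ends up as `BroadcastNestedLoopJoin`.

```scala
import org.apache.spark.sql.SparkSession

object NonEquiJoinPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("bnlj-demo").getOrCreate()
    import spark.implicits._

    val left = Seq((1, 10), (2, 20)).toDF("id", "a")
    val right = Seq((1, 5), (3, 30)).toDF("id", "b")

    // No equality predicate, hence no equi-join keys for hash or sort-merge joins.
    val joined = left.join(right, left("a") > right("b"), "left_outer")
    joined.explain() // the physical plan typically shows BroadcastNestedLoopJoin

    spark.stop()
  }
}
```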


---




[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-10 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167392674
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -156,6 +156,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val ALLOW_NESTEDJOIN_FALLBACK = buildConf("spark.sql.join.broadcastJoinFallback.enabled")
+    .internal()
+    .doc("When true (default), if the other options are not available, fallback to try and use " +
+      "BroadcastNestedLoopJoin as join strategy. This can cause OOM which can be a problem " +
+      "in some scenarios, eg. when running the thriftserver. Turn to false to disable it: an " +
+      "AnalysisException will be thrown.")
--- End diff --

I see, and in general I agree with you. In this case, though, `BroadcastNestedLoopJoin` is used as a fallback even when we are over the threshold for broadcast joins, so it is very likely to throw an OOM.

The rationale of the current behavior seems to be: there is no other working join implementation, so use this one even though an OOM is likely; if the OOM does not occur, we were lucky and the query ran, otherwise it fails with an OOM instead of an `AnalysisException` (so the query cannot run either way). That is fine for a normal Spark application, but for long-running ones like STS an OOM is much worse than an `AnalysisException`.

I know this is not the solution to every OOM problem, but the reason for this PR is that, when this rule was introduced, the choice seems to have been made "forgetting" use cases like STS. Do you agree?
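
For context, a hedged sketch of how the proposed flag would be used from a long-running application (the key and the `AnalysisException` behavior come from the diff above; this configuration exists only in this PR, not in released Spark, and an existing `SparkSession` named `spark` is assumed):

```scala
// Disable the BroadcastNestedLoopJoin fallback proposed in this PR
// (hypothetical: the key exists only in this PR's diff).
spark.conf.set("spark.sql.join.broadcastJoinFallback.enabled", "false")

// A query whose only remaining join strategy is the fallback would then fail
// at planning time with an AnalysisException, instead of attempting a
// broadcast that may OOM the shared driver of a long-running service like STS.
```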


---




[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167392509
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -156,6 +156,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val ALLOW_NESTEDJOIN_FALLBACK = buildConf("spark.sql.join.broadcastJoinFallback.enabled")
+    .internal()
+    .doc("When true (default), if the other options are not available, fallback to try and use " +
+      "BroadcastNestedLoopJoin as join strategy. This can cause OOM which can be a problem " +
+      "in some scenarios, eg. when running the thriftserver. Turn to false to disable it: an " +
+      "AnalysisException will be thrown.")
--- End diff --

OOM can be triggered in many different scenarios, and `BroadcastNestedLoopJoin` is widely used. If this is a problem in your specific scenario, you can make the change in your own fork.

Also, this PR only helps a very specific case; I do not think it resolves the general issue.


---




[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-09 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167392027
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -156,6 +156,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val ALLOW_NESTEDJOIN_FALLBACK = buildConf("spark.sql.join.broadcastJoinFallback.enabled")
+    .internal()
+    .doc("When true (default), if the other options are not available, fallback to try and use " +
+      "BroadcastNestedLoopJoin as join strategy. This can cause OOM which can be a problem " +
+      "in some scenarios, eg. when running the thriftserver. Turn to false to disable it: an " +
+      "AnalysisException will be thrown.")
--- End diff --

Why? Could you please explain what the right direction would be, then?


---




[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167378911
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -156,6 +156,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val ALLOW_NESTEDJOIN_FALLBACK = buildConf("spark.sql.join.broadcastJoinFallback.enabled")
+    .internal()
+    .doc("When true (default), if the other options are not available, fallback to try and use " +
+      "BroadcastNestedLoopJoin as join strategy. This can cause OOM which can be a problem " +
+      "in some scenarios, eg. when running the thriftserver. Turn to false to disable it: an " +
+      "AnalysisException will be thrown.")
--- End diff --

I do not think this can resolve the issue; the fix is in the wrong direction.


---




[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-09 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167352400
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -156,6 +156,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val ALLOW_NESTEDJOIN_FALLBACK = buildConf("spark.sql.join.broadcastJoinFallback.enabled")
--- End diff --

nit: the key can be `spark.sql.join.broadcastNestedLoopJoin.enabled`


---




[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-09 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167351883
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -262,6 +262,10 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
         joins.CartesianProductExec(planLater(left), planLater(right), condition) :: Nil
 
       case logical.Join(left, right, joinType, condition) =>
+        if (!SQLConf.get.allowNestedJoinFallback) {
+          throw new AnalysisException("The only JOIN strategy available for this plan is " +
+            s"BroadcastNestedLoopJoin, but `${SQLConf.ALLOW_NESTEDJOIN_FALLBACK}` is `false`.")
--- End diff --

nit: `SQLConf.ALLOW_NESTEDJOIN_FALLBACK.key`
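
Applied to the snippet above, the suggested interpolation would look roughly like this; `.key` yields the configuration key string, whereas interpolating the `ConfigEntry` itself would print its full `toString` representation:

```scala
throw new AnalysisException("The only JOIN strategy available for this plan is " +
  s"BroadcastNestedLoopJoin, but `${SQLConf.ALLOW_NESTEDJOIN_FALLBACK.key}` is `false`.")
```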


---




[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-01-26 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20407

[SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJoin fallback

## What changes were proposed in this pull request?

In JoinStrategies, if no better option is available, the planner currently falls back to BroadcastNestedLoopJoin. This strategy can be very problematic, since it can cause OOM. While generally this is not a big problem, in applications like the Thriftserver it is, because a failing job can put the whole application into a bad state. In these cases it can therefore be useful to disable this fallback, so that only the jobs which would trigger it fail.

## How was this patch tested?

Added a unit test.
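
A purely hypothetical sketch of what such a unit test could look like (the suite mix-ins, test name, and data are my assumptions; `withSQLConf`, `testImplicits`, and `intercept` are helpers commonly available in Spark's SQL test suites):

```scala
test("non-equi join fails fast when the BroadcastNestedLoopJoin fallback is disabled") {
  import testImplicits._
  withSQLConf(
      "spark.sql.autoBroadcastJoinThreshold" -> "-1",
      "spark.sql.join.broadcastJoinFallback.enabled" -> "false") {
    val df1 = Seq(1, 2, 3).toDF("a")
    val df2 = Seq(2, 3, 4).toDF("b")
    // With broadcasts disabled and no equi-join keys, the only remaining
    // strategy is the BroadcastNestedLoopJoin fallback, which the new flag
    // turns into an AnalysisException at planning time.
    val e = intercept[AnalysisException] {
      df1.join(df2, df1("a") > df2("b"), "left_outer").collect()
    }
    assert(e.getMessage.contains("BroadcastNestedLoopJoin"))
  }
}
```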


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23124

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20407.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20407


commit 074c34245d300901390d2d5ed74bb69e32539b8a
Author: Marco Gaido 
Date:   2018-01-26T12:54:29Z

[SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJoin fallback




---
