[jira] [Updated] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans
[ https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-20939:
---------------------------------
    Labels: bulk-closed logical_plan optimizer  (was: logical_plan optimizer)

> Do not duplicate user-defined functions while optimizing logical query plans
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-20939
>                 URL: https://issues.apache.org/jira/browse/SPARK-20939
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 2.1.0
>            Reporter: Lovasoa
>            Priority: Minor
>              Labels: bulk-closed, logical_plan, optimizer
>
> Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that
> {code:title=LogicalPlan}
> Join Inner, (a = b)
> +- Filter UDF(a)
>    +- Relation A
> +- Relation B
> {code}
> becomes
> {code:title=Optimized LogicalPlan}
> Join Inner, (a = b)
> +- Filter UDF(a)
>    +- Relation A
> +- Filter UDF(b)
>    +- Relation B
> {code}
> In general, it is a good thing to push down filters, as it reduces the number of records that go through the join.
> However, when the filter is a user-defined function (UDF), we cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
> So I think the optimizer should not duplicate the user-defined function in the query plan tree. The user will still be able to duplicate the function if they want to.
> See this question on Stack Overflow:
> https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
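The cost trade-off described in the issue can be made concrete with a toy model (illustrative only, not Spark's actual cost model; all parameter names and numbers below are invented for the sketch). Duplicating the filter onto relation B costs |B| extra UDF calls but shrinks one side of the join, so which plan wins depends on the UDF's cost and selectivity:

```python
# Toy cost model comparing the two plans for Join(A, B) on a = b with an
# expensive UDF filter on A's join key. Not Spark's cost model; for intuition.
#   Plan 1: filter only A, join against all of B.
#   Plan 2: pushed-down copy of the filter on B as well (what the optimizer does).
def plan_costs(n_a, n_b, udf_cost, selectivity, join_cost_per_pair):
    # Plan 1: UDF evaluated once per row of A; B enters the join unfiltered.
    cost1 = n_a * udf_cost + (selectivity * n_a) * n_b * join_cost_per_pair
    # Plan 2: UDF also evaluated on every row of B; both join inputs shrink.
    cost2 = (n_a + n_b) * udf_cost \
        + (selectivity * n_a) * (selectivity * n_b) * join_cost_per_pair
    return cost1, cost2

# Cheap, selective UDF: the duplicated filter wins.
c1, c2 = plan_costs(n_a=1000, n_b=1000, udf_cost=1.0,
                    selectivity=0.1, join_cost_per_pair=1.0)
assert c2 < c1

# Expensive, unselective UDF: the duplicated filter loses.
c1, c2 = plan_costs(n_a=1000, n_b=1000, udf_cost=500.0,
                    selectivity=0.9, join_cost_per_pair=0.01)
assert c2 > c1
```

This is exactly the reporter's point: without knowing `udf_cost` and `selectivity`, the optimizer cannot tell which side of the inequality applies, so unconditionally duplicating the UDF is a gamble.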
[jira] [Updated] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans
[ https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-20939:
-------------------------------------
    Issue Type: Improvement  (was: Bug)
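The Stack Overflow question linked in the issue asks how to turn this optimization off. On Spark 2.1.0 (the affected version) there is no rule-level switch, but Spark 2.4+ added `spark.sql.optimizer.excludedRules` (SPARK-24802). A hedged sketch, assuming the duplicated filter is produced by the `InferFiltersFromConstraints` rule and that this rule is excludable in your Spark version:

```python
# Sketch only, not a verified fix for 2.1.x: on Spark 2.4+ the optimizer rule
# that infers the extra filter from the join constraint (a = b) can be excluded.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("no-udf-duplication")  # hypothetical app name
    .config(
        "spark.sql.optimizer.excludedRules",
        "org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints",
    )
    .getOrCreate()
)
```

Note that excluding the rule disables all constraint-based filter inference for the session, not just the UDF case, so the cost trade-off described in the issue cuts both ways.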
[jira] [Updated] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans
[ https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lovasoa updated SPARK-20939:
----------------------------
    Description: 
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that
{code:title=LogicalPlan}
Join Inner, (a = b)
+- Filter UDF(a)
   +- Relation A
+- Relation B
{code}
becomes
{code:title=Optimized LogicalPlan}
Join Inner, (a = b)
+- Filter UDF(a)
   +- Relation A
+- Filter UDF(b)
   +- Relation B
{code}
In general, it is a good thing to push down filters, as it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), we cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
So I think the optimizer should not duplicate the user-defined function in the query plan tree. The user will still be able to duplicate the function if they want to.
See this question on Stack Overflow:
https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark

  was:
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that
{code:title=LogicalPlan}
Filter UDF(a)
+- Join Inner, (a = b)
   +- Relation
   +- Relation
{code}
becomes
{code:title=Optimized LogicalPlan}
Join Inner, (a = b)
+- Filter UDF(a)
   +- Relation
+- Filter UDF(b)
   +- Relation
{code}
In general, it is a good thing to push down filters, as it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), we cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
So I think the optimizer should not duplicate the user-defined function in the query plan tree. The user will still be able to duplicate the function if they want to.
See this question on Stack Overflow:
https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark
[jira] [Updated] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans
[ https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lovasoa updated SPARK-20939:
----------------------------
    Description: 
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that
{code:title=LogicalPlan}
Filter UDF(a)
+- Join Inner, (a = b)
   +- Relation
   +- Relation
{code}
becomes
{code:title=Optimized LogicalPlan}
Join Inner, (a = b)
+- Filter UDF(a)
   +- Relation
+- Filter UDF(b)
   +- Relation
{code}
In general, it is a good thing to push down filters, as it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), we cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
So I think the optimizer should not duplicate the user-defined function in the query plan tree. The user will still be able to duplicate the function if they want to.
See this question on Stack Overflow:
https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark

  was:
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that
{{
Filter UDF(a)
+- Join Inner, (a = b)
   +- Relation
   +- Relation
}}
becomes
{{
Join Inner, (a = b)
+- Filter UDF(a)
   +- Relation
+- Filter UDF(b)
   +- Relation
}}
In general, it is a good thing to push down filters, as it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), we cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
So I think the optimizer should not duplicate the user-defined function in the query plan tree. The user will still be able to duplicate the function if they want to.
See this question on Stack Overflow:
https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark
[jira] [Updated] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans
[ https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lovasoa updated SPARK-20939:
----------------------------
    Description: 
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that
{{
Filter UDF(a)
+- Join Inner, (a = b)
   +- Relation
   +- Relation
}}
becomes
{{
Join Inner, (a = b)
+- Filter UDF(a)
   +- Relation
+- Filter UDF(b)
   +- Relation
}}
In general, it is a good thing to push down filters, as it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), we cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
So I think the optimizer should not duplicate the user-defined function in the query plan tree. The user will still be able to duplicate the function if they want to.
See this question on Stack Overflow:
https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark

  was:
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that
Filter UDF(a)
+- Join Inner, (a = b)
   +- Relation
   +- Relation
becomes
Join Inner, (a = b)
+- Filter UDF(a)
   +- Relation
+- Filter UDF(b)
   +- Relation
In general, it is a good thing to push down filters, as it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), we cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
So I think the optimizer should not duplicate the user-defined function in the query plan tree. The user will still be able to duplicate the function if they want to.
See this question on Stack Overflow:
https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark