[ https://issues.apache.org/jira/browse/SPARK-25340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Takeshi Yamamuro updated SPARK-25340:
-------------------------------------

Description:

If the computations in a Project are heavy (e.g., UDFs), it is useful to push Sample nodes down beneath deterministic Projects:

{code}
scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true)

// without this proposal
== Analyzed Logical Plan ==
(id + 3): bigint
Sample 0.0, 0.5, false, 3370873312340343855
+- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L]
   +- Range (0, 10, step=1, splits=Some(4))

== Optimized Logical Plan ==
Sample 0.0, 0.5, false, 3370873312340343855
+- Project [(id#0L + 3) AS (id + 3)#2L]
   +- Range (0, 10, step=1, splits=Some(4))

// with this proposal
== Optimized Logical Plan ==
Project [(id#0L + 3) AS (id + 3)#2L]
+- Sample 0.0, 0.5, false, -6519017078291024113
   +- Range (0, 10, step=1, splits=Some(4))
{code}

POC: https://github.com/apache/spark/compare/master...maropu:SamplePushdown

> Pushes down Sample beneath deterministic Project
> ------------------------------------------------
>
>                 Key: SPARK-25340
>                 URL: https://issues.apache.org/jira/browse/SPARK-25340
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Takeshi Yamamuro
>            Priority: Minor

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
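The intuition behind the proposal can be sketched outside Spark: when the projection is deterministic, sampling first and projecting only the surviving rows produces the same output as projecting everything and then sampling, while invoking the heavy function far fewer times. Below is a minimal Python simulation of this equivalence (all names are hypothetical; Spark's actual rule would be a Catalyst optimizer rewrite, as in the POC branch). Note that the sampling decision here depends only on row position, mirroring a Bernoulli `Sample` whose per-row coin flips are independent of the projected values.

```python
import random

calls = 0  # counts invocations of the heavy projection

def expensive_udf(x):
    """Stand-in for a heavy deterministic projection (e.g., a UDF)."""
    global calls
    calls += 1
    return x + 3

def bernoulli_sample(rows, fraction, seed):
    """Per-row Bernoulli sampling with a seeded RNG (position-based)."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

data = list(range(10))

# Without pushdown: project every row, then sample the projected rows.
calls = 0
without_pushdown = bernoulli_sample([expensive_udf(r) for r in data], 0.5, seed=42)
calls_without = calls

# With pushdown: sample first, project only the rows that survive.
calls = 0
with_pushdown = [expensive_udf(r) for r in bernoulli_sample(data, 0.5, seed=42)]
calls_with = calls

# Deterministic projection + identical per-position sampling decisions
# => identical output, but the UDF runs only on sampled rows.
assert without_pushdown == with_pushdown
assert calls_with == len(with_pushdown)
assert calls_with <= calls_without
```

One caveat visible in the plans above: the pushed-down Sample gets a fresh seed (3370873312340343855 becomes -6519017078291024113), so the real optimization preserves the sampling distribution rather than the exact set of sampled rows; the simulation fixes the seed on both sides only to make the equivalence checkable.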