[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10371:
----------------------------------
    Priority: Critical  (was: Major)

> Optimize sequential projections
> -------------------------------
>
>                 Key: SPARK-10371
>                 URL: https://issues.apache.org/jira/browse/SPARK-10371
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, SQL
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>            Assignee: Nong Li
>            Priority: Critical
>
> In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, a pipeline might produce a DataFrame with columns a, b, c, d, where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize both c and d, udf_b and udf_c are triggered twice, i.e., the value of c is not re-used. It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 2)
> output.explain(true)
>
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
> Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
> LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
>
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
> Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
> LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
>
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
> LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
>
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
> Scan PhysicalRDD[id#252L]
>
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}
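For illustration only (not part of the ticket): a minimal, self-contained Scala sketch of the a/b/c/d pattern described above. The UDF names and bodies (udfB, udfC, udfD) are made up stand-ins for expensive functions; only the chaining pattern comes from the issue description.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

object SequentialProjectionsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-10371-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Stand-ins for the expensive UDFs named in the description; the bodies are arbitrary.
    val udfB = udf((a: Long) => a + 1)   // b = udf_b(a)
    val udfC = udf((b: Long) => b * 2)   // c = udf_c(b)
    val udfD = udf((c: Long) => c - 3)   // d = udf_d(c)

    val input = sqlContext.range(10).withColumnRenamed("id", "a")
    val output = input
      .withColumn("b", udfB(col("a")))
      .withColumn("c", udfC(col("b")))
      .withColumn("d", udfD(col("c")))

    // When the optimizer collapses the sequential projections, d is evaluated as
    // udf_d(udf_c(udf_b(a))) alongside c = udf_c(udf_b(a)), so udf_b and udf_c each
    // run twice per row; the intermediate value c is not re-used.
    output.explain(true)

    sc.stop()
  }
}
{code}

With cheap arithmetic expressions, as in the explain output quoted above, the duplicated work barely matters; with expensive UDFs, the repeated evaluation is the cost this issue proposes to avoid by re-using intermediate values.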
[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10371:
----------------------------------
    Assignee: Nong Li

> Optimize sequential projections
> -------------------------------
>
>                 Key: SPARK-10371
>                 URL: https://issues.apache.org/jira/browse/SPARK-10371
[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10371:
----------------------------------
    Description: (expanded with the {code} example quoted above)

  was:
    In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, a pipeline might produce a DataFrame with columns a, b, c, d, where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize both c and d, udf_b and udf_c are triggered twice, i.e., the value of c is not re-used. It would be nice to detect this pattern and re-use intermediate values.

> Optimize sequential projections
> -------------------------------
>
>                 Key: SPARK-10371
>                 URL: https://issues.apache.org/jira/browse/SPARK-10371