[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10371:
----------------------------------
    Priority: Critical  (was: Major)

> Optimize sequential projections
> -------------------------------
>
>                 Key: SPARK-10371
>                 URL: https://issues.apache.org/jira/browse/SPARK-10371
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, SQL
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>            Assignee: Nong Li
>            Priority: Critical
>
> In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, a pipeline might produce a DataFrame with columns a, b, c, d, where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize both c and d, udf_b and udf_c are triggered twice, i.e., the value of c is not re-used. It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 2)
> output.explain(true)
>
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
> Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
> LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
>
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
> Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
> LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
>
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
> LogicalRDD [id#252L], MapPartitionsRDD[458] at range at <console>:30
>
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
> Scan PhysicalRDD[id#252L]
>
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}
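For illustration only (not part of the ticket): a minimal, self-contained Scala sketch of the a/b/c/d pattern described above. The UDF names and bodies (udfB, udfC, udfD) are made up stand-ins for expensive functions; only the chaining pattern comes from the issue description.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

object SequentialProjectionsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-10371-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Stand-ins for the expensive UDFs named in the description; the bodies are arbitrary.
    val udfB = udf((a: Long) => a + 1)   // b = udf_b(a)
    val udfC = udf((b: Long) => b * 2)   // c = udf_c(b)
    val udfD = udf((c: Long) => c - 3)   // d = udf_d(c)

    val input = sqlContext.range(10).withColumnRenamed("id", "a")
    val output = input
      .withColumn("b", udfB(col("a")))
      .withColumn("c", udfC(col("b")))
      .withColumn("d", udfD(col("c")))

    // When the optimizer collapses the sequential projections, d is evaluated as
    // udf_d(udf_c(udf_b(a))) alongside c = udf_c(udf_b(a)), so udf_b and udf_c each
    // run twice per row; the intermediate value c is not re-used.
    output.explain(true)

    sc.stop()
  }
}
{code}

With cheap arithmetic expressions, as in the explain output quoted above, the duplicated work barely matters; with expensive UDFs, the repeated evaluation is the cost this issue proposes to avoid by re-using intermediate values.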
[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10371:
----------------------------------
    Assignee: Nong Li

> Optimize sequential projections
> -------------------------------
>
>                 Key: SPARK-10371
>                 URL: https://issues.apache.org/jira/browse/SPARK-10371
[jira] [Updated] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10371:
----------------------------------
    Description: (expanded with the {code} example quoted above)

  was:
    In ML pipelines, each transformer/estimator appends new columns to the input DataFrame. For example, a pipeline might produce a DataFrame with columns a, b, c, d, where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). Some UDFs could be expensive. However, if we materialize both c and d, udf_b and udf_c are triggered twice, i.e., the value of c is not re-used. It would be nice to detect this pattern and re-use intermediate values.

> Optimize sequential projections
> -------------------------------
>
>                 Key: SPARK-10371
>                 URL: https://issues.apache.org/jira/browse/SPARK-10371