Xiangrui Meng created SPARK-10371:
-------------------------------------

             Summary: Optimize sequential projections
                 Key: SPARK-10371
                 URL: https://issues.apache.org/jira/browse/SPARK-10371
             Project: Spark
          Issue Type: New Feature
          Components: ML, SQL
    Affects Versions: 1.5.0
            Reporter: Xiangrui Meng


In ML pipelines, each transformer/estimator appends new columns to the input 
DataFrame. For example, a pipeline might produce a DataFrame with columns 
a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), and d = 
udf_d(c). Some UDFs could be expensive. However, if we materialize c and d, 
udf_b and udf_c are triggered twice, i.e., the value of c is not re-used when 
computing d.

It would be nice to detect this pattern and re-use intermediate values.
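
The duplicated work can be illustrated with a small plain-Python sketch. It
does not use Spark and is not how Catalyst evaluates projections; it only
mimics what happens when each output column is expanded to its full
expression tree. The UDF bodies, the call counter, and the helper names
materialize_naive and materialize_reusing are hypothetical and exist only
for illustration.

```python
from collections import Counter

# Counts how many times each (hypothetical) UDF is invoked.
calls = Counter()

def udf_b(a):
    calls["udf_b"] += 1
    return a + 1

def udf_c(b):
    calls["udf_c"] += 1
    return b * 2

def udf_d(c):
    calls["udf_d"] += 1
    return c - 3

def materialize_naive(a):
    # Each output column is evaluated from its full expression tree,
    # so the shared prefix udf_c(udf_b(a)) is computed once per column.
    c = udf_c(udf_b(a))
    d = udf_d(udf_c(udf_b(a)))
    return c, d

def materialize_reusing(a):
    # Intermediate values are computed once and passed along the chain.
    b = udf_b(a)
    c = udf_c(b)
    d = udf_d(c)
    return c, d

calls.clear()
naive_result = materialize_naive(10)
naive_calls = dict(calls)

calls.clear()
reuse_result = materialize_reusing(10)
reuse_calls = dict(calls)

print("naive:  ", naive_result, naive_calls)
print("reusing:", reuse_result, reuse_calls)
```

Both versions return the same values for c and d; the naive one calls udf_b
and udf_c twice per row, while the re-using one calls each UDF once.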



