[ 
https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemant Sakharkar updated SPARK-47742:
-------------------------------------
    Attachment: spark_chain_transformation.png

> Spark Transformation with Multi Case filter can improve efficiency
> ------------------------------------------------------------------
>
>                 Key: SPARK-47742
>                 URL: https://issues.apache.org/jira/browse/SPARK-47742
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Hemant Sakharkar
>            Priority: Major
>              Labels: performance
>         Attachments: spark_chain_transformation.png
>
>
> In feature engineering we process input data to create features and feature 
> vectors for model training, which requires multiple Spark transformations 
> (e.g. map, filter). Spark already optimizes chains of transformations well 
> thanks to lazy execution: it combines multiple transformations into fewer 
> stages, which reduces overall execution time.
> I found that we can still improve execution time in the case of chained filters.
> *Sample Run Results:*
> Records: 50,000,000
> 5 chained filters, execution time (t2-t1): 24854 ms
> 5 filters with map, execution time (t3-t2): 5212 ms
> We could thus improve execution time several times over and significantly 
> reduce the memory footprint of a complex DAG of Spark transformations.
> A sample illustration can be found here:
> [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]
> Spark Core should support such a transformation so that more complex 
> transformations can be expressed. Some illustrations are provided in the 
> above document, and a sketch of the comparison follows below.
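> Below is a minimal sketch of the comparison, assuming the two variants are 
> five chained filter calls versus a single pass that evaluates all five 
> predicates together. The predicate functions and the combined-pass variant 
> are illustrative assumptions, not the exact code behind the numbers above:
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> object FilterChainComparison {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("FilterChainComparison")
>       .master("local[*]")
>       .getOrCreate()
>     val sc = spark.sparkContext
>
>     // Illustrative data set and predicates (hypothetical, not the original benchmark)
>     val data = sc.parallelize(1L to 50000000L)
>     val predicates: Seq[Long => Boolean] =
>       Seq(_ % 2 == 0, _ % 3 == 0, _ % 5 == 0, _ % 7 == 0, _ % 11 == 0)
>
>     // Variant 1: five chained filter transformations, one iterator layer each
>     val t1 = System.currentTimeMillis()
>     val chainedCount = predicates.foldLeft(data)((rdd, p) => rdd.filter(p)).count()
>     val t2 = System.currentTimeMillis()
>
>     // Variant 2: one filter that evaluates all predicates in a single pass
>     val combinedCount = data.filter(x => predicates.forall(_(x))).count()
>     val t3 = System.currentTimeMillis()
>
>     println(s"chained: ${t2 - t1} ms, combined: ${t3 - t2} ms " +
>       s"(counts: $chainedCount / $combinedCount)")
>     spark.stop()
>   }
> }
> {code}
> Although Spark pipelines chained filters within a single stage, each filter 
> still adds another iterator layer and a per-element function call, whereas 
> the combined variant evaluates all predicates in one traversal of the data.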


