[ https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hemant Sakharkar updated SPARK-47742:
-------------------------------------
    Attachment: spark_chain_transformation.png

> Spark Transformation with Multi Case filter can improve efficiency
> ------------------------------------------------------------------
>
>                 Key: SPARK-47742
>                 URL: https://issues.apache.org/jira/browse/SPARK-47742
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Hemant Sakharkar
>            Priority: Major
>              Labels: performance
>         Attachments: spark_chain_transformation.png
>
>
> In feature engineering, we process the input data to create the features
> and feature vectors required to train a model. This takes multiple Spark
> transformations (e.g., map, filter). Spark already optimizes such chains
> well because of its lazy execution: it combines multiple transformations
> into fewer transformations, which reduces the overall execution time.
> I found that the execution time can still be improved in the case of filters.
> *Sample Run Results:*
> Records: 50,000,000
> 5 filter, execution time (t2-t1): 24854 ms
> 5 filter with Map, execution time (t3-t2): 5212 ms
> In this run the map-based variant was about 4.8 times faster. For a complex
> DAG of Spark transformations, we could improve execution time several-fold
> and significantly reduce the memory footprint.
> A sample illustration can be found here:
> [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]
> Spark Core needs to support such a transformation so that more complex
> transformations can be expressed. Some illustrations are provided in the
> document above.
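
The issue contrasts five separate filters with a single map-based step, but does not include the code behind the measurements. The following is only a minimal sketch of the general idea in plain Python, not the reporter's implementation and not a Spark API: `separate_filters` scans the input once per case, while `multi_case_filter` scans it once and routes each record to every case it matches. The predicates in `CASES` and both function names are hypothetical, chosen for illustration.

```python
from typing import Callable, Dict, List

Record = int
Predicate = Callable[[Record], bool]

# Hypothetical filter cases; the issue does not list the real predicates.
CASES: Dict[str, Predicate] = {
    "even": lambda x: x % 2 == 0,
    "div_by_3": lambda x: x % 3 == 0,
    "div_by_5": lambda x: x % 5 == 0,
    "small": lambda x: x < 10,
    "large": lambda x: x >= 90,
}


def separate_filters(
    records: List[Record], cases: Dict[str, Predicate]
) -> Dict[str, List[Record]]:
    """One full scan of the input per case (five scans for five cases)."""
    return {name: [r for r in records if pred(r)] for name, pred in cases.items()}


def multi_case_filter(
    records: List[Record], cases: Dict[str, Predicate]
) -> Dict[str, List[Record]]:
    """A single scan: each record is visited once and tested against every case."""
    buckets: Dict[str, List[Record]] = {name: [] for name in cases}
    for r in records:
        for name, pred in cases.items():
            if pred(r):
                buckets[name].append(r)
    return buckets


records = list(range(100))
by_scans = separate_filters(records, CASES)
by_single_pass = multi_case_filter(records, CASES)

# Both strategies must select exactly the same records for every case.
assert by_scans == by_single_pass
```

Both functions evaluate the same number of predicates; what the single-pass version avoids is re-reading the input once per case. In Spark the analogous cost would be recomputing or re-reading the parent dataset for each filtered branch when it is not cached, which is presumably where the measured difference comes from.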