[ 
https://issues.apache.org/jira/browse/BEAM-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052404#comment-17052404
 ] 

Luke Cwik commented on BEAM-9440:
---------------------------------

There are certain things that Apache Beam does (such as metrics gathering, 
state sampling, ...) which are duplicated within Apache Beam for each 
state/stage execution because the translation from the Apache Beam pipeline to 
the native implementation doesn't perform the optimizations needed to remove 
this duplicate work and use the underlying native mechanism instead.

Secondly, it is interesting to note that the Spark/Flink/Apex executions have 
different slowdown factors of 5x, 10x and 25x even though they all share the 
same runner core libraries, which suggests that the translations from Apache 
Beam pipelines to the native implementations vary greatly.

Finally, I would be interested to know whether the author tested the portable 
versions as well.

> Performance Issue with Spark Runner compared with Native Spark
> --------------------------------------------------------------
>
>                 Key: BEAM-9440
>                 URL: https://issues.apache.org/jira/browse/BEAM-9440
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Soumabrata Chakraborty
>            Priority: Major
>
> While doing a performance evaluation of Apache Beam with the Spark runner, I 
> found that even for a simple word count problem on a text file, Beam with 
> the Spark runner was 5 times slower than native Spark for a dataset as small 
> as 14 GB.
> You will find more details on this evaluation here - 
> [https://github.com/soumabrata-chakraborty/spark-vs-beam/blob/master/README.md]
> I also came across this analysis called _Quantitative Impact Evaluation of 
> an Abstraction Layer for Data Stream Processing Systems_ 
> ([https://arxiv.org/pdf/1907.08302.pdf] / 
> [https://ieeexplore.ieee.org/document/8884832])
> According to it, for most scenarios the slowdown was at least a factor of 3, 
> with the worst case being a factor of 58!
> While it is understood that an abstraction layer comes with some 
> performance cost, the current performance cost seems to be very high.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
