[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579854#comment-16579854
 ] 

Thomas Graves commented on SPARK-24787:
---------------------------------------

Yes, it was caused by hsync. hsync has to go to the namenode in addition to 
the datanode, whereas hflush is a datanode-only operation. We saw a huge 
increase in dropped events with this on large jobs; once we reverted the 
change and went back to only hflush, events stopped being dropped.

I talked to one of our HDFS experts, and he said hsync is expensive. How 
expensive might depend on how loaded your HDFS cluster is.

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-24787
>                 URL: https://issues.apache.org/jira/browse/SPARK-24787
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Web UI
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Sanket Reddy
>            Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files, allowing the history server to stay responsive.
> However, we have a production job that has 60000 tasks per stage, and 
> because hsync is slow it starts dropping events, so the history server shows 
> wrong stats.
> A viable solution is to sync less frequently or to make the sync 
> configurable.
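
The suggestion in the description (sync less often, or make it configurable)
could look roughly like the sketch below. It is not Spark's event logging code
and not the fix that was eventually applied. ThrottledSyncWriter,
RecordingStream and min_sync_interval_s are hypothetical names, and the stream
is only assumed to expose write(), hflush() and hsync(). The idea is to keep
the cheap hflush on every event and to issue the expensive hsync at most once
per configurable interval.

```python
import time


class ThrottledSyncWriter:
    """Hypothetical wrapper: hflush on every event, hsync at most once per interval."""

    def __init__(self, stream, min_sync_interval_s=5.0, clock=time.monotonic):
        # `stream` is assumed to expose write(), hflush() and hsync().
        self._stream = stream
        # None disables hsync entirely (hflush-only behaviour).
        self._min_sync_interval_s = min_sync_interval_s
        self._clock = clock
        self._last_sync = None  # time of the last hsync, None if never synced

    def write_event(self, line):
        self._stream.write(line)
        self._stream.hflush()  # cheap path, done for every event
        if self._min_sync_interval_s is None:
            return
        now = self._clock()
        if self._last_sync is None or now - self._last_sync >= self._min_sync_interval_s:
            self._stream.hsync()  # expensive path, rate limited
            self._last_sync = now


class RecordingStream:
    """In-memory stand-in for a real output stream; it only counts calls."""

    def __init__(self):
        self.lines = []
        self.hflush_calls = 0
        self.hsync_calls = 0

    def write(self, line):
        self.lines.append(line)

    def hflush(self):
        self.hflush_calls += 1

    def hsync(self):
        self.hsync_calls += 1
```

Passing min_sync_interval_s=None in this sketch gives the hflush-only
behaviour described in the comment above. A larger interval trades how quickly
the file length is updated for fewer expensive sync calls.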


