[ 
https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937364#comment-15937364
 ] 

Haibo Chen commented on YARN-6376:
----------------------------------

bq. We should synchronize these two operations.
Agreed. We may need to create a TimelineWriter wrapper for this purpose:
{code}
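import java.util.concurrent.locks.ReentrantLock;

public class TimelineWriterSynchronizedOnPutEntitiesSync {
     private TimelineWriter writer;
     // lock for serializing putEntitiesSync() and flush()
     private final ReentrantLock lock = new ReentrantLock();
     TimelineWriteResponse putEntitiesSync() {
        lock.lock();  // block until the lock is acquired
        try {
           // write() arguments elided for brevity
           TimelineWriteResponse response = writer.write();
           writer.flush();
           return response;
        } finally {
          lock.unlock();
        }
     }
     void putEntitiesAsync() {
         // only buffers the entities; not serialized with flush()
         writer.write();
      }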
     void flush() {
        lock.lock(); 
        try {
          writer.flush();
        } finally {
          lock.unlock();
        }
     }
}
{code}

However, this quickly gets out of control if TimelineWriter also flushes 
internally, for instance when its buffer reaches a size limit, because we can 
no longer synchronize those flushes from outside of TimelineWriter.
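
To make the concern concrete, here is a minimal, self-contained sketch. It does 
not use the real TimelineWriter API: BufferingWriter, Wrapper and the capacity 
of 2 are hypothetical stand-ins, and the writer peeks at the wrapper's lock 
only so that the sketch can record whether each flush ran under it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class InternalFlushSketch {
    // The wrapper's lock; static only so the writer can observe it in this sketch.
    static final ReentrantLock lock = new ReentrantLock();

    /** Hypothetical writer that flushes on its own once the buffer is full. */
    static class BufferingWriter {
        private final int capacity;
        private final List<String> buffer = new ArrayList<>();
        // For each flush, records whether the calling thread held the wrapper's lock.
        final List<Boolean> flushHeldLock = new ArrayList<>();

        BufferingWriter(int capacity) {
            this.capacity = capacity;
        }

        synchronized void write(String entity) {
            buffer.add(entity);
            if (buffer.size() >= capacity) {
                flush(); // internal, size-based flush: the wrapper never sees it
            }
        }

        synchronized void flush() {
            flushHeldLock.add(lock.isHeldByCurrentThread());
            buffer.clear();
        }
    }

    /** Same locking scheme as the wrapper proposed above, reduced to two methods. */
    static class Wrapper {
        private final BufferingWriter writer;

        Wrapper(BufferingWriter writer) {
            this.writer = writer;
        }

        void putEntitiesAsync(String entity) {
            writer.write(entity); // no lock taken, as in the proposal
        }

        void flush() {
            lock.lock();
            try {
                writer.flush();
            } finally {
                lock.unlock();
            }
        }
    }

    static List<Boolean> run() {
        BufferingWriter writer = new BufferingWriter(2);
        Wrapper wrapper = new Wrapper(writer);
        wrapper.putEntitiesAsync("a");
        wrapper.putEntitiesAsync("b"); // fills the buffer, triggers the internal flush
        wrapper.flush();               // explicit flush through the wrapper
        return writer.flushHeldLock;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [false, true]
    }
}
```

The first recorded flush is the size-based one and runs without the lock, so a 
concurrent putEntitiesSync() would not be serialized against it.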

> Exceptions caused by synchronous putEntities requests can be swallowed in 
> TimelineCollector
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-6376
>                 URL: https://issues.apache.org/jira/browse/YARN-6376
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: ATSv2
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Haibo Chen
>            Priority: Critical
>              Labels: yarn-5355-merge-blocker
>
> TimelineCollector.putEntities() is currently implemented by calling 
> TimelineWriter.write() followed by TimelineWriter.flush(). Given that 
> HBaseTimelineWriter.write() is an asynchronous operation, it is possible that 
> TimelineClient sends a synchronous putEntities() request for critical data 
> but never gets back an exception, even though the HBase write request to 
> store the entities may have failed. 
> This is due to a race condition between the WriterFlushThread in 
> TimelineCollectorManager and the web threads handling synchronous 
> putEntities() requests. Entities are first put into the buffer by the web 
> thread, but before the web thread invokes writer.flush(), the 
> WriterFlushThread may fire and flush the writer. If the entities were not 
> successfully written to the backend during that flush, the WriterFlushThread 
> would simply log an error, whereas the web thread would never get an 
> exception from its own writer.flush() invocation. This is bad because the 
> reason TimelineClient sends putEntities() synchronously is to retry upon any 
> exception.
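
The interleaving described above can be replayed deterministically in a single 
thread. This is only a sketch under stated assumptions: BufferingWriter is a 
hypothetical stand-in rather than HBaseTimelineWriter, and it assumes that a 
failed flush still drains the buffer, so a later flush has nothing left to fail on.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SwallowedExceptionSketch {
    /** Hypothetical asynchronous writer: write() only buffers, flush() hits the backend. */
    static class BufferingWriter {
        private final List<String> buffer = new ArrayList<>();
        private final boolean backendDown;

        BufferingWriter(boolean backendDown) {
            this.backendDown = backendDown;
        }

        synchronized void write(String entity) {
            buffer.add(entity);
        }

        synchronized void flush() throws IOException {
            if (buffer.isEmpty()) {
                return; // nothing pending, so nothing can fail
            }
            buffer.clear(); // assumption: entities leave the buffer even on failure
            if (backendDown) {
                throw new IOException("backend write failed");
            }
        }
    }

    /** Replays: web thread writes, flusher thread flushes first, web thread flushes last. */
    static List<String> replay() {
        BufferingWriter writer = new BufferingWriter(true);
        List<String> events = new ArrayList<>();

        // 1. The web thread buffers the entities of a synchronous request.
        writer.write("critical entity");

        // 2. The periodic flusher wins the race and only logs the failure.
        try {
            writer.flush();
        } catch (IOException e) {
            events.add("flusher logged: " + e.getMessage());
        }

        // 3. The web thread's own flush finds an empty buffer and succeeds.
        try {
            writer.flush();
            events.add("web thread: flush() returned normally");
        } catch (IOException e) {
            events.add("web thread: " + e.getMessage());
        }
        return events;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

In this replay the failure is reported only to the flusher, so the caller of the 
synchronous request has no exception to retry on.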



