[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush
[ https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802725#comment-17802725 ]

Shilun Fan commented on YARN-6382:

Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0.

> Address race condition on TimelineWriter.flush() caused by buffer-sized flush
>
> Key: YARN-6382
> URL: https://issues.apache.org/jira/browse/YARN-6382
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Yousef Abu-Salah
> Priority: Major
>
> YARN-6376 fixes the race condition between putEntities() and periodical
> flush() by WriterFlushThread in TimelineCollectorManager, or between
> putEntities() in different threads.
> However, BufferedMutator can have internal size-based flush as well. We need
> to address the resulting race condition.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush
[ https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960241#comment-15960241 ]

Haibo Chen commented on YARN-6382:

Nice catch! I have filed YARN-6455 to improve it.
[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush
[ https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960175#comment-15960175 ]

Joep Rottinghuis commented on YARN-6382:

Clarified with [~haibochen] that he meant that the race conditions for the latter two cases are solved in YARN-6376. That makes sense.

Synchronizing on the writer is still a little brittle there, because there is a getWriter method which lets callers access the writer without synchronizing on it. AppLevelTimelineCollector#AppLevelAggregator#agregate() does this in line 152:
getWriter().write(...
In this case it doesn't flush, but if that were to be added, that would re-introduce the race fixed in YARN-6376.

Instead of exposing the writer, perhaps it would be better to have the sub-classes call #putEntities instead. It defers to the private writeTimelineEntities which does the same work to get the context:
TimelineCollectorContext context = getTimelineEntityContext();

Should we open a separate bug for that to enhance the fix in YARN-6376?
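
The suggestion in the comment above (do not expose the writer, and route sub-classes through #putEntities) can be sketched roughly as follows. This is a minimal illustration, not the Hadoop code: SketchWriter, RecordingWriter, SketchCollector and SketchAggregatingCollector are hypothetical names, entities are plain strings, and it assumes the write-then-flush sequence is guarded by a lock on the writer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical minimal writer interface; the real TimelineWriter API is richer. */
interface SketchWriter {
  void write(String entity);
  void flush();
}

/** Test double that records the order of the calls it receives. */
class RecordingWriter implements SketchWriter {
  final List<String> calls = new ArrayList<>();
  @Override public void write(String entity) { calls.add("write:" + entity); }
  @Override public void flush() { calls.add("flush"); }
}

/** Base collector that owns the writer and never hands it out. */
abstract class SketchCollector {
  private final SketchWriter writer; // private, and no getWriter() accessor

  SketchCollector(SketchWriter writer) {
    this.writer = writer;
  }

  /** The only write path: write and flush happen under one lock. */
  public final void putEntities(List<String> entities) {
    synchronized (writer) {
      for (String entity : entities) {
        writer.write(entity);
      }
      writer.flush();
    }
  }
}

/** Sub-class that aggregates; it can only reach the writer via putEntities(). */
class SketchAggregatingCollector extends SketchCollector {
  SketchAggregatingCollector(SketchWriter writer) {
    super(writer);
  }

  void aggregate(String aggregatedEntity) {
    putEntities(Collections.singletonList(aggregatedEntity));
  }
}

class WriterEncapsulationDemo {
  public static void main(String[] args) {
    RecordingWriter writer = new RecordingWriter();
    new SketchAggregatingCollector(writer).aggregate("app-level-metrics");
    System.out.println(writer.calls);
  }
}
```

Because the writer field is private and putEntities() is final, a sub-class cannot add an unsynchronized write or flush by accident. One trade-off of this sketch is that every aggregation now also flushes, which the direct getWriter().write(...) call described above does not do.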
[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush
[ https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954358#comment-15954358 ]

Haibo Chen commented on YARN-6382:

Thanks for the nice summary [~jrottinghuis]!

bq. This write causes the buffer to be full, or perhaps thread B calls flush, or a timer calls flush.

The latter two cases have been fixed by YARN-6357, so we only need to concern ourselves with the case where the buffer becomes full.

I believe that what I was mostly concerned about, losing data due to intermittent connection issues and this race condition, is only an issue if there is no spooling support. Assuming most data/entities are not problematic, that is, a flush will not fail because of the data itself and subsequent retries will eventually write the data successfully to HBase, we can provide enough of a guarantee that all good entities will eventually be persisted in HBase.

Given that most of what b) solves will go away when we have the spooling writer, I agree that we could just document the issue for now. Once we get the spooling writer, we can come back and revisit this to address what we want to do with malformed/problematic entities.
[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush
[ https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954009#comment-15954009 ]

Joep Rottinghuis commented on YARN-6382:

Thanks for pointing this out [~haibochen].

Yes, with asynchronous buffering and size-based flush this can happen. The periodic buffering can cause the same issue.

Here is the scenario:
* Internal buffer in buffered mutator is almost full
* Thread A does a write (which we know will cause issues later down the road)
* Thread B does a write.
** This write causes the buffer to be full, or perhaps thread B calls flush, or a timer calls flush.
** The earlier put from A caused an issue
** Thread B gets an error back, not knowing exactly which put failed, it can re-try its write later
* The buffer is now empty
* Thread A does a flush to confirm that its previous write made it through
* Thread A receives a success status, because there are no further issues
* Thread A incorrectly assumes that its writes were successfully written

There seem to be three options to deal with this:

a) Make writes synchronous, i.e. for important writes do not use a buffered Mutator. The APIs would have to change, and performance might be significantly impacted as we saw in tests early on in the application timeline service development.

b) Modify the API for the BufferedMutator (or not use the public API that comes along from instantiating one from the connection, i.e. -> hackery required). For a put we would return the batch-id (see work on HBASE-17018) to indicate which batch of writes a put went into. Then for the flush, we'd change the API as well to take a batch ID as an input argument. The (Spooling)BufferedMutator would then have to keep track of a limited list of recent failed batches for failed flushes. When threads ask if their batch failed, we can check the earliest entry in the failed list against the requested batch and return whether it was successful, failed, or if we don't know for sure (due to the limit in # failed batches we want to keep).

This becomes all the more complicated when we start considering spooling, because the error can happen much later. In the presence of spooling, all we really "guarantee" is that puts are persisted to a (distributed) filesystem, and that we'll do our utmost best to replay. Of course operators of a particular installation may choose to spool after an infinite amount of time, essentially blocking writes until they can be pushed into HBase.

This leads us to the third option to deal with these race conditions:

c) Document the conditions in JavaDoc and/or the external documentation, and move on for now. Language could be something like:
{noformat}
Under rare circumstances, some race conditions can exist between writers and internal buffer flushing that make it appear that a flush succeeds after a problematic write.
{noformat}

> Address race condition on TimelineWriter.flush() caused by buffer-sized flush
>
> Key: YARN-6382
> URL: https://issues.apache.org/jira/browse/YARN-6382
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Labels: yarn-5355-merge-blocker
>
> YARN-6376 fixes the race condition between putEntities() and periodical
> flush() by WriterFlushThread in TimelineCollectorManager, or between
> putEntities() in different threads.
> However, BufferedMutator can have internal size-based flush as well. We need
> to address the resulting race condition.
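
The scenario and option b) from the comment above can be illustrated with a small, single-threaded simulation. This is only a sketch under stated assumptions: ToyBufferedWriter and its methods are hypothetical and are not the HBase BufferedMutator API, a "problematic" write is modelled as any entity whose name starts with "bad", and the thread interleaving is replayed sequentially so the outcome is deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy stand-in for a buffered writer. Hypothetical; not the HBase BufferedMutator API. */
class ToyBufferedWriter {
  enum BatchStatus { PENDING, SUCCEEDED, FAILED, UNKNOWN }

  private final int capacity;         // buffer size that triggers an internal flush
  private final int maxFailedTracked; // how many failed batch ids are remembered
  private final List<String> buffer = new ArrayList<>();
  private final Deque<Long> failedBatches = new ArrayDeque<>();
  private long currentBatch = 0;
  private long newestEvictedFailure = -1;

  ToyBufferedWriter(int capacity, int maxFailedTracked) {
    this.capacity = capacity;
    this.maxFailedTracked = maxFailedTracked;
  }

  /** Buffers one entity and returns the id of the batch it went into. */
  synchronized long put(String entity) {
    long batchId = currentBatch;
    buffer.add(entity);
    if (buffer.size() >= capacity) {
      flush(); // size-based flush; the result is recorded per batch, not returned
    }
    return batchId;
  }

  /** Plain flush: it can only report on what is in the buffer right now. */
  synchronized boolean flush() {
    if (buffer.isEmpty()) {
      return true; // nothing pending, so nothing can fail
    }
    boolean ok = buffer.stream().noneMatch(e -> e.startsWith("bad"));
    if (!ok) {
      failedBatches.addLast(currentBatch);
      if (failedBatches.size() > maxFailedTracked) {
        newestEvictedFailure = failedBatches.removeFirst();
      }
    }
    buffer.clear();
    currentBatch++;
    return ok;
  }

  /** Batch-aware check: answers for one specific batch id. */
  synchronized BatchStatus status(long batchId) {
    if (batchId >= currentBatch) {
      return BatchStatus.PENDING;
    }
    if (failedBatches.contains(batchId)) {
      return BatchStatus.FAILED;
    }
    if (batchId <= newestEvictedFailure) {
      return BatchStatus.UNKNOWN; // old enough that a failure may have been forgotten
    }
    return BatchStatus.SUCCEEDED;
  }
}

class FlushRaceDemo {
  public static void main(String[] args) {
    ToyBufferedWriter writer = new ToyBufferedWriter(2, 4);
    long batchOfA = writer.put("bad-entity-from-A"); // thread A writes
    writer.put("entity-from-B");     // thread B fills the buffer; internal flush fails
    boolean plainFlush = writer.flush(); // thread A flushes to confirm its write
    System.out.println("plain flush ok: " + plainFlush);
    System.out.println("batch status: " + writer.status(batchOfA));
  }
}
```

In this toy model the plain flush() reports success because the buffer is already empty, while status(batchOfA) reports FAILED. The UNKNOWN state corresponds to the "don't know for sure" case that arises once the bounded list of failed batches has dropped older entries.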