[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949860#comment-15949860 ]

Hudson commented on YARN-6376:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11503 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/11503/])
YARN-6376. Exceptions caused by synchronous putEntities requests can be (varunsaxena: rev b58777a9c9a5b6f2e4bcfd2b3bede33f25f80dec)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollector.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/TimelineCollectorManager.java

> Exceptions caused by synchronous putEntities requests can be swallowed
> ----------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
> Fix For: YARN-5355, YARN-5355-branch-2, 3.0.0-alpha3
>
> Attachments: YARN-6376.00.patch
>
>
> TimelineCollector.putEntities() is currently implemented by calling
> TimelineWriter.write() followed by TimelineWriter.flush(). Given that
> HBaseTimelineWriter.write() is an asynchronous operation, it is possible that
> TimelineClient sends a synchronous putEntities() request for critical data
> but never gets back an exception, even though the HBase write request to store
> the entities may have failed.
> This is due to a race condition between the WriterFlushThread in
> TimelineCollectorManager and the web threads handling synchronous putEntities()
> requests. Entities are first put into the buffer by the web thread. It is
> possible that, before the web thread invokes writer.flush(), the WriterFlushThread
> fires and flushes the writer. If the entities were not successfully
> written to the backend during that flush, the WriterFlushThread would simply
> log an error, whereas the web thread would never get an exception out of
> its own writer.flush() invocation. This is bad because the reason
> TimelineClient sends putEntities() synchronously is so that it can retry upon any
> exception.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
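The interleaving described in the issue can be replayed deterministically. The following is a minimal sketch, not the actual ATSv2 code: `BufferingWriter` and `SwallowedFlushDemo` are hypothetical names, and the writer is a simplified stand-in that only assumes "write buffers, flush pushes to the backend and may throw". The three steps run in one thread, in the order the race would produce them.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a buffering writer; NOT the real TimelineWriter API. */
class BufferingWriter {
  private final List<String> buffer = new ArrayList<>();
  private final boolean backendHealthy;

  BufferingWriter(boolean backendHealthy) {
    this.backendHealthy = backendHealthy;
  }

  /** Asynchronous-style write: only buffers, never touches the backend. */
  synchronized void write(String entity) {
    buffer.add(entity);
  }

  /** Pushes buffered entities to the backend; can only fail if something is buffered. */
  synchronized void flush() throws IOException {
    if (buffer.isEmpty()) {
      return;
    }
    buffer.clear(); // assumption: the buffer is emptied whether or not the push succeeds
    if (!backendHealthy) {
      throw new IOException("backend write failed");
    }
  }
}

class SwallowedFlushDemo {
  /** Replays the problematic interleaving; returns true if the web thread saw the failure. */
  static boolean webThreadSawFailure() {
    BufferingWriter writer = new BufferingWriter(false);

    writer.write("critical-entity"); // step 1: web thread buffers the entity

    try {
      writer.flush(); // step 2: the periodic flusher happens to run first
    } catch (IOException e) {
      System.out.println("flusher logged: " + e.getMessage()); // the error is only logged
    }

    try {
      writer.flush(); // step 3: web thread flushes an already-empty buffer
      return false;   // no exception, so the client is never told to retry
    } catch (IOException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("web thread saw failure: " + webThreadSawFailure());
  }
}
```

In this sketch the failure surfaces only in the flusher's log line, and `webThreadSawFailure()` returns `false`, which is the swallowed-exception behaviour the issue describes.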
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949818#comment-15949818 ]

Varun Saxena commented on YARN-6376:
------------------------------------

Committed to trunk, YARN-5355, YARN-5355-branch-2.
Thanks [~haibochen] for your contribution.

> Exceptions caused by synchronous putEntities requests can be swallowed
> ----------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
> Attachments: YARN-6376.00.patch
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949054#comment-15949054 ]

Varun Saxena commented on YARN-6376:
------------------------------------

+1
Will commit it today

> Exceptions caused by synchronous putEntities requests can be swallowed
> ----------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
> Attachments: YARN-6376.00.patch
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947723#comment-15947723 ]

Hadoop QA commented on YARN-6376:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 15s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 13m 21s | trunk passed |
| +1 | compile | 0m 16s | trunk passed |
| +1 | checkstyle | 0m 12s | trunk passed |
| +1 | mvnsite | 0m 17s | trunk passed |
| +1 | mvneclipse | 0m 13s | trunk passed |
| +1 | findbugs | 0m 25s | trunk passed |
| +1 | javadoc | 0m 14s | trunk passed |
| +1 | mvninstall | 0m 14s | the patch passed |
| +1 | compile | 0m 13s | the patch passed |
| +1 | javac | 0m 13s | the patch passed |
| +1 | checkstyle | 0m 10s | the patch passed |
| +1 | mvnsite | 0m 15s | the patch passed |
| +1 | mvneclipse | 0m 10s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 0m 29s | the patch passed |
| +1 | javadoc | 0m 10s | the patch passed |
| +1 | unit | 0m 39s | hadoop-yarn-server-timelineservice in the patch passed. |
| +1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 19m 6s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | YARN-6376 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12861075/YARN-6376.00.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux f2fbd7eac1bc 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 13c766b |
| Default Java | 1.8.0_121 |
| findbugs | v3.0.0 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15420/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15420/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.

> Exceptions caused by synchronous putEntities requests can be swallowed
> ----------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed in TimelineCollector
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939093#comment-15939093 ]

Haibo Chen commented on YARN-6376:
----------------------------------

Will upload a patch once YARN-6357 is committed

> Exceptions caused by synchronous putEntities requests can be swallowed in
> TimelineCollector
> -------------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed in TimelineCollector
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938859#comment-15938859 ]

Varun Saxena commented on YARN-6376:
------------------------------------

As discussed in the call, let's just synchronize on the writer object.

> Exceptions caused by synchronous putEntities requests can be swallowed in
> TimelineCollector
> -------------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
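One way to read "synchronize on the writer object" is sketched below. This is an illustration under stated assumptions, not the committed patch: `SketchWriter`, `SyncOnWriterSketch`, `putEntitiesSync`, `periodicFlush` and `countFailures` are hypothetical names, and the writer is a toy whose backend always fails. The point it tries to show is that when both the synchronous put path and the periodic flusher take the writer's monitor, the flusher cannot slip in between `write()` and `flush()`, so the failure always reaches the caller that issued the synchronous request.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical writer whose backend always fails; NOT the real TimelineWriter. */
class SketchWriter {
  private final List<String> buffer = new ArrayList<>();

  // Not thread-safe on its own: callers are expected to hold this object's monitor.
  void write(String entity) {
    buffer.add(entity);
  }

  void flush() throws IOException {
    if (buffer.isEmpty()) {
      return;
    }
    buffer.clear();
    throw new IOException("backend write failed");
  }
}

class SyncOnWriterSketch {
  /** Synchronous put: write and flush happen atomically with respect to the flusher. */
  static void putEntitiesSync(SketchWriter writer, String entity) throws IOException {
    synchronized (writer) {
      writer.write(entity);
      writer.flush();
    }
  }

  /** Periodic flush: takes the same monitor, so it never runs between write and flush. */
  static void periodicFlush(SketchWriter writer) {
    synchronized (writer) {
      try {
        writer.flush();
      } catch (IOException e) {
        // a background flusher would only log here
      }
    }
  }

  /** Issues synchronous puts while a flusher thread runs; returns how many puts failed. */
  static int countFailures(int requests) {
    SketchWriter writer = new SketchWriter();
    AtomicBoolean stop = new AtomicBoolean(false);
    Thread flusher = new Thread(() -> {
      while (!stop.get()) {
        periodicFlush(writer);
        Thread.yield(); // give the request thread a chance to take the monitor
      }
    });
    flusher.start();

    int failures = 0;
    for (int i = 0; i < requests; i++) {
      try {
        putEntitiesSync(writer, "entity-" + i);
      } catch (IOException e) {
        failures++; // the caller sees the failure and could retry
      }
    }

    stop.set(true);
    try {
      flusher.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return failures;
  }

  public static void main(String[] args) {
    System.out.println("failed puts observed by caller: " + countFailures(1000));
  }
}
```

Because every synchronous put in this sketch flushes a non-empty buffer while holding the monitor, each of the 1000 requests observes the `IOException`. As noted elsewhere in the thread, this kind of external locking would not cover flushes triggered inside the writer itself.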
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed in TimelineCollector
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937364#comment-15937364 ]

Haibo Chen commented on YARN-6376:
----------------------------------

bq. We should synchronize these two operations.
Agreed. We may need to create a TimelineWriter wrapper for this purpose:
{code}
import java.util.concurrent.locks.ReentrantLock;

public class TimelineWriterSynchronizedOnPutEntitiesSync {
  private TimelineWriter writer;
  // lock for serializing putEntitiesSync() and flush()
  private final ReentrantLock lock = new ReentrantLock();

  // write() arguments are elided here, as in the rest of this sketch
  TimelineWriteResponse putEntitiesSync() throws IOException {
    lock.lock(); // block until the lock is acquired
    try {
      TimelineWriteResponse response = writer.write();
      writer.flush();
      return response;
    } finally {
      lock.unlock();
    }
  }

  void putEntitiesAsync() throws IOException {
    writer.write();
  }

  void flush() throws IOException {
    lock.lock();
    try {
      writer.flush();
    } finally {
      lock.unlock();
    }
  }
}
{code}
However, this quickly gets out of control if there is a flush() internal to TimelineWriter, a buffer-size-based flush for instance, because we can no longer synchronize outside of TimelineWriter.

> Exceptions caused by synchronous putEntities requests can be swallowed in
> TimelineCollector
> -------------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed in TimelineCollector
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937246#comment-15937246 ]

Varun Saxena commented on YARN-6376:
------------------------------------

Thanks [~haibochen]. Makes sense.
We should synchronize these two operations.

> Exceptions caused by synchronous putEntities requests can be swallowed in
> TimelineCollector
> -------------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed in TimelineCollector
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937221#comment-15937221 ]

Haibo Chen commented on YARN-6376:
----------------------------------

[~varun_saxena] Just added more details to this jira.

> Exceptions caused by synchronous putEntities requests can be swallowed in
> TimelineCollector
> -------------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Bug
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
[jira] [Commented] (YARN-6376) Exceptions caused by synchronous putEntities requests can be swallowed in TimelineCollector
[ https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937195#comment-15937195 ]

Varun Saxena commented on YARN-6376:
------------------------------------

[~haibochen], can you elaborate on this?
A write failure would lead to an IOException, which we would catch and then send an appropriate HTTP error response. Right?

> Exceptions caused by synchronous putEntities requests can be swallowed in
> TimelineCollector
> -------------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Bug
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker