[jira] [Commented] (HADOOP-13927) ADLS TestAdlContractRootDirLive.testRmNonEmptyRootDirNonRecursive failed
[ https://issues.apache.org/jira/browse/HADOOP-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15795923#comment-15795923 ]

Tony Wu commented on HADOOP-13927:
----------------------------------

Hi [~jzhuge], this error started appearing after HADOOP-13900. After reverting the {{azure-data-lake-store-sdk}} version to {{2.0.4-SNAPSHOT}}, the error went away.

> ADLS TestAdlContractRootDirLive.testRmNonEmptyRootDirNonRecursive failed
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-13927
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13927
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/adl, test
>    Affects Versions: 3.0.0-alpha2
>            Reporter: John Zhuge
>            Priority: Critical
>
> {noformat}
> Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 18.095 sec <<< FAILURE! - in org.apache.hadoop.fs.adl.live.TestAdlContractRootDirLive
> testRmNonEmptyRootDirNonRecursive(org.apache.hadoop.fs.adl.live.TestAdlContractRootDirLive)  Time elapsed: 1.085 sec  <<< FAILURE!
> java.lang.AssertionError: non recursive delete should have raised an exception, but completed with exit code false
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.apache.hadoop.fs.contract.AbstractContractRootDirectoryTest.testRmNonEmptyRootDirNonRecursive(AbstractContractRootDirectoryTest.java:132)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 	at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
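For reference, the revert described above would amount to pinning the SDK dependency in {{hadoop-tools/hadoop-azure-datalake/pom.xml}}. A sketch only; the {{groupId}} is assumed from the published artifact name and should be verified against the actual POM:

```xml
<dependency>
  <groupId>com.microsoft.azure</groupId>
  <artifactId>azure-data-lake-store-sdk</artifactId>
  <!-- Revert to the version under which the root-directory contract test passed -->
  <version>2.0.4-SNAPSHOT</version>
</dependency>
```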
[jira] [Commented] (HADOOP-13897) TestAdlFileContextMainOperationsLive#testGetFileContext1 fails consistently
[ https://issues.apache.org/jira/browse/HADOOP-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745554#comment-15745554 ]

Tony Wu commented on HADOOP-13897:
----------------------------------

Hi [~steve_l], thanks a lot for your response. {{TestAdlFileContextMainOperationsLive}} does set up the FC to use ADL as the default FS:

{code}
public class TestAdlFileContextMainOperationsLive
    extends FileContextMainOperationsBaseTest {
  ...
  @Override
  public void setUp() throws Exception {
    Configuration conf = AdlStorageConfiguration.getConfiguration();
    String fileSystem = conf.get(KEY_FILE_SYSTEM);
    if (fileSystem == null || fileSystem.trim().length() == 0) {
      throw new Exception("Default file system not configured.");
    }
    URI uri = new URI(fileSystem);
    FileSystem fs = AdlStorageConfiguration.createStorageConnector();
    fc = FileContext.getFileContext(
        new DelegateToFileSystem(uri, fs, conf, fs.getScheme(), false) {
        }, conf);
    super.setUp();
  }
{code}

However, the failing test creates a second FC with the default config:

{code}
  @Test
  /*
   * Test method
   *  org.apache.hadoop.fs.FileContext.getFileContext(AbstractFileSystem)
   */
  public void testGetFileContext1() throws IOException {
    final Path rootPath = getTestRootPath(fc, "test");
    AbstractFileSystem asf = fc.getDefaultFileSystem();
    // create FileContext using the protected #getFileContext(1) method:
    FileContext fc2 = FileContext.getFileContext(asf);  // << 2nd FC created >>
    // Now just check that this context can do something reasonable:
    final Path path = new Path(rootPath, "zoo");
    FSDataOutputStream out = fc2.create(path, EnumSet.of(CREATE),
        Options.CreateOpts.createParent());
    out.close();
    Path pathResolved = fc2.resolvePath(path);
    assertEquals(pathResolved.toUri().getPath(), path.toUri().getPath());
  }
{code}

{{FileContext.getFileContext()}} uses the default configuration:

{code}
  /**
   * Create a FileContext for specified file system using the default config.
   *
   * @param defaultFS
   * @return a FileContext with the specified AbstractFileSystem
   *         as the default FS.
   */
  protected static FileContext getFileContext(
      final AbstractFileSystem defaultFS) {
    return getFileContext(defaultFS, new Configuration());
  }
{code}

It looks like {{TestAdlFileContextMainOperationsLive#testGetFileContext1}} should be using {{AdlStorageConfiguration.getConfiguration()}} to create {{fc2}}. Alternatively, {{testGetFileContext1}} could be omitted, since the protected API {{FileContext#getFileContext(final AbstractFileSystem defaultFS)}} does not appear to be used anywhere other than this test case. The rest of {{TestAdlFileContextMainOperationsLive}} and all other {{hadoop-azure-datalake}} live tests pass successfully during my testing.

> TestAdlFileContextMainOperationsLive#testGetFileContext1 fails consistently
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-13897
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13897
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/azure
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Tony Wu
>
> {{TestAdlFileContextMainOperationsLive#testGetFileContext1}} (this is a live
> test against Azure Data Lake Store) fails consistently with the following
> error:
> {noformat}
> -------------------------------------------------------
>  T E S T S
> -------------------------------------------------------
> Running org.apache.hadoop.fs.adl.live.TestAdlFileContextMainOperationsLive
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 11.55 sec <<< FAILURE! - in org.apache.hadoop.fs.adl.live.TestAdlFileContextMainOperationsLive
> testGetFileContext1(org.apache.hadoop.fs.adl.live.TestAdlFileContextMainOperationsLive)  Time elapsed: 11.229 sec  <<< ERROR!
> java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
> 	at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:136)
> 	at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:165)
> 	at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:250)
> 	at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:331)
> 	at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:328)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
> 	at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:328)
> 	at org.apache.hadoop.fs.FileContext.getFSofPath(FileContext.java:320)
> 	at
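The root-cause pattern, a protected factory overload that silently substitutes a fresh default configuration for the caller's configured one, can be illustrated without Hadoop. Everything below is a hypothetical JDK-only analog (class and method names are invented, not Hadoop APIs):

```java
import java.util.HashMap;
import java.util.Map;

public class DefaultConfigPitfall {
    // Stand-in for Hadoop's Configuration: a plain key/value map.
    static class Config {
        final Map<String, String> props = new HashMap<>();
        String get(String key) { return props.get(key); }
        void set(String key, String value) { props.put(key, value); }
    }

    // Analog of FileContext.getFileContext(AbstractFileSystem): note that the
    // single-argument overload fabricates a brand-new default Config internally.
    static Config getContext(String fsName) {
        return getContext(fsName, new Config()); // caller's settings are lost here
    }

    static Config getContext(String fsName, Config conf) {
        return conf;
    }

    public static void main(String[] args) {
        Config adlConf = new Config();
        adlConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential");

        // fc2-style creation through the single-argument overload: the
        // credential key configured above never reaches the new context.
        Config fc2 = getContext("adl");
        System.out.println(fc2.get("dfs.adls.oauth2.access.token.provider.type"));

        // Passing the test configuration explicitly preserves it.
        Config fixed = getContext("adl", adlConf);
        System.out.println(fixed.get("dfs.adls.oauth2.access.token.provider.type"));
    }
}
```

This mirrors why the second FileContext lacks {{dfs.adls.oauth2.access.token.provider.type}}: the key exists only in the test's configuration object, which the overload never sees.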
[jira] [Commented] (HADOOP-13897) TestAdlFileContextMainOperationsLive#testGetFileContext1 fails consistently
[ https://issues.apache.org/jira/browse/HADOOP-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743637#comment-15743637 ]

Tony Wu commented on HADOOP-13897:
----------------------------------

It appears the cause is that the following test ({{FileContextMainOperationsBaseTest#testGetFileContext1}}) uses the default configuration rather than the ADLS test-specific config {{hadoop-azure-datalake/src/test/resources/contract-test-options.xml}}:

{code}
  @Test
  /*
   * Test method
   *  org.apache.hadoop.fs.FileContext.getFileContext(AbstractFileSystem)
   */
  public void testGetFileContext1() throws IOException {
    final Path rootPath = getTestRootPath(fc, "test");
    AbstractFileSystem asf = fc.getDefaultFileSystem();
    // create FileContext using the protected #getFileContext(1) method:
    FileContext fc2 = FileContext.getFileContext(asf);  // this uses the default config
    // Now just check that this context can do something reasonable:
    final Path path = new Path(rootPath, "zoo");
    FSDataOutputStream out = fc2.create(path, EnumSet.of(CREATE),
        Options.CreateOpts.createParent());
    out.close();
    Path pathResolved = fc2.resolvePath(path);
    assertEquals(pathResolved.toUri().getPath(), path.toUri().getPath());
  }
{code}

The default config does not have {{dfs.adls.oauth2.access.token.provider.type}} defined.
[jira] [Created] (HADOOP-13897) TestAdlFileContextMainOperationsLive#testGetFileContext1 fails consistently
Tony Wu created HADOOP-13897:
--------------------------------

             Summary: TestAdlFileContextMainOperationsLive#testGetFileContext1 fails consistently
                 Key: HADOOP-13897
                 URL: https://issues.apache.org/jira/browse/HADOOP-13897
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs/azure
    Affects Versions: 3.0.0-alpha2
            Reporter: Tony Wu

{{TestAdlFileContextMainOperationsLive#testGetFileContext1}} (this is a live test against Azure Data Lake Store) fails consistently with the following error:

{noformat}
-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.hadoop.fs.adl.live.TestAdlFileContextMainOperationsLive
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 11.55 sec <<< FAILURE! - in org.apache.hadoop.fs.adl.live.TestAdlFileContextMainOperationsLive
testGetFileContext1(org.apache.hadoop.fs.adl.live.TestAdlFileContextMainOperationsLive)  Time elapsed: 11.229 sec  <<< ERROR!
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:136)
	at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:165)
	at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:250)
	at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:331)
	at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:328)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
	at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:328)
	at org.apache.hadoop.fs.FileContext.getFSofPath(FileContext.java:320)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:85)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:685)
	at org.apache.hadoop.fs.FileContextMainOperationsBaseTest.testGetFileContext1(FileContextMainOperationsBaseTest.java:1350)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:254)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:149)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Caused by: java.lang.reflect.InvocationTargetException: null
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at
[jira] [Commented] (HADOOP-12875) [Azure Data Lake] Support for contract test and unit test cases
[ https://issues.apache.org/jira/browse/HADOOP-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214317#comment-15214317 ]

Tony Wu commented on HADOOP-12875:
----------------------------------

Hi [~vishwajeet.dusane],

Thanks a lot for filing a separate JIRA, isolating out the ADL unit tests and providing a FS contract test! This is very helpful. I did a quick scan of the patch and have the following comments regarding the newly added FS contract tests:

Instead of having the following change in various contract test implementations:

{code:java}
+  @Override
+  protected boolean isSupported(String feature) throws IOException {
+    return true;
+  }
{code}

it's better to define an {{adls.xml}} file where you specify the file system behavior. You can refer to {{wasb.xml}} as an example. The page [here|https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/testing.html] also describes the details of FS contract tests and best practices for adding new ones.

Instead of adding {{@Ignore}} to test cases unsupported by ADL, I believe you can use {{ContractTestUtils#unsupported(...)}} or {{ContractTestUtils#skip(...)}} instead.

> [Azure Data Lake] Support for contract test and unit test cases
> ----------------------------------------------------------------
>
>                 Key: HADOOP-12875
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12875
>             Project: Hadoop Common
>          Issue Type: Test
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: Hadoop-12875-001.patch
>
> This JIRA describes contract test and unit test case support for the azure data lake file system.
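For illustration, a contract-options file in the style of {{wasb.xml}} might look like the sketch below. The {{fs.contract.*}} keys follow the convention documented for Hadoop FS contract tests, but the specific property set and values shown here are assumptions and would need to match actual ADLS behavior:

```xml
<configuration>
  <!-- Declared capabilities of the ADLS contract; values are illustrative only. -->
  <property>
    <name>fs.contract.test.root-tests-enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.contract.supports-append</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.contract.supports-atomic-rename</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.contract.rejects-seek-past-eof</name>
    <value>true</value>
  </property>
</configuration>
```

With such a file in place, the per-test {{isSupported}} overrides become unnecessary: the base contract classes read the declared capabilities and skip unsupported cases themselves.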
[jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210903#comment-15210903 ]

Tony Wu commented on HADOOP-12666:
----------------------------------

Hi [~vishwajeet.dusane],

Thank you for posting the semantics document, meeting minutes and a new patch. It was really helpful.

*General comments:*
# Regarding the semantics document: it would be great if you could also include more information on how the ADL backend can "lock" a file for write. From the doc and our meeting discussion, there seem to be 2 ways (please confirm):
*# File lease (which you included in the doc). Used by {{createNonRecursive()}}.
*# Maintaining a connection to the backend (you mentioned this during the meeting). Used by {{append()}}.
*** In this case, do we assume the ADL backend tracks which file is opened for write by keeping track of HTTP connections to the file?
# In {{hadoop-tools/hadoop-azure-datalake/src/site/markdown/index.md}} you mentioned:
{quote}User and group information returned as ListStatus and GetFileStatus is in form of GUID associated in Azure Active Directory.{quote}
There are applications which verify file ownership, and they will fail because of this. I also believe this means commands like {{hdfs dfs -ls }} will return GUIDs as user & group, which is not very readable. Can you comment on how users should handle this?
** I see there is an {{adl.debug.override.localuserasfileowner}} config which will override the user with the local client user and the group with "hdfs". But this workaround is probably not for actual usage.

*Specific comments:*
1. Can you explain why {{flushAsync()}} is used in {{BatchAppendOutputStream}}? It seems {{flushAsync()}} is only used in the particular case where some data is left in the buffer from previous writes and that data, combined with the current write, would cross the buffer boundary. I'm not sure why this particular flush has to be async. Consider the following case:
# {{BatchAppendOutputStream#flushAsync()}} returns. {{flushAsync()}} submits the sync job to a thread and sets offset to 0:
{code:java}
private void flushAsync() throws IOException {
  if (offset > 0) {
    waitForOutstandingFlush();
    // Submit the new flush task to the executor
    flushTask = EXECUTOR.submit(new CommitTask(data, offset, eof));
    // Get a new internal buffer for the user to write
    data = getBuffer();
    offset = 0;
  }
}
{code}
# {{BatchAppendOutputStream#write()}} returns.
# The client closes the output stream.
# {{BatchAppendOutputStream#close()}} checks whether there is anything to flush by checking offset. In this case there is nothing to flush, because offset was set to 0 earlier:
{code:java}
boolean flushedSomething = false;
if (hadError) {
  // No point proceeding further since the error has occurred and
  // stream would be required to upload again.
  return;
} else {
  flushedSomething = offset > 0;
  flush();
}
{code}
# {{BatchAppendOutputStream#close()}} does not wait for the async flush job to complete. After this point, if {{flushAsync()}} hit any error, the error is lost and the client will not be aware of it.
# If the client then starts to write (append) to the same file with a new stream, this new write also will not wait for the previous async job to complete, because {{flushTask}} is internal to {{BatchAppendOutputStream}}. It might be possible for the 2 writes to reach the backend in reverse order.

IMHO, if this async flush is necessary, then {{EXECUTOR}} should be created inside {{BatchAppendOutputStream}} and shut down when the stream is closed. Currently {{EXECUTOR}} lives in {{PrivateAzureDataLakeFileSystem}}.
2. {{EXECUTOR}} in {{PrivateAzureDataLakeFileSystem}} is not shut down properly.
3. A stream-is-closed check is missing in {{BatchAppendOutputStream}}. This check is present in {{BatchByteArrayInputStream}}.
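The lost-error scenario in the numbered steps above can be reproduced with plain JDK concurrency primitives. The class below is a hypothetical stand-in, not the ADL code: a close() that decides purely from the buffer offset never observes a failure thrown by the still-running background flush, while a close that drains the task surfaces it:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncFlushRace {
    static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    static class Stream {
        int offset = 8;          // bytes buffered from previous writes
        Future<?> flushTask;

        void flushAsync() {
            // Background flush that fails, mimicking a transient backend error.
            flushTask = EXECUTOR.submit(() -> {
                throw new RuntimeException("flush to backend failed");
            });
            offset = 0;          // buffer handed off; nothing left locally
        }

        // Buggy close(): decides whether to flush purely from offset and never
        // waits on flushTask, so the async failure is silently lost.
        void close() {
            if (offset > 0) {
                System.out.println("flushing remainder");
            }
        }

        // Safer close(): drain the outstanding flush and surface its error.
        void closeAndWait() throws Exception {
            if (flushTask != null) {
                flushTask.get();  // rethrows the failure as ExecutionException
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Stream s = new Stream();
        s.flushAsync();
        s.close();               // completes "successfully"; error lost
        try {
            s.closeAndWait();
            System.out.println("no error");
        } catch (Exception e) {
            System.out.println("caught: " + e.getCause().getMessage());
        }
        EXECUTOR.shutdown();
    }
}
```

The same draining step would also prevent a follow-up append from racing the previous stream's in-flight flush.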
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: Create_Read_Hadoop_Adl_Store_Semantics.pdf, HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch, HADOOP-12666-007.patch, HADOOP-12666-008.patch, HADOOP-12666-009.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>   Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft Azure Data Lake Store (ADL) from within Hadoop. This would enable existing Hadoop applications
[jira] [Commented] (HADOOP-12876) [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations
[ https://issues.apache.org/jira/browse/HADOOP-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204524#comment-15204524 ]

Tony Wu commented on HADOOP-12876:
----------------------------------

Hi [~vishwajeet.dusane],

Thanks a lot for creating a separate JIRA to discuss the file status cache. I noticed you have removed the relevant code (i.e. {{FileStatusCacheManager}}) from the latest patch in HADOOP-12666. Would you mind reposting the cache implementation here? I think you can post a patch for this JIRA based off the latest patch for HADOOP-12666.

> [Azure Data Lake] Support for process level FileStatus cache to optimize GetFileStatus frequent operations
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-12876
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12876
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>
> Add support to cache GetFileStatus and ListStatus responses locally for a limited period of time. A local cache with a limited lifetime would reduce the number of calls made for the GetFileStatus operation.
> One example where a short-lived local cache would be useful: terasort's ListStatus on the input directory is followed by a GetFileStatus operation on each file within the directory. For 2048 input files in a directory, this would save 2048 GetFileStatus calls during start-up (by using the ListStatus response to cache FileStatus instances).
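As a sketch of the idea (a JDK-only stand-in, not the removed {{FileStatusCacheManager}}): prime a process-local map from a ListStatus response, then serve GetFileStatus from it until a TTL expires. The status values here are plain strings for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FileStatusCacheSketch {
    // Cached entry: an opaque status value plus its insertion time.
    static class Entry {
        final String status;
        final long insertedAtMs;
        Entry(String status, long insertedAtMs) {
            this.status = status;
            this.insertedAtMs = insertedAtMs;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMs;

    FileStatusCacheSketch(long ttlMs) { this.ttlMs = ttlMs; }

    // A ListStatus response primes the cache for every child path.
    void primeFromListStatus(Map<String, String> listing, long nowMs) {
        listing.forEach((path, st) -> cache.put(path, new Entry(st, nowMs)));
    }

    // GetFileStatus consults the cache first; an expired entry forces a
    // (simulated) round trip to the backend.
    String getFileStatus(String path, long nowMs) {
        Entry e = cache.get(path);
        if (e != null && nowMs - e.insertedAtMs < ttlMs) {
            return e.status + " (cached)";
        }
        cache.remove(path);
        return "FILE (fetched from backend)";
    }

    public static void main(String[] args) {
        FileStatusCacheSketch c = new FileStatusCacheSketch(5_000);
        c.primeFromListStatus(Map.of("/in/part-0", "FILE", "/in/part-1", "FILE"), 0);
        System.out.println(c.getFileStatus("/in/part-0", 1_000));  // within TTL
        System.out.println(c.getFileStatus("/in/part-0", 10_000)); // expired
    }
}
```

In the terasort example from the description, the 2048 per-file GetFileStatus calls at job start-up would all land in the primed cache instead of the backend.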
[jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Wu updated HADOOP-12482:
-----------------------------
    Attachment: HADOOP-12482.006.patch

In the v6 patch:
* Fix a {{printStackTrace()}} missed in the previous patch and convert it to {{LOG.error()}}.

> Race condition in JMX cache update
> ----------------------------------
>
>                 Key: HADOOP-12482
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12482
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Tony Wu
>            Assignee: Tony Wu
>         Attachments: HADOOP-12482.001.patch, HADOOP-12482.002.patch, HADOOP-12482.003.patch, HADOOP-12482.004.patch, HADOOP-12482.005.patch, HADOOP-12482.006.patch
>
> updateJmxCache() was updated in HADOOP-11301. However, that patch introduced a race condition in the updateJmxCache() function in MetricsSourceAdapter.java:
> {code:java}
> private void updateJmxCache() {
>   boolean getAllMetrics = false;
>   synchronized (this) {
>     if (Time.now() - jmxCacheTS >= jmxCacheTTL) {
>       // temporarilly advance the expiry while updating the cache
>       jmxCacheTS = Time.now() + jmxCacheTTL;
>       if (lastRecs == null) {
>         getAllMetrics = true;
>       }
>     } else {
>       return;
>     }
>     if (getAllMetrics) {
>       MetricsCollectorImpl builder = new MetricsCollectorImpl();
>       getMetrics(builder, true);
>     }
>     updateAttrCache();
>     if (getAllMetrics) {
>       updateInfoCache();
>     }
>     jmxCacheTS = Time.now();
>     lastRecs = null; // in case regular interval update is not running
>   }
> }
> {code}
> Notice that getAllMetrics is set to true only when:
> # jmxCacheTTL has passed, and
> # lastRecs == null
> lastRecs is set to null in the same function, but gets reassigned by getMetrics(). However, getMetrics() can be called from a different thread:
> # MetricsSystemImpl.onTimerEvent()
> # MetricsSystemImpl.publishMetricsNow()
> Consider the following sequence:
> # updateJmxCache() is called by getMBeanInfo() from a thread getting cached info.
> ** lastRecs is set to null.
> # The metrics source is updated with a new value/field.
> # getMetrics() is called by publishMetricsNow() or onTimerEvent() from a different thread getting the latest metrics.
> ** lastRecs is updated (!= null).
> # jmxCacheTTL passes.
> # updateJmxCache() is called again via getMBeanInfo().
> ** However, because lastRecs has already been updated (!= null), getAllMetrics will not be set to true. So updateInfoCache() is not called and getMBeanInfo() returns the old cached info.
> We ran into this issue on a cluster where a new metric did not get published until much later.
> The case can be made worse by a periodic call to getMetrics() (driven by an external program or script). In such a case getMBeanInfo() may never be able to retrieve the new record.
> The desired behavior should be that updateJmxCache() is guaranteed to call updateInfoCache() once after jmxCacheTTL if lastRecs has been set to null by updateJmxCache() itself.
[jira] [Commented] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996783#comment-14996783 ]

Tony Wu commented on HADOOP-12482:
----------------------------------

Hi [~ozawa] & [~eddyxu],

Please kindly take a look at the latest patch (v4) and let me know if you have any comments.

Thanks,
Tony
[jira] [Commented] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997959#comment-14997959 ] Tony Wu commented on HADOOP-12482: -- Thanks a lot to [~ozawa] for your thorough review. Please kindly take a look at the new patch. > Race condition in JMX cache update > -- > > Key: HADOOP-12482 > URL: https://issues.apache.org/jira/browse/HADOOP-12482 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Tony Wu >Assignee: Tony Wu > Attachments: HADOOP-12482.001.patch, HADOOP-12482.002.patch, > HADOOP-12482.003.patch, HADOOP-12482.004.patch, HADOOP-12482.005.patch, > HADOOP-12482.006.patch > > > updateJmxCache() was updated in HADOOP-11301. However the patch introduced a > race condition. In updateJmxCache() function in MetricsSourceAdapter.java: > {code:java} > private void updateJmxCache() { > boolean getAllMetrics = false; > synchronized (this) { > if (Time.now() - jmxCacheTS >= jmxCacheTTL) { > // temporarilly advance the expiry while updating the cache > jmxCacheTS = Time.now() + jmxCacheTTL; > if (lastRecs == null) { > getAllMetrics = true; > } > } else { > return; > } > if (getAllMetrics) { > MetricsCollectorImpl builder = new MetricsCollectorImpl(); > getMetrics(builder, true); > } > updateAttrCache(); > if (getAllMetrics) { > updateInfoCache(); > } > jmxCacheTS = Time.now(); > lastRecs = null; // in case regular interval update is not running > } > } > {code} > Notice that getAllMetrics is set to true when: > # jmxCacheTTL has passed > # lastRecs == null > lastRecs is set to null in the same function, but gets reassigned by > getMetrics(). > However getMetrics() can be called from a different thread: > # MetricsSystemImpl.onTimerEvent() > # MetricsSystemImpl.publishMetricsNow() > Consider the following sequence: > # updateJmxCache() is called by getMBeanInfo() from a thread getting cached > info. > ** lastRecs is set to null. > # metrics sources is updated with new value/field. 
> # getMetrics() is called by publishMetricsNow() or onTimerEvent() from a different thread getting the latest metrics.
> ** lastRecs is updated (!= null).
> # jmxCacheTTL passes.
> # updateJmxCache() is called again via getMBeanInfo().
> ** However, because lastRecs has already been updated (!= null), getAllMetrics will not be set to true. So updateInfoCache() is not called and getMBeanInfo() returns the old cached info.
> We ran into this issue on a cluster where a new metric did not get published until much later.
> The case can be made worse by a periodic call to getMetrics() (driven by an external program or script). In such a case getMBeanInfo() may never be able to retrieve the new record.
> The desired behavior is that updateJmxCache() is guaranteed to call updateInfoCache() once after jmxCacheTTL, if lastRecs has been set to null by updateJmxCache() itself.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
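The five-step sequence above can be replayed deterministically. Below is a hypothetical, simplified model: the two threads are collapsed into sequential calls, only the lastRecs bookkeeping is kept, and updateAttrCache()/updateInfoCache() are elided. Class and method names (JmxRaceDemo, publishMetrics, secondRefreshHappens) are illustrative, not from the Hadoop source.

```java
// Minimal model of the buggy interaction between updateJmxCache()
// and a concurrent getMetrics() call, per the issue description.
public class JmxRaceDemo {
    private Object lastRecs = null;

    // Stands in for getMetrics() running on the publisher thread:
    // it repopulates lastRecs.
    synchronized void publishMetrics() {
        lastRecs = new Object();
    }

    // Stands in for updateJmxCache(); returns true when the buggy
    // condition would allow updateInfoCache() to run.
    synchronized boolean updateJmxCache() {
        boolean getAllMetrics = (lastRecs == null); // the problematic check
        lastRecs = null; // in case regular interval update is not running
        return getAllMetrics;
    }

    // Replays the sequence: clear, concurrent publish, next TTL-expired update.
    public static boolean secondRefreshHappens() {
        JmxRaceDemo adapter = new JmxRaceDemo();
        adapter.updateJmxCache(); // step 1: clears lastRecs
        adapter.publishMetrics(); // step 3: another thread repopulates it
        return adapter.updateJmxCache(); // step 5: check sees lastRecs != null
    }

    public static void main(String[] args) {
        System.out.println("second refresh happens: " + secondRefreshHappens());
    }
}
```

The second call returns false, i.e. the MBean info cache is never refreshed, which is exactly why the new metric stayed invisible on the reported cluster.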
[jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Wu updated HADOOP-12482:
-
Attachment: HADOOP-12482.005.patch

In v5 patch:
* Addressed [~eddyxu]'s review comments: rename {{TsSource}} and use {{LOG.error}} for printing exception.
[jira] [Commented] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992260#comment-14992260 ] Tony Wu commented on HADOOP-12482:
--
The failed test hadoop.ipc.TestDecayRpcScheduler is not relevant to the change.
[jira] [Commented] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986170#comment-14986170 ] Tony Wu commented on HADOOP-12482:
--
Thanks [~eddyxu] for reviewing my patch. Please take a look at the updated patch and kindly let me know if you have any further comments.
[jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Wu updated HADOOP-12482:
-
Attachment: HADOOP-12482.004.patch

In v4 patch:
* Addressed [~ozawa]'s and [~eddyxu]'s review comments.
* Reworked the test to make use of {{ScheduledExecutorService#scheduleAtFixedRate}}.
[jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Wu updated HADOOP-12482:
-
Attachment: HADOOP-12482.003.patch

In v3 patch:
* Addressed review comments by fixing a typo.
[jira] [Commented] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982967#comment-14982967 ] Tony Wu commented on HADOOP-12482:
--
Oops. I will correct it and post a new patch. Thanks a lot for looking at the patch!
[jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Wu updated HADOOP-12482:
-
Attachment: HADOOP-12482.002.patch

In v2 patch:
* Rebased to latest trunk. Manually verified the reported failed test cases again (on Linux and with the native option) and they pass without error:
{code}
$ mvn -Dtest=TestMetricsSourceAdapter,TestDecayRpcScheduler,TestCopyPreserveFlag,TestReloadingX509TrustManager,TestGangliaMetrics test -Pnative
...
---
 T E S T S
---
Running org.apache.hadoop.fs.shell.TestCopyPreserveFlag
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.77 sec - in org.apache.hadoop.fs.shell.TestCopyPreserveFlag
Running org.apache.hadoop.metrics2.impl.TestGangliaMetrics
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.504 sec - in org.apache.hadoop.metrics2.impl.TestGangliaMetrics
Running org.apache.hadoop.metrics2.impl.TestMetricsSourceAdapter
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.447 sec - in org.apache.hadoop.metrics2.impl.TestMetricsSourceAdapter
Running org.apache.hadoop.ipc.TestDecayRpcScheduler
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.98 sec - in org.apache.hadoop.ipc.TestDecayRpcScheduler
Running org.apache.hadoop.security.ssl.TestReloadingX509TrustManager
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.538 sec - in org.apache.hadoop.security.ssl.TestReloadingX509TrustManager

Results :

Tests run: 27, Failures: 0, Errors: 0, Skipped: 0
{code}
[jira] [Commented] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976619#comment-14976619 ] Tony Wu commented on HADOOP-12482:
--
Manually ran the failed tests on Linux using JDK 1.7; all tests pass without error.
[jira] [Commented] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974504#comment-14974504 ] Tony Wu commented on HADOOP-12482:
--
Manually ran the failed unit tests on OS X using JDK 1.7 and they all pass without error.
[jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Wu updated HADOOP-12482:
-
Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Wu updated HADOOP-12482:
-
Attachment: HADOOP-12482.001.patch

In this patch:
* Addressed the problem by adding a new variable to track whether lastRecs has been cleared by updateJmxCache().
* Changed updateJmxCache() to use the new variable to track when to refresh the info cache.
* Added a test to simulate multiple threads accessing/updating the cache and make sure:
** The new test does capture the original problem.
** The new test verifies the problem is fixed.
Also ran a few additional tests and made sure they pass:
* org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl
* org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainerMetrics
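A minimal sketch of the flag-based approach described in the patch notes above, assuming a hypothetical field name lastRecsCleared (the actual HADOOP-12482.001.patch may differ in detail): updateJmxCache() records that it was the one that cleared lastRecs, so a concurrent getMetrics() that repopulates lastRecs can no longer suppress the info-cache refresh.

{code:java}
/**
 * Simplified sketch of the fix idea (hypothetical names, not the actual
 * patch): a boolean remembers that updateJmxCache() itself cleared
 * lastRecs, and the refresh decision consults that flag instead of
 * relying only on lastRecs == null.
 */
public class FixedJmxCacheSketch {
  private Object lastRecs;                // stands in for the cached metrics records
  private boolean lastRecsCleared = true; // the new tracking variable
  int infoCacheUpdates = 0;               // counts updateInfoCache() calls

  /** Publish path from another thread: repopulates lastRecs. */
  synchronized void publish() {
    lastRecs = new Object();
  }

  synchronized void updateJmxCache() {
    // Refresh when this method cleared lastRecs last time, even if a
    // publisher has repopulated it in the meantime.
    if (lastRecs == null || lastRecsCleared) {
      infoCacheUpdates++;                 // stands in for updateInfoCache()
    }
    lastRecs = null;                      // in case regular interval update is not running
    lastRecsCleared = true;               // remember that we cleared it ourselves
  }

  public static void main(String[] args) {
    FixedJmxCacheSketch adapter = new FixedJmxCacheSketch();
    adapter.updateJmxCache();             // first expiry: refresh
    adapter.publish();                    // concurrent repopulation of lastRecs
    adapter.updateJmxCache();             // second expiry: refresh still runs
    System.out.println("info cache refreshes: " + adapter.infoCacheUpdates); // prints 2
  }
}
{code}

In this simplified model the flag is only ever reset by updateJmxCache() itself, which matches the desired guarantee in the report: one updateInfoCache() per jmxCacheTTL whenever updateJmxCache() was the last to clear lastRecs. The real adapter would also need to account for the regular sampling path refreshing the caches, which is omitted here.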
[jira] [Created] (HADOOP-12482) Race condition in JMX cache update
Tony Wu created HADOOP-12482: Summary: Race condition in JMX cache update Key: HADOOP-12482 URL: https://issues.apache.org/jira/browse/HADOOP-12482 Project: Hadoop Common Issue Type: Bug Affects Versions: 2.7.1 Reporter: Tony Wu Assignee: Tony Wu
[jira] [Updated] (HADOOP-12482) Race condition in JMX cache update
[ https://issues.apache.org/jira/browse/HADOOP-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Wu updated HADOOP-12482: - Description: updateJmxCache() was updated in HADOOP-11301. However the patch introduced a race condition. was: updateJmxCache() was in HADOOP-11301. However the patch introduced a race condition.