[jira] [Commented] (HDFS-9129) Move the safemode block count into BlockManager
[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960263#comment-14960263 ] Hadoop QA commented on HDFS-9129: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 7s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 8m 3s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 26s | The applied patch generated 6 new checkstyle issues (total was 626, now 577). | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 2m 34s | The patch appears to introduce 3 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 3m 12s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 50m 12s | Tests failed in hadoop-hdfs. | | | | 96m 47s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs | | Failed unit tests | hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints | | | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure000 | | | hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations | | | hadoop.hdfs.TestReplication | | | hadoop.hdfs.util.TestByteArrayManager | | | hadoop.hdfs.TestGetBlocks | | | hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestScrLazyPersistFiles | | | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010 | | Timed out tests | org.apache.hadoop.hdfs.TestFileCreation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766984/HDFS-9129.005.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cf23f2c | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13024/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13024/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13024/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13024/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13024/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13024/console | This message was automatically generated. 
> Move the safemode block count into BlockManager > --- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch, > HDFS-9129.005.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can be moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-7964) Add support for async edit logging
[ https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960114#comment-14960114 ] Yi Liu edited comment on HDFS-7964 at 10/16/15 6:40 AM:
Thanks [~daryn] for the work. Further comments:
*1.* In FSEditLogAsync#run
{code}
@Override
public void run() {
  try {
    while (true) {
      if (doSync) {
        ...
        logSync(getLastWrittenTxId());
        ...
{code}
I think it's better to pass the txid of the current edit to {{logSync}}; there is no need to wait for all txids to be written. That would be more efficient, and the client would get a faster response, right?
*2.*
{code}
-log4j.rootLogger=OFF, CONSOLE
+log4j.rootLogger=DEBUG, CONSOLE
{code}
Any reason to change this?
*3.*
{code}
call.abortResponse(syncEx);
{code}
It seems this code isn't available yet?

was (Author: hitliuyi):
Thanks [~daryn] for the work. Further comments:
*1.* In FSEditLogAsync#run
{code}
@Override
public void run() {
  try {
    while (true) {
      if (doSync) {
        ...
        logSync(getLastWrittenTxId());
        ...
{code}
I think it's better to pass the txid of the current edit to {{logSync}}; there is no need to wait for all txids to be written. That would be more efficient, and the client would get a faster response, right?
*2.*
{code}
+ editsBatchedInSync = txid - synctxid - 1;
{code}
Shouldn't it be "txid - synctxid"? txid is the max txid written, and synctxid is the max txid already synced; suppose txid = 20 and synctxid = 10, then editsBatchedInSync should be (txid - synctxid) = (20 - 10) = 10. You can also see this from the existing log message:
{code}
final String msg = "Could not sync enough journals to persistent storage "
    + "due to " + e.getMessage() + ". "
    + "Unsynced transactions: " + (txid - synctxid);
{code}
*3.*
{code}
-log4j.rootLogger=OFF, CONSOLE
+log4j.rootLogger=DEBUG, CONSOLE
{code}
Any reason to change this?
*4.*
{code}
call.abortResponse(syncEx);
{code}
It seems this code isn't available yet?

> Add support for async edit logging
> --
>
> Key: HDFS-7964
> URL: https://issues.apache.org/jira/browse/HDFS-7964
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: namenode
> Affects Versions: 2.0.2-alpha
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
> Attachments: HDFS-7964.patch, HDFS-7964.patch
>
> Edit logging is a major source of contention within the NN. logEdit is called within the namespace write lock, while logSync is called outside of the lock to allow greater concurrency. The handler thread remains busy until logSync returns to provide the client with a durability guarantee for the response.
> Write heavy RPC load and/or slow IO causes handlers to stall in logSync. Although the write lock is not held, readers are limited/starved and the call queue fills. Combining an edit log thread with postponed RPC responses from HADOOP-10300 will provide the same durability guarantee but immediately free up the handlers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
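A sketch of what point 1 above might look like inside the run loop; {{Edit}}, {{pendingEdits}}, and {{needsSync}} are hypothetical stand-ins here, not the actual FSEditLogAsync internals:
{code}
// Hypothetical sketch, not the actual patch: sync only up to the txid of
// the edit just processed, instead of getLastWrittenTxId(), so the caller
// is released as soon as its own transaction is durable.
while (true) {
  Edit edit = pendingEdits.take();  // next logged edit, in txid order
  if (edit.needsSync()) {
    logSync(edit.getTxid());        // rather than logSync(getLastWrittenTxId())
  }
}
{code}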
[jira] [Commented] (HDFS-9129) Move the safemode block count into BlockManager
[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960230#comment-14960230 ] Jing Zhao commented on HDFS-9129: - Thanks for the work, Mingliang! Glad to hear that we're finally allowed to review your patch now :))) > Move the safemode block count into BlockManager > --- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch, > HDFS-9129.005.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can be moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9173) Erasure Coding: Lease recovery for striped file
[ https://issues.apache.org/jira/browse/HDFS-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960222#comment-14960222 ] Walter Su commented on HDFS-9173: - Oh, I get your point. I can do that, but I must be very careful when touching the code for contiguous blocks. First, I think I should separate the code movement into another jira, as [~rakeshr] suggested, and then try your suggestion here. > Erasure Coding: Lease recovery for striped file > --- > > Key: HDFS-9173 > URL: https://issues.apache.org/jira/browse/HDFS-9173 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su >Assignee: Walter Su > Attachments: HDFS-9173.00.wip.patch, HDFS-9173.01.patch, > HDFS-9173.02.step125.patch, HDFS-9173.03.patch, HDFS-9173.04.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9253) Refactor tests of libhdfs into a directory
[ https://issues.apache.org/jira/browse/HDFS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960207#comment-14960207 ] Hadoop QA commented on HDFS-9253: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 42s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 7m 47s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 13s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | native | 3m 9s | Pre-build of native portion | | {color:green}+1{color} | hdfs tests | 0m 43s | Tests passed in hadoop-hdfs-native-client. | | | | 40m 8s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766982/HDFS-9253.002.patch | | Optional Tests | javadoc javac unit | | git revision | trunk / cf23f2c | | hadoop-hdfs-native-client test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13023/artifact/patchprocess/testrun_hadoop-hdfs-native-client.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13023/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13023/console | This message was automatically generated. > Refactor tests of libhdfs into a directory > -- > > Key: HDFS-9253 > URL: https://issues.apache.org/jira/browse/HDFS-9253 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9253.000.patch, HDFS-9253.001.patch, > HDFS-9253.002.patch > > > This jira proposes to refactor the current tests in libhdfs into a separate > directory. The refactor opens up the opportunity to reuse tests in libhdfs, > libwebhdfs and libhdfspp in HDFS-8707 and to also allow cross-validation of > different implementations of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9254) HDFS Secure Mode Documentation updates
[ https://issues.apache.org/jira/browse/HDFS-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-9254: Status: Patch Available (was: In Progress) > HDFS Secure Mode Documentation updates > -- > > Key: HDFS-9254 > URL: https://issues.apache.org/jira/browse/HDFS-9254 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Affects Versions: 2.7.1 >Reporter: Arpit Agarwal >Assignee: Arpit Agarwal > Attachments: HDFS-9254.01.patch > > > Some Kerberos configuration parameters are not documented well enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9254) HDFS Secure Mode Documentation updates
[ https://issues.apache.org/jira/browse/HDFS-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-9254: Attachment: HDFS-9254.01.patch v1 patch.
# Document missing settings in hdfs-default.xml.
# Add more detail to SecureMode.html and rewrite some sections. Document JournalNode settings.
# Update HdfsMultihoming.html to document the new security settings added by HADOOP-12437.
# Minor cleanup of HttpAuthentication.html to convert the textual description to a table. I think this doc needs more detail, but I don't understand this part of the configuration well enough to add content.
> HDFS Secure Mode Documentation updates > -- > > Key: HDFS-9254 > URL: https://issues.apache.org/jira/browse/HDFS-9254 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Affects Versions: 2.7.1 >Reporter: Arpit Agarwal >Assignee: Arpit Agarwal > Attachments: HDFS-9254.01.patch > > > Some Kerberos configuration parameters are not documented well enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9129) Move the safemode block count into BlockManager
[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-9129: Attachment: HDFS-9129.005.patch The failing test is unrelated. The v5 patch stops tracking safe blocks after leaving safe mode. Any comments are welcome. I will work on reducing the synchronization overhead in {{BlockManagerSafeMode}}. > Move the safemode block count into BlockManager > --- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch, > HDFS-9129.005.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can be moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
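As a rough illustration of what "stops tracking safe blocks after leaving safe mode" could look like; the names below are simplified stand-ins, not necessarily the shape of the v5 patch:
{code}
// Simplified sketch (hypothetical names): once safe mode is OFF,
// safe-block bookkeeping becomes a no-op.
class BlockManagerSafeModeSketch {
  enum Status { PENDING_THRESHOLD, EXTENSION, OFF }

  private Status status = Status.PENDING_THRESHOLD;
  private long blockSafe = 0;

  synchronized void incrementSafeBlockCount() {
    if (status == Status.OFF) {
      return; // no longer tracking safe blocks after leaving safe mode
    }
    blockSafe++;
  }
}
{code}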
[jira] [Updated] (HDFS-9253) Refactor tests of libhdfs into a directory
[ https://issues.apache.org/jira/browse/HDFS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-9253: - Attachment: HDFS-9253.002.patch > Refactor tests of libhdfs into a directory > -- > > Key: HDFS-9253 > URL: https://issues.apache.org/jira/browse/HDFS-9253 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9253.000.patch, HDFS-9253.001.patch, > HDFS-9253.002.patch > > > This jira proposes to refactor the current tests in libhdfs into a separate > directory. The refactor opens up the opportunity to reuse tests in libhdfs, > libwebhdfs and libhdfspp in HDFS-8707 and to also allow cross-validation of > different implementations of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9173) Erasure Coding: Lease recovery for striped file
[ https://issues.apache.org/jira/browse/HDFS-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960162#comment-14960162 ] Zhe Zhang commented on HDFS-9173: -
bq. #2. about syncBlockFinalized, syncBlockUnfinalized.
bq. Unfortunately, the logic isn't quite the same between contiguous and striped. For example,
I meant that {{syncBlockFinalized}} is only for contiguous blocks (in the code snippet it only appears in {{RecoveryTaskContiguous}}). The logic of syncing unfinalized replicas in {{RecoveryTaskContiguous}} (RWR, RBW) is very similar to syncing striped internal blocks. So ideally {{RecoveryTaskContiguous}} and {{RecoveryTaskStriped}} should share a {{syncBlockUnfinalized}} method (it appears in both {{RecoveryTaskContiguous}} and {{RecoveryTaskStriped}} in the code snippet).
> Erasure Coding: Lease recovery for striped file > --- > > Key: HDFS-9173 > URL: https://issues.apache.org/jira/browse/HDFS-9173 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su >Assignee: Walter Su > Attachments: HDFS-9173.00.wip.patch, HDFS-9173.01.patch, > HDFS-9173.02.step125.patch, HDFS-9173.03.patch, HDFS-9173.04.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
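A structural sketch of the sharing suggested above; the shape is hypothetical and the bodies are placeholders, only the class layout matters:
{code}
// Hypothetical shape only: the unfinalized-replica sync logic lives in a
// shared base class; only the contiguous task keeps a finalized-replica path.
import java.util.List;
import org.apache.hadoop.hdfs.server.protocol.ReplicaRecoveryInfo;

abstract class RecoveryTask {
  // shared by both tasks: sync RWR/RBW replicas to an agreed recovery length
  void syncBlockUnfinalized(List<ReplicaRecoveryInfo> replicas) {
    // common logic for unfinalized replicas would live here
  }
}

class RecoveryTaskContiguous extends RecoveryTask {
  // contiguous-only: keep the finalized replica's length
  void syncBlockFinalized(List<ReplicaRecoveryInfo> replicas) { }
}

class RecoveryTaskStriped extends RecoveryTask { }
{code}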
[jira] [Commented] (HDFS-9252) Change TestFileTruncate to FsDatasetTestUtils to get block file size and genstamp.
[ https://issues.apache.org/jira/browse/HDFS-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960161#comment-14960161 ] Hadoop QA commented on HDFS-9252: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 24m 21s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 10m 36s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 13m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 31s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 58s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 57s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 44s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 16s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 4m 10s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 62m 20s | Tests failed in hadoop-hdfs. | | | | 123m 52s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.util.TestByteArrayManager | | | hadoop.hdfs.TestDFSUpgradeFromImage | | | hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766899/HDFS-9252.00.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cf23f2c | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13021/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13021/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13021/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13021/console | This message was automatically generated. > Change TestFileTruncate to FsDatasetTestUtils to get block file size and > genstamp. > -- > > Key: HDFS-9252 > URL: https://issues.apache.org/jira/browse/HDFS-9252 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.7.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-9252.00.patch > > > {{TestFileTruncate}} verifies block size and genstamp by directly accessing > the local filesystem, e.g.: > {code} > assertTrue(cluster.getBlockMetadataFile(dn0, >newBlock.getBlock()).getName().endsWith( >newBlock.getBlock().getGenerationStamp() + ".meta")); > {code} > Let's abstract the fsdataset-specific logic behind FsDatasetTestUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
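A sketch of what the proposed indirection could look like in the test; {{getFsDatasetTestUtils}}, {{getStoredGenerationStamp}}, and {{getStoredDataLength}} are hypothetical names here, not an existing API:
{code}
// Hypothetical API, for illustration only: the test asks FsDatasetTestUtils
// for block metadata instead of poking at the local filesystem layout.
FsDatasetTestUtils utils = cluster.getFsDatasetTestUtils(dn0);  // assumed accessor
assertEquals(newBlock.getBlock().getGenerationStamp(),
    utils.getStoredGenerationStamp(newBlock.getBlock()));       // assumed method
assertEquals(newBlock.getBlockSize(),
    utils.getStoredDataLength(newBlock.getBlock()));            // assumed method
{code}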
[jira] [Updated] (HDFS-9184) Logging HDFS operation's caller context into audit logs
[ https://issues.apache.org/jira/browse/HDFS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-9184: Status: Open (was: Patch Available)
> Logging HDFS operation's caller context into audit logs
> ---
>
> Key: HDFS-9184
> URL: https://issues.apache.org/jira/browse/HDFS-9184
> Project: Hadoop HDFS
> Issue Type: Task
> Reporter: Mingliang Liu
> Assignee: Mingliang Liu
> Attachments: HDFS-9184.000.patch, HDFS-9184.001.patch, HDFS-9184.002.patch, HDFS-9184.003.patch, HDFS-9184.004.patch, HDFS-9184.005.patch, HDFS-9184.006.patch, HDFS-9184.007.patch
>
> For a given HDFS operation (e.g. delete file), it's very helpful to track which upper level job issued it. The upper level callers may be specific Oozie tasks, MR jobs, and Hive queries. One scenario is that the namenode (NN) is abused/spammed; the operator may want to know immediately which MR job is to blame so that she can kill it. To this end, the caller context contains at least the application-dependent "tracking id".
> There are several existing techniques that may be related to this problem.
> 1. Currently the HDFS audit log tracks the user of the operation, which is obviously not enough. It's common that the same user issues multiple jobs at the same time. Even for a single top level task, tracking back to a specific caller in a chain of operations of the whole workflow (e.g. Oozie -> Hive -> Yarn) is hard, if not impossible.
> 2. HDFS integrated {{htrace}} support for providing tracing information across multiple layers. The span is created in many places, interconnected like a tree structure, which relies on offline analysis across RPC boundaries. For this use case, {{htrace}} has to be enabled at a 100% sampling rate, which introduces significant overhead. Moreover, passing additional information (via annotations) other than the span id from the root of the tree to a leaf is significant additional work.
> 3. In [HDFS-4680 | https://issues.apache.org/jira/browse/HDFS-4680], there is some related discussion on this topic. The final patch implemented the tracking id as a part of the delegation token. This protects the tracking information from being changed or impersonated. However, Kerberos-authenticated connections and insecure connections don't have tokens. [HADOOP-8779] proposes to use tokens in all scenarios, but that might mean changes to several upstream projects and would be a major change in their security implementation.
> We propose another approach to address this problem. We also treat the HDFS audit log as a good place for after-the-fact root cause analysis. We propose to put the caller id (e.g. Hive query id) in threadlocals. Specifically, on the client side the threadlocal object is passed to the NN as part of the RPC header (optional), while on the server side the NN retrieves it from the header and puts it into the {{Handler}}'s threadlocals. Finally, in {{FSNamesystem}}, the HDFS audit logger will record the caller context for each operation. In this way, the existing code is not affected.
> It is still challenging to keep a "lying" client from abusing the caller context. Our proposal is to add a {{signature}} field to the caller context. The client may choose to provide its signature along with the caller id. The operator may need to validate the signature at the time of offline analysis; the NN is not responsible for validating the signature online. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9184) Logging HDFS operation's caller context into audit logs
[ https://issues.apache.org/jira/browse/HDFS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-9184: Status: Patch Available (was: Open)
> Logging HDFS operation's caller context into audit logs
> ---
>
> Key: HDFS-9184
> URL: https://issues.apache.org/jira/browse/HDFS-9184
> Project: Hadoop HDFS
> Issue Type: Task
> Reporter: Mingliang Liu
> Assignee: Mingliang Liu
> Attachments: HDFS-9184.000.patch, HDFS-9184.001.patch, HDFS-9184.002.patch, HDFS-9184.003.patch, HDFS-9184.004.patch, HDFS-9184.005.patch, HDFS-9184.006.patch, HDFS-9184.007.patch
>
> For a given HDFS operation (e.g. delete file), it's very helpful to track which upper level job issued it. The upper level callers may be specific Oozie tasks, MR jobs, and Hive queries. One scenario is that the namenode (NN) is abused/spammed; the operator may want to know immediately which MR job is to blame so that she can kill it. To this end, the caller context contains at least the application-dependent "tracking id".
> There are several existing techniques that may be related to this problem.
> 1. Currently the HDFS audit log tracks the user of the operation, which is obviously not enough. It's common that the same user issues multiple jobs at the same time. Even for a single top level task, tracking back to a specific caller in a chain of operations of the whole workflow (e.g. Oozie -> Hive -> Yarn) is hard, if not impossible.
> 2. HDFS integrated {{htrace}} support for providing tracing information across multiple layers. The span is created in many places, interconnected like a tree structure, which relies on offline analysis across RPC boundaries. For this use case, {{htrace}} has to be enabled at a 100% sampling rate, which introduces significant overhead. Moreover, passing additional information (via annotations) other than the span id from the root of the tree to a leaf is significant additional work.
> 3. In [HDFS-4680 | https://issues.apache.org/jira/browse/HDFS-4680], there is some related discussion on this topic. The final patch implemented the tracking id as a part of the delegation token. This protects the tracking information from being changed or impersonated. However, Kerberos-authenticated connections and insecure connections don't have tokens. [HADOOP-8779] proposes to use tokens in all scenarios, but that might mean changes to several upstream projects and would be a major change in their security implementation.
> We propose another approach to address this problem. We also treat the HDFS audit log as a good place for after-the-fact root cause analysis. We propose to put the caller id (e.g. Hive query id) in threadlocals. Specifically, on the client side the threadlocal object is passed to the NN as part of the RPC header (optional), while on the server side the NN retrieves it from the header and puts it into the {{Handler}}'s threadlocals. Finally, in {{FSNamesystem}}, the HDFS audit logger will record the caller context for each operation. In this way, the existing code is not affected.
> It is still challenging to keep a "lying" client from abusing the caller context. Our proposal is to add a {{signature}} field to the caller context. The client may choose to provide its signature along with the caller id. The operator may need to validate the signature at the time of offline analysis; the NN is not responsible for validating the signature online. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
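To make the threadlocal idea from the description concrete, a minimal sketch; the class shape is illustrative only, not the API the patch defines:
{code}
// Illustrative only: a caller context carried in a thread-local, set by the
// upper-level application and read when building the RPC header.
public final class CallerContextSketch {
  private static final ThreadLocal<CallerContextSketch> CURRENT = new ThreadLocal<>();

  private final String context;   // e.g. a Hive query id or MR job id
  private final byte[] signature; // optional; validated offline, not by the NN

  private CallerContextSketch(String context, byte[] signature) {
    this.context = context;
    this.signature = signature;
  }

  public static void set(String context, byte[] signature) {
    CURRENT.set(new CallerContextSketch(context, signature));
  }

  public static CallerContextSketch get() {
    return CURRENT.get();
  }

  public String getContext() { return context; }
  public byte[] getSignature() { return signature; }
}
{code}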
[jira] [Commented] (HDFS-7964) Add support for async edit logging
[ https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960144#comment-14960144 ] Hadoop QA commented on HDFS-7964: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 21m 28s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 11 new or modified test files. | | {color:red}-1{color} | javac | 1m 55s | The patch appears to cause the build to fail. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766854/HDFS-7964.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / cf23f2c | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13022/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13022/console | This message was automatically generated. > Add support for async edit logging > -- > > Key: HDFS-7964 > URL: https://issues.apache.org/jira/browse/HDFS-7964 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 2.0.2-alpha >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: HDFS-7964.patch, HDFS-7964.patch > > > Edit logging is a major source of contention within the NN. logEdit is > called within the namespace write lock, while logSync is called outside of the > lock to allow greater concurrency. The handler thread remains busy until > logSync returns to provide the client with a durability guarantee for the > response. > Write heavy RPC load and/or slow IO causes handlers to stall in logSync. > Although the write lock is not held, readers are limited/starved and the call > queue fills. Combining an edit log thread with postponed RPC responses from > HADOOP-10300 will provide the same durability guarantee but immediately free > up the handlers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8425) [umbrella] Performance tuning, investigation and optimization for erasure coding
[ https://issues.apache.org/jira/browse/HDFS-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Fukudome updated HDFS-8425: -- Attachment: testdfsio-read-mbsec.png testdfsio-write-mbsec.png
Hi. I have run TestDFSIO on both a normal directory and a directory with an EC policy set. I attached two charts which show the write and read throughput (MB/sec) of replicated files and EC files, respectively. The throughputs are calculated by dividing the total bytes of TestDFSIO's data by the total elapsed time. In summary, writing EC files achieves better throughput than writing replicated files, and reading EC files performs the same as reading replicated files, though the DataNodes' average CPU usage when writing EC files rose by 5.5% compared to writing replicated files (from 9.8% to 15.3%).
The specification of our test cluster is below (per-server info):
|| Number of DataNodes | 20 |
|| CPU | Xeon E5-2630L 2.00GHz/2CPU |
|| RAM | 64GB |
|| Disk | SATA 300 |
Our test cluster was built from trunk code; its commit revision id is r30e2f836a26490a24c7ddea754dd19f95b24bbc8. These are initial performance test results, and we are still working on further tests. Please let me know if the initial results make sense to you. Any advice is welcome! Thank you.
> [umbrella] Performance tuning, investigation and optimization for erasure > coding > > > Key: HDFS-8425 > URL: https://issues.apache.org/jira/browse/HDFS-8425 > Project: Hadoop HDFS > Issue Type: Sub-task >Affects Versions: HDFS-7285 >Reporter: GAO Rui > Attachments: testClientWriteReadFile_v1.pdf, > testdfsio-read-mbsec.png, testdfsio-write-mbsec.png > > > This {{umbrella}} jira aims to track performance tuning, investigation and > optimization for erasure coding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
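For reference, the throughput figure described above can be reproduced from the TestDFSIO totals roughly as follows; the input values here are placeholders, not the measured results:
{code}
// Placeholder numbers, not the actual measurements: aggregate throughput is
// the total bytes moved by all TestDFSIO tasks divided by total elapsed time.
long totalBytes = 64L * 1024 * 1024 * 1024; // e.g. 64 files x 1 GB each
double totalSeconds = 512.0;                // total elapsed time of the run
double mbPerSec = (totalBytes / (1024.0 * 1024.0)) / totalSeconds;
System.out.printf("aggregate throughput: %.1f MB/sec%n", mbPerSec);
{code}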
[jira] [Commented] (HDFS-9173) Erasure Coding: Lease recovery for striped file
[ https://issues.apache.org/jira/browse/HDFS-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960115#comment-14960115 ] Walter Su commented on HDFS-9173: -
bq. #2. about {{syncBlockFinalized}}, {{syncBlockUnfinalized}}.
Unfortunately, the logic isn't quite the same between contiguous and striped. For example,
{noformat}
blk_0 blk_1 blk_2 blk_3 blk_4 blk_5 blk_6 blk_7 blk_8
 64k   64k   64k   64k   64k   64k   64k   64k   64k
 64k   ___   ___   ___   ___   64k   64k   64k   64k
{noformat}
blk_0 and blk_5~8 are finalized; blk_1~4 are RBW. In this case, the last cell of blk_5 is garbage and should be truncated, because the data in the last stripe isn't contiguous and can't be decoded. So I have to keep blk_0, truncate blk_5, and re-encode blk_6~8 (I'm talking about the last cells; it's done in step 4). So I don't care too much whether a replica is finalized or not. For contiguous blocks, if recovery finds a finalized replica, it keeps the finalized one.
> Erasure Coding: Lease recovery for striped file > --- > > Key: HDFS-9173 > URL: https://issues.apache.org/jira/browse/HDFS-9173 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su >Assignee: Walter Su > Attachments: HDFS-9173.00.wip.patch, HDFS-9173.01.patch, > HDFS-9173.02.step125.patch, HDFS-9173.03.patch, HDFS-9173.04.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7964) Add support for async edit logging
[ https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated HDFS-7964: - Status: Patch Available (was: Open) > Add support for async edit logging > -- > > Key: HDFS-7964 > URL: https://issues.apache.org/jira/browse/HDFS-7964 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 2.0.2-alpha >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: HDFS-7964.patch, HDFS-7964.patch > > > Edit logging is a major source of contention within the NN. logEdit is > called within the namespace write lock, while logSync is called outside of the > lock to allow greater concurrency. The handler thread remains busy until > logSync returns to provide the client with a durability guarantee for the > response. > Write heavy RPC load and/or slow IO causes handlers to stall in logSync. > Although the write lock is not held, readers are limited/starved and the call > queue fills. Combining an edit log thread with postponed RPC responses from > HADOOP-10300 will provide the same durability guarantee but immediately free > up the handlers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7964) Add support for async edit logging
[ https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960114#comment-14960114 ] Yi Liu commented on HDFS-7964: --
Thanks [~daryn] for the work. Further comments:
*1.* In FSEditLogAsync#run
{code}
@Override
public void run() {
  try {
    while (true) {
      if (doSync) {
        ...
        logSync(getLastWrittenTxId());
        ...
{code}
I think it's better to pass the txid of the current edit to {{logSync}}; there is no need to wait for all txids to be written. That would be more efficient, and the client would get a faster response, right?
*2.*
{code}
+ editsBatchedInSync = txid - synctxid - 1;
{code}
Shouldn't it be "txid - synctxid"? txid is the max txid written, and synctxid is the max txid already synced; suppose txid = 20 and synctxid = 10, then editsBatchedInSync should be (txid - synctxid) = (20 - 10) = 10. You can also see this from the existing log message:
{code}
final String msg = "Could not sync enough journals to persistent storage "
    + "due to " + e.getMessage() + ". "
    + "Unsynced transactions: " + (txid - synctxid);
{code}
*3.*
{code}
-log4j.rootLogger=OFF, CONSOLE
+log4j.rootLogger=DEBUG, CONSOLE
{code}
Any reason to change this?
*4.*
{code}
call.abortResponse(syncEx);
{code}
It seems this code isn't available yet?
> Add support for async edit logging > -- > > Key: HDFS-7964 > URL: https://issues.apache.org/jira/browse/HDFS-7964 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 2.0.2-alpha >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: HDFS-7964.patch, HDFS-7964.patch > > > Edit logging is a major source of contention within the NN. logEdit is > called within the namespace write lock, while logSync is called outside of the > lock to allow greater concurrency. The handler thread remains busy until > logSync returns to provide the client with a durability guarantee for the > response. > Write heavy RPC load and/or slow IO causes handlers to stall in logSync. > Although the write lock is not held, readers are limited/starved and the call > queue fills. Combining an edit log thread with postponed RPC responses from > HADOOP-10300 will provide the same durability guarantee but immediately free > up the handlers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
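Point 2's arithmetic spelled out, using the numbers from the comment; this is plain Java for illustration, not the patch itself:
{code}
// Worked example of the off-by-one in point 2 (values from the comment):
long txid = 20;     // max txid written
long synctxid = 10; // max txid already synced
long editsBatchedInSync = txid - synctxid; // = 10; "txid - synctxid - 1" would give 9
{code}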
[jira] [Commented] (HDFS-9253) Refactor tests of libhdfs into a directory
[ https://issues.apache.org/jira/browse/HDFS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960100#comment-14960100 ] Hadoop QA commented on HDFS-9253: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 3s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 8m 12s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 26s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 30s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | native | 3m 9s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 0m 35s | Tests failed in hadoop-hdfs-native-client. | | | | 40m 57s | | \\ \\ || Reason || Tests || | Failed build | hadoop-hdfs-native-client | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766943/HDFS-9253.001.patch | | Optional Tests | javadoc javac unit | | git revision | trunk / cf23f2c | | hadoop-hdfs-native-client test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13020/artifact/patchprocess/testrun_hadoop-hdfs-native-client.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13020/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13020/console | This message was automatically generated. > Refactor tests of libhdfs into a directory > -- > > Key: HDFS-9253 > URL: https://issues.apache.org/jira/browse/HDFS-9253 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9253.000.patch, HDFS-9253.001.patch > > > This jira proposes to refactor the current tests in libhdfs into a separate > directory. The refactor opens up the opportunity to reuse tests in libhdfs, > libwebhdfs and libhdfspp in HDFS-8707 and to also allow cross-validation of > different implementations of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9173) Erasure Coding: Lease recovery for striped file
[ https://issues.apache.org/jira/browse/HDFS-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960092#comment-14960092 ] Walter Su commented on HDFS-9173: -
bq. #3, In StripedRecoveryTask#recover we are calling callInitReplicaRecovery twice. Is the second call necessary?
It's about block state changes. For RecoveryTaskContiguous, it's 3 RBW --> 3 RUR --> 3 Finalized. For RecoveryTaskStriped, it's 9 RBW --> 6 RUR + 3 RBW --> 6 Finalized + 3 RBW; the total number of RPC calls is 9+6+6=21.
The 2nd option is 9 RBW --> 9 RUR --> 9 Finalized; the total number of RPC calls is 9+9=18.
The 3rd option is 9 RBW --> 9 RUR --> 6 Finalized + 3 RUR, where we leave the 3 RUR unfinalized; the total number of RPC calls is 9+6=15. The RUR will be removed as soon as the block is completed (the 6 finalized replicas must be reported).
I still choose the 1st option:
1. The most important reason is step 3: a RUR can't be appended with more data.
2. We need to remove the dead/stale ones. We must make sure we have 6 healthy RBWs that can be converted to RUR; if not, we shouldn't convert them to RUR too early.
3. With the 2nd option, it's unnecessary to update the 3 smallest RBWs' lengths.
4. With the 3rd option, if the 6 finalized replicas aren't all reported, we can't start a new recovery.
5. I haven't thought about how to address the issue of 2 recovery tasks running at the same time; the 1st option is a good start.
I must say reason #1 beats the rest. You can check the 01 patch, which includes step 3.
> Erasure Coding: Lease recovery for striped file > --- > > Key: HDFS-9173 > URL: https://issues.apache.org/jira/browse/HDFS-9173 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su >Assignee: Walter Su > Attachments: HDFS-9173.00.wip.patch, HDFS-9173.01.patch, > HDFS-9173.02.step125.patch, HDFS-9173.03.patch, HDFS-9173.04.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8972) EINVAL Invalid argument when RAM_DISK usage 90%+
[ https://issues.apache.org/jira/browse/HDFS-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Chen updated HDFS-8972: -- Description:
There is a directory that uses the LAZY_PERSIST policy, so the "df" command shows tmpfs usage >= 90%. When running a Spark, Hive, or MapReduce application, the DataNode emits the following exception:
{code}
2015-08-26 17:37:34,123 WARN org.apache.hadoop.io.ReadaheadPool: Failed readahead on null
EINVAL: Invalid argument
        at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
        at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
        at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:206)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}
And the application is about 25% slower than when the exception does not occur.
Regards

was:
the directory which is use LAZY_PERSIST policy , so use "df" command look up tmpfs is usage >=90% , run spark,hive or mapreduce application , Datanode come out following exception
{code}
2015-08-26 17:37:34,123 WARN org.apache.hadoop.io.ReadaheadPool: Failed readahead on null
EINVAL: Invalid argument
        at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
        at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
        at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:206)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}
And the application is slowly than without exception 25%
Regards

> EINVAL Invalid argument when RAM_DISK usage 90%+
>
> Key: HDFS-8972
> URL: https://issues.apache.org/jira/browse/HDFS-8972
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Xu Chen
> Assignee: Jagadesh Kiran N
> Priority: Critical
>
> There is a directory that uses the LAZY_PERSIST policy, so the "df" command shows tmpfs usage >= 90%. When running a Spark, Hive, or MapReduce application, the DataNode emits the following exception:
> {code}
> 2015-08-26 17:37:34,123 WARN org.apache.hadoop.io.ReadaheadPool: Failed readahead on null
> EINVAL: Invalid argument
>         at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
>         at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>         at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>         at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:206)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> And the application is about 25% slower than when the exception does not occur.
> Regards -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9252) Change TestFileTruncate to FsDatasetTestUtils to get block file size and genstamp.
[ https://issues.apache.org/jira/browse/HDFS-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-9252: Status: Patch Available (was: Open) > Change TestFileTruncate to FsDatasetTestUtils to get block file size and > genstamp. > -- > > Key: HDFS-9252 > URL: https://issues.apache.org/jira/browse/HDFS-9252 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.7.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-9252.00.patch > > > {{TestFileTruncate}} verifies block size and genstamp by directly accessing > the local filesystem, e.g.: > {code} > assertTrue(cluster.getBlockMetadataFile(dn0, >newBlock.getBlock()).getName().endsWith( >newBlock.getBlock().getGenerationStamp() + ".meta")); > {code} > Let's abstract the fsdataset-specific logic behind FsDatasetTestUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8647) Abstract BlockManager's rack policy into BlockPlacementPolicy
[ https://issues.apache.org/jira/browse/HDFS-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960050#comment-14960050 ] Ming Ma commented on HDFS-8647: --- [~brahmareddy], can you please check again whether the TestBalancer and other test failures are related? Somehow these tests timed out with the patch. Thanks! > Abstract BlockManager's rack policy into BlockPlacementPolicy > - > > Key: HDFS-8647 > URL: https://issues.apache.org/jira/browse/HDFS-8647 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Brahma Reddy Battula > Attachments: HDFS-8647-001.patch, HDFS-8647-002.patch, > HDFS-8647-003.patch, HDFS-8647-004.patch, HDFS-8647-004.patch, > HDFS-8647-005.patch, HDFS-8647-006.patch > > > Sometimes we want to have the namenode use an alternative block placement policy, > such as upgrade domains in HDFS-7541. > BlockManager has built-in assumptions about rack policy in functions such as > useDelHint and blockHasEnoughRacks. That means when we have a new block placement > policy, we need to modify BlockManager to account for the new policy. Ideally > BlockManager should ask the BlockPlacementPolicy object instead. That will allow > us to provide a new BlockPlacementPolicy without changing BlockManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
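To illustrate the inversion the description proposes, a sketch with hypothetical, simplified signatures; the real method shapes are whatever the patch settles on:
{code}
// Hypothetical, simplified shape: BlockManager delegates policy questions
// to the placement policy instead of hard-coding rack assumptions.
interface BlockPlacementPolicySketch {
  // would subsume BlockManager#blockHasEnoughRacks
  boolean isPlacementPolicySatisfied(String[] replicaRacks);
  // would subsume BlockManager#useDelHint
  boolean useDelHint(String delHintRack, String[] remainingRacks);
}
{code}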
[jira] [Commented] (HDFS-9250) LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty
[ https://issues.apache.org/jira/browse/HDFS-9250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960044#comment-14960044 ] Xiao Chen commented on HDFS-9250: - Hi Andrew, thanks for the comments and additional information. I will investigate it further. > LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty > --- > > Key: HDFS-9250 > URL: https://issues.apache.org/jira/browse/HDFS-9250 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Reporter: Xiao Chen >Assignee: Xiao Chen > Attachments: HDFS-9250.001.patch > > > We may see the following exception: > {noformat} > java.lang.ArrayStoreException > at java.util.ArrayList.toArray(ArrayList.java:389) > at > org.apache.hadoop.hdfs.protocol.LocatedBlock.addCachedLoc(LocatedBlock.java:205) > at > org.apache.hadoop.hdfs.server.namenode.CacheManager.setCachedLocations(CacheManager.java:907) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1974) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873) > {noformat} > The cause is that in LocatedBlock.java, when {{addCachedLoc}} is called: > - The passed-in parameter {{loc}}, which is of type {{DatanodeDescriptor}}, is added > to {{cachedList}} > - {{cachedList}} was assigned to {{EMPTY_LOCS}}, which is of type > {{DatanodeInfoWithStorage}}. > Both {{DatanodeDescriptor}} and {{DatanodeInfoWithStorage}} are subclasses of > {{DatanodeInfo}} but do not inherit from each other, resulting in the > ArrayStoreException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9254) HDFS Secure Mode Documentation updates
Arpit Agarwal created HDFS-9254: --- Summary: HDFS Secure Mode Documentation updates Key: HDFS-9254 URL: https://issues.apache.org/jira/browse/HDFS-9254 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.7.1 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Some Kerberos configuration parameters are not documented well enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (HDFS-9254) HDFS Secure Mode Documentation updates
[ https://issues.apache.org/jira/browse/HDFS-9254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-9254 started by Arpit Agarwal. --- > HDFS Secure Mode Documentation updates > -- > > Key: HDFS-9254 > URL: https://issues.apache.org/jira/browse/HDFS-9254 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Affects Versions: 2.7.1 >Reporter: Arpit Agarwal >Assignee: Arpit Agarwal > > Some Kerberos configuration parameters are not documented well enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9250) LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty
[ https://issues.apache.org/jira/browse/HDFS-9250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960039#comment-14960039 ] Andrew Wang commented on HDFS-9250: --- It may also be good to add a Precondition check somewhere in addCachedLoc so we can more easily debug this in the future, and as a form of documentation about this assumption. > LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty > --- > > Key: HDFS-9250 > URL: https://issues.apache.org/jira/browse/HDFS-9250 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Reporter: Xiao Chen >Assignee: Xiao Chen > Attachments: HDFS-9250.001.patch > > > We may see the following exception: > {noformat} > java.lang.ArrayStoreException > at java.util.ArrayList.toArray(ArrayList.java:389) > at > org.apache.hadoop.hdfs.protocol.LocatedBlock.addCachedLoc(LocatedBlock.java:205) > at > org.apache.hadoop.hdfs.server.namenode.CacheManager.setCachedLocations(CacheManager.java:907) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1974) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873) > {noformat} > The cause is that in LocatedBlock.java, when {{addCachedLoc}} is called: > - The passed-in parameter {{loc}}, which is of type {{DatanodeDescriptor}}, is added > to {{cachedList}} > - {{cachedList}} was assigned to {{EMPTY_LOCS}}, which is of type > {{DatanodeInfoWithStorage}}. > Both {{DatanodeDescriptor}} and {{DatanodeInfoWithStorage}} are subclasses of > {{DatanodeInfo}} but do not inherit from each other, resulting in the > ArrayStoreException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9250) LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty
[ https://issues.apache.org/jira/browse/HDFS-9250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960038#comment-14960038 ] Andrew Wang commented on HDFS-9250: --- Hi Xiao, thanks for working on this. One question about this: we're not supposed to add cached locations that do not also have a backing disk replica, so in your test case, {{dn}} would be present in {{locs}} already. If I edit your test case to do this, it passes without the change. This is probably related to HDFS-8646, which I worked on before; we missed some places where cache state could get out of sync with replica state. I thought I added enough pruning to safeguard against this, but maybe I missed a place. Could you investigate? > LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty > --- > > Key: HDFS-9250 > URL: https://issues.apache.org/jira/browse/HDFS-9250 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Reporter: Xiao Chen >Assignee: Xiao Chen > Attachments: HDFS-9250.001.patch > > > We may see the following exception: > {noformat} > java.lang.ArrayStoreException > at java.util.ArrayList.toArray(ArrayList.java:389) > at > org.apache.hadoop.hdfs.protocol.LocatedBlock.addCachedLoc(LocatedBlock.java:205) > at > org.apache.hadoop.hdfs.server.namenode.CacheManager.setCachedLocations(CacheManager.java:907) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1974) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873) > {noformat} > The cause is that in LocatedBlock.java, when {{addCachedLoc}} is called: > - The passed-in parameter {{loc}}, which is of type {{DatanodeDescriptor}}, is added > to {{cachedList}} > - {{cachedList}} was assigned to {{EMPTY_LOCS}}, which is of type > {{DatanodeInfoWithStorage}}. > Both {{DatanodeDescriptor}} and {{DatanodeInfoWithStorage}} are subclasses of > {{DatanodeInfo}} but do not inherit from each other, resulting in the > ArrayStoreException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
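The failure mode in the description reduces to covariant array typing. A standalone toy that reproduces it; the classes below are stand-ins, not the HDFS types:
{code}
import java.util.ArrayList;
import java.util.List;

// Stand-ins mirroring DatanodeInfo and its two unrelated subclasses.
class Info {}
class Descriptor extends Info {}   // stands in for DatanodeDescriptor
class WithStorage extends Info {}  // stands in for DatanodeInfoWithStorage

public class ArrayStoreDemo {
  static final WithStorage[] EMPTY_LOCS = new WithStorage[0];

  public static void main(String[] args) {
    List<Info> cachedList = new ArrayList<>();
    cachedList.add(new Descriptor());
    // toArray allocates a WithStorage[] (the runtime type of EMPTY_LOCS) and
    // tries to store a Descriptor into it, throwing ArrayStoreException.
    Info[] locs = cachedList.toArray(EMPTY_LOCS);
  }
}
{code}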
[jira] [Commented] (HDFS-9245) Fix findbugs warnings in hdfs-nfs/WriteCtx
[ https://issues.apache.org/jira/browse/HDFS-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960033#comment-14960033 ] Hadoop QA commented on HDFS-9245: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 6s | Pre-patch trunk has 2 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 8m 49s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 22s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 25s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 36s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 55s | The patch does not introduce any new Findbugs (version 3.0.0) warnings, and fixes 2 pre-existing warnings. | | {color:green}+1{color} | native | 3m 40s | Pre-build of native portion | | {color:green}+1{color} | hdfs tests | 1m 53s | Tests passed in hadoop-hdfs-nfs. | | | | 47m 54s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766930/HDFS-9245.000.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cf23f2c | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13019/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs-nfs.html | | hadoop-hdfs-nfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13019/artifact/patchprocess/testrun_hadoop-hdfs-nfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13019/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13019/console | This message was automatically generated. > Fix findbugs warnings in hdfs-nfs/WriteCtx > -- > > Key: HDFS-9245 > URL: https://issues.apache.org/jira/browse/HDFS-9245 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HDFS-9245.000.patch > > > There are findbugs warnings as follows, introduced by [HDFS-9092]. > It seems fine to ignore them by writing a filter rule in the > {{findbugsExcludeFile.xml}} file.
> {code:xml}
> <BugInstance instanceHash="592511935f7cb9e5f97ef4c99a6c46c2" instanceOccurrenceNum="0"
>     priority="2" abbrev="IS" type="IS2_INCONSISTENT_SYNC" cweid="366" instanceOccurrenceMax="0">
>   <ShortMessage>Inconsistent synchronization</ShortMessage>
>   <LongMessage>Inconsistent synchronization of
>     org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.offset; locked 75% of time</LongMessage>
>   <Class classname="org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx">
>     <SourceLine classname="org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx" start="40" end="314"
>         sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" sourcefile="WriteCtx.java">
>       <Message>At WriteCtx.java:[lines 40-314]</Message>
>     </SourceLine>
>     <Message>In class org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx</Message>
>   </Class>
> </BugInstance>
> {code}
> and
> {code:xml}
> <BugInstance instanceHash="4f3daa339eb819220f26c998369b02fe" instanceOccurrenceNum="0"
>     priority="2" abbrev="IS" type="IS2_INCONSISTENT_SYNC" cweid="366" instanceOccurrenceMax="0">
>   <ShortMessage>Inconsistent synchronization</ShortMessage>
>   <LongMessage>Inconsistent synchronization of
>     org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.originalCount; locked 50% of time</LongMessage>
>   <Class classname="org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx">
>     <SourceLine classname="org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx" start="40" end="314"
>         sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" sourcefile="WriteCtx.java">
>       <Message>At WriteCtx.java:[lines 40-314]</Message>
>     </SourceLine>
>     <Message>In class org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx</Message>
>   </Class>
>   <Field classname="org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx"
>       name="originalCount" primary="true" signature="I">
>     <SourceLine classname="org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx"
>         sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" sourcefile="WriteCtx.java">
>       <Message>In WriteCtx.java</Message>
>     </SourceLine>
>     <Message>Field org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.originalCount</Message>
>   </Field>
> </BugInstance>
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
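For context, an exclude rule of the kind mentioned in the description would look roughly like the following. This is only a sketch of the standard FindBugs exclude-filter syntax, not the contents of HDFS-9245.000.patch (which, per the QA report above, fixes the warnings rather than suppressing them).
{code:xml}
<FindBugsFilter>
  <Match>
    <Class name="org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx" />
    <Or>
      <Field name="offset" />
      <Field name="originalCount" />
    </Or>
    <Bug pattern="IS2_INCONSISTENT_SYNC" />
  </Match>
</FindBugsFilter>
{code}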
[jira] [Updated] (HDFS-9231) fsck doesn't explicitly list when Bad Replicas/Blocks are in a snapshot
[ https://issues.apache.org/jira/browse/HDFS-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Chen updated HDFS-9231: Status: Open (was: Patch Available) > fsck doesn't explicitly list when Bad Replicas/Blocks are in a snapshot > --- > > Key: HDFS-9231 > URL: https://issues.apache.org/jira/browse/HDFS-9231 > Project: Hadoop HDFS > Issue Type: Bug > Components: snapshots >Reporter: Xiao Chen >Assignee: Xiao Chen > Attachments: HDFS-9231.001.patch, HDFS-9231.002.patch, > HDFS-9231.003.patch > > > For snapshot files, fsck shows corrupt blocks with the original file dir > instead of the snapshot dir. > This can be confusing since even when the original file is deleted, a new > fsck run will still show that file as corrupted although what's actually > corrupted is the snapshot. > This is true even when given the -includeSnapshots option. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9129) Move the safemode block count into BlockManager
[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960018#comment-14960018 ] Hadoop QA commented on HDFS-9129: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 50s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 8m 12s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 37s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 27s | The applied patch generated 6 new checkstyle issues (total was 626, now 577). | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 2m 38s | The patch appears to introduce 3 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 3m 19s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 49m 13s | Tests failed in hadoop-hdfs. | | | | 96m 55s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs | | Failed unit tests | hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766923/HDFS-9129.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / a121fa1 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13015/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13015/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13015/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13015/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13015/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13015/console | This message was automatically generated. > Move the safemode block count into BlockManager > --- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9205) Do not schedule corrupt blocks for replication
[ https://issues.apache.org/jira/browse/HDFS-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-9205: -- Fix Version/s: (was: 3.0.0) 2.8.0 Merged this to branch-2. > Do not schedule corrupt blocks for replication > -- > > Key: HDFS-9205 > URL: https://issues.apache.org/jira/browse/HDFS-9205 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Minor > Fix For: 2.8.0 > > Attachments: h9205_20151007.patch, h9205_20151007b.patch, > h9205_20151008.patch, h9205_20151009.patch, h9205_20151009b.patch, > h9205_20151013.patch, h9205_20151015.patch > > > Corrupted blocks by definition are blocks cannot be read. As a consequence, > they cannot be replicated. In UnderReplicatedBlocks, there is a queue for > QUEUE_WITH_CORRUPT_BLOCKS and chooseUnderReplicatedBlocks may choose blocks > from it. It seems that scheduling corrupted block for replication is wasting > resource and potentially slow down replication for the higher priority blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-3059) ssl-server.xml causes NullPointer
[ https://issues.apache.org/jira/browse/HDFS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960014#comment-14960014 ] Xiao Chen commented on HDFS-3059: - Thanks Yongjun for the comments. The attached patch 06 addresses your suggestions.
{quote}
3. Would you please explain why the following comments? maybe add the explanation as an addition to the comment.
{quote}
This was added because I hit the same NPE as described, when running the secondarynamenode (2NN). Running a command like {{hdfs secondarynamenode -checkpoint}} with kerberos enabled will fail with the same NPE thrown. The cause is that the 2NN web server is needed when starting as a daemon, to show status/metrics etc., and starting it needs to get credentials. When running from the shell, the environment doesn't have the credentials, so the user is prompted for a password. When the password is not correct, {{getPassword}} returns null, causing the NPE. Note that clients aren't supposed to know the password, but we should definitely allow them to checkpoint. Since the metrics etc. are not needed when running the 2NN from the shell, I think it makes sense to not start the web server at all. I have updated the comments as below, to give more information.
{code}
// The web server is only needed when starting SNN as a daemon,
// and not needed if called from a shell command. Starting the web server
// from the shell may fail when getting credentials, if the environment is not
// set up for it, which is usually the case.
{code}
> ssl-server.xml causes NullPointer
> -
>
> Key: HDFS-3059
> URL: https://issues.apache.org/jira/browse/HDFS-3059
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, security
>Affects Versions: 2.7.1
> Environment: in core-site.xml:
> {code:xml}
> <property>
>   <name>hadoop.security.authentication</name>
>   <value>kerberos</value>
> </property>
> <property>
>   <name>hadoop.security.authorization</name>
>   <value>true</value>
> </property>
> {code}
> in hdfs-site.xml:
> {code:xml}
> <property>
>   <name>dfs.https.server.keystore.resource</name>
>   <value>/etc/hadoop/conf/ssl-server.xml</value>
> </property>
> <property>
>   <name>dfs.https.enable</name>
>   <value>true</value>
> </property>
> ...other security props
> {code}
>Reporter: Evert Lammerts
>Assignee: Xiao Chen
>Priority: Minor
> Labels: BB2015-05-TBR
> Attachments: HDFS-3059.02.patch, HDFS-3059.03.patch, HDFS-3059.04.patch, HDFS-3059.05.patch, HDFS-3059.06.patch, HDFS-3059.patch, HDFS-3059.patch.2
>
>
> If ssl is enabled (dfs.https.enable) but ssl-server.xml is not available, a DN will crash during startup while setting up an SSL socket with a NullPointerException:
> {noformat}12/03/07 17:08:36 DEBUG security.Krb5AndCertsSslSocketConnector: useKerb = false, useCerts = true
> jetty.ssl.password : jetty.ssl.keypassword : 12/03/07 17:08:36 INFO mortbay.log: jetty-6.1.26.cloudera.1
> 12/03/07 17:08:36 INFO mortbay.log: Started selectchannelconnec...@p-worker35.alley.sara.nl:1006
> 12/03/07 17:08:36 DEBUG security.Krb5AndCertsSslSocketConnector: Creating new KrbServerSocket for: 0.0.0.0
> 12/03/07 17:08:36 WARN mortbay.log: java.lang.NullPointerException
> 12/03/07 17:08:36 WARN mortbay.log: failed Krb5AndCertsSslSocketConnector@0.0.0.0:50475: java.io.IOException: !JsseListener: java.lang.NullPointerException
> 12/03/07 17:08:36 WARN mortbay.log: failed Server@604788d5: java.io.IOException: !JsseListener: java.lang.NullPointerException
> 12/03/07 17:08:36 INFO mortbay.log: Stopped Krb5AndCertsSslSocketConnector@0.0.0.0:50475
> 12/03/07 17:08:36 INFO mortbay.log: Stopped selectchannelconnec...@p-worker35.alley.sara.nl:1006
> 12/03/07 17:08:37 INFO datanode.DataNode: Waiting for threadgroup to exit, active threads is
0{noformat} > The same happens if I set an absolute path to an existing > dfs.https.server.keystore.resource - in this case the file cannot be found > but not even a WARN is given. > Since in dfs.https.server.keystore.resource we know we need to have 4 > properties specified (ssl.server.truststore.location, > ssl.server.keystore.location, ssl.server.keystore.password, and > ssl.server.keystore.keypassword) we should check if they are set and throw an > IOException if they are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
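As a side note on the {{getPassword}} behavior described in the comment above, a null-safe wrapper looks like the following; this is a minimal sketch against Hadoop's {{Configuration.getPassword}} API, and the property key is illustrative.
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

public class KeystorePasswordSketch {
  static char[] requirePassword(Configuration conf, String key) throws IOException {
    // getPassword consults configured credential providers and falls back to
    // the plain configuration; it may return null when no password is found.
    char[] password = conf.getPassword(key);
    if (password == null) {
      throw new IOException("No password found for " + key);
    }
    return password;
  }
}
{code}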
[jira] [Commented] (HDFS-9251) Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in tests code.
[ https://issues.apache.org/jira/browse/HDFS-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960011#comment-14960011 ] Hadoop QA commented on HDFS-9251: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 7m 53s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 9m 2s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 34s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 43s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 54s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 1m 13s | Pre-build of native portion | | {color:green}+1{color} | hdfs tests | 55m 13s | Tests passed in hadoop-hdfs. | | | | 80m 35s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766916/HDFS-9251.01.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / cf23f2c | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13016/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13016/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13016/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13016/console | This message was automatically generated. > Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly > creating Files in tests code. > --- > > Key: HDFS-9251 > URL: https://issues.apache.org/jira/browse/HDFS-9251 > Project: Hadoop HDFS > Issue Type: Improvement > Components: HDFS >Affects Versions: 2.7.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-9251.00.patch, HDFS-9251.01.patch > > > In {{TestWriteToReplica}} and {{TestFsDatasetImpl}}, tests directly creates > block and metadata files: > {code} > replicaInfo.getBlockFile().createNewFile(); > replicaInfo.getMetaFile().createNewFile(); > {code} > It leaks the implementation details of {{FsDatasetImpl}}. This JIRA proposes > to use {{FsDatasetImplTestUtils}} (HDFS-9188) to create replicas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9249) NPE thrown if an IOException is thrown in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960004#comment-14960004 ] Wei-Chiu Chuang commented on HDFS-9249: --- This is more of a supportability improvement patch, so I deem there is no need for a test case. The Findbugs warning is unrelated. I cannot reproduce the failed test, and it also looks unrelated.
> NPE thrown if an IOException is thrown in NameNode.
> -
>
> Key: HDFS-9249
> URL: https://issues.apache.org/jira/browse/HDFS-9249
> Project: Hadoop HDFS
> Issue Type: Bug
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Minor
> Labels: supportability
> Attachments: HDFS-9249.001.patch
>
>
> This issue was found when running test case TestBackupNode.testCheckpointNode, but upon closer look, the problem is not due to the test case.
> Looks like an IOException was thrown in
> {code}
> try {
>   initializeGenericKeys(conf, nsId, namenodeId);
>   initialize(conf);
>   try {
>     haContext.writeLock();
>     state.prepareToEnterState(haContext);
>     state.enterState(haContext);
>   } finally {
>     haContext.writeUnlock();
>   }
> {code}
> causing the namenode to stop, but the namesystem was not yet properly instantiated, causing an NPE.
> I tried to reproduce locally, but to no avail.
> Because I could not reproduce the bug, and the log does not indicate what caused the IOException, I suggest making this a supportability JIRA to log the exception for future improvement.
> Stacktrace
> java.lang.NullPointerException: null
> at org.apache.hadoop.hdfs.server.namenode.NameNode.getFSImage(NameNode.java:906)
> at org.apache.hadoop.hdfs.server.namenode.BackupNode.stop(BackupNode.java:210)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:827)
> at org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1474)
> at org.apache.hadoop.hdfs.server.namenode.TestBackupNode.startBackupNode(TestBackupNode.java:102)
> at org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint(TestBackupNode.java:298)
> at org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpointNode(TestBackupNode.java:130)
> The last few lines of log:
> 2015-10-14 19:45:07,807 INFO namenode.NameNode (NameNode.java:createNameNode(1422)) - createNameNode [-checkpoint]
> 2015-10-14 19:45:07,807 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:init(158)) - CheckpointNode metrics system started (again)
> 2015-10-14 19:45:07,808 INFO namenode.NameNode (NameNode.java:setClientNamenodeAddress(402)) - fs.defaultFS is hdfs://localhost:37835
> 2015-10-14 19:45:07,808 INFO namenode.NameNode (NameNode.java:setClientNamenodeAddress(422)) - Clients are to use localhost:37835 to access this namenode/service.
> 2015-10-14 19:45:07,810 INFO hdfs.MiniDFSCluster > (MiniDFSCluster.java:shutdown(1708)) - Shutting down the Mini HDFS Cluster > 2015-10-14 19:45:07,810 INFO namenode.FSNamesystem > (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for > active state > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:endCurrentLogSegment(1228)) - Ending log segment 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5306)) - NameNodeEditLogRoller was interrupted, exiting > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:printStatistics(703)) - Number of transactions: 3 Total time > for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of > syncs: 4 SyncTimes(ms): 2 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5373)) - LazyPersistFileScrubber was interrupted, > exiting > 2015-10-14 19:45:07,822 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_001-003 > 2015-10-14 19:45:07,835 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_001-0
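A minimal sketch of the kind of supportability change proposed above: log the IOException that aborts construction before tearing down partially-built state. The class and method names here are hypothetical; the actual HDFS-9249 patch may be shaped differently.
{code}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class StartupSketch {
  private static final Log LOG = LogFactory.getLog(StartupSketch.class);

  void start() throws IOException {
    try {
      initialize();  // may throw IOException mid-construction
    } catch (IOException e) {
      LOG.error("Initialization failed; stopping.", e);  // surface the root cause
      stopSafely();  // must tolerate partially-built state (e.g. null namesystem)
      throw e;
    }
  }

  void initialize() throws IOException { /* ... */ }

  void stopSafely() { /* null-check fields that may not be set yet */ }
}
{code}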
[jira] [Updated] (HDFS-3059) ssl-server.xml causes NullPointer
[ https://issues.apache.org/jira/browse/HDFS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Chen updated HDFS-3059: Attachment: HDFS-3059.06.patch > ssl-server.xml causes NullPointer > - > > Key: HDFS-3059 > URL: https://issues.apache.org/jira/browse/HDFS-3059 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, security >Affects Versions: 2.7.1 > Environment: in core-site.xml: > {code:xml} > > hadoop.security.authentication > kerberos > > > hadoop.security.authorization > true > > {code} > in hdfs-site.xml: > {code:xml} > > dfs.https.server.keystore.resource > /etc/hadoop/conf/ssl-server.xml > > > dfs.https.enable > true > > > ...other security props > > {code} >Reporter: Evert Lammerts >Assignee: Xiao Chen >Priority: Minor > Labels: BB2015-05-TBR > Attachments: HDFS-3059.02.patch, HDFS-3059.03.patch, > HDFS-3059.04.patch, HDFS-3059.05.patch, HDFS-3059.06.patch, HDFS-3059.patch, > HDFS-3059.patch.2 > > > If ssl is enabled (dfs.https.enable) but ssl-server.xml is not available, a > DN will crash during startup while setting up an SSL socket with a > NullPointerException: > {noformat}12/03/07 17:08:36 DEBUG security.Krb5AndCertsSslSocketConnector: > useKerb = false, useCerts = true > jetty.ssl.password : jetty.ssl.keypassword : 12/03/07 17:08:36 INFO > mortbay.log: jetty-6.1.26.cloudera.1 > 12/03/07 17:08:36 INFO mortbay.log: Started > selectchannelconnec...@p-worker35.alley.sara.nl:1006 > 12/03/07 17:08:36 DEBUG security.Krb5AndCertsSslSocketConnector: Creating new > KrbServerSocket for: 0.0.0.0 > 12/03/07 17:08:36 WARN mortbay.log: java.lang.NullPointerException > 12/03/07 17:08:36 WARN mortbay.log: failed > Krb5AndCertsSslSocketConnector@0.0.0.0:50475: java.io.IOException: > !JsseListener: java.lang.NullPointerException > 12/03/07 17:08:36 WARN mortbay.log: failed Server@604788d5: > java.io.IOException: !JsseListener: java.lang.NullPointerException > 12/03/07 17:08:36 INFO mortbay.log: Stopped > Krb5AndCertsSslSocketConnector@0.0.0.0:50475 > 12/03/07 17:08:36 INFO mortbay.log: Stopped > selectchannelconnec...@p-worker35.alley.sara.nl:1006 > 12/03/07 17:08:37 INFO datanode.DataNode: Waiting for threadgroup to exit, > active threads is 0{noformat} > The same happens if I set an absolute path to an existing > dfs.https.server.keystore.resource - in this case the file cannot be found > but not even a WARN is given. > Since in dfs.https.server.keystore.resource we know we need to have 4 > properties specified (ssl.server.truststore.location, > ssl.server.keystore.location, ssl.server.keystore.password, and > ssl.server.keystore.keypassword) we should check if they are set and throw an > IOException if they are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3059) ssl-server.xml causes NullPointer
[ https://issues.apache.org/jira/browse/HDFS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Chen updated HDFS-3059: Status: Open (was: Patch Available) > ssl-server.xml causes NullPointer > - > > Key: HDFS-3059 > URL: https://issues.apache.org/jira/browse/HDFS-3059 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, security >Affects Versions: 2.7.1 > Environment: in core-site.xml: > {code:xml} > > hadoop.security.authentication > kerberos > > > hadoop.security.authorization > true > > {code} > in hdfs-site.xml: > {code:xml} > > dfs.https.server.keystore.resource > /etc/hadoop/conf/ssl-server.xml > > > dfs.https.enable > true > > > ...other security props > > {code} >Reporter: Evert Lammerts >Assignee: Xiao Chen >Priority: Minor > Labels: BB2015-05-TBR > Attachments: HDFS-3059.02.patch, HDFS-3059.03.patch, > HDFS-3059.04.patch, HDFS-3059.05.patch, HDFS-3059.06.patch, HDFS-3059.patch, > HDFS-3059.patch.2 > > > If ssl is enabled (dfs.https.enable) but ssl-server.xml is not available, a > DN will crash during startup while setting up an SSL socket with a > NullPointerException: > {noformat}12/03/07 17:08:36 DEBUG security.Krb5AndCertsSslSocketConnector: > useKerb = false, useCerts = true > jetty.ssl.password : jetty.ssl.keypassword : 12/03/07 17:08:36 INFO > mortbay.log: jetty-6.1.26.cloudera.1 > 12/03/07 17:08:36 INFO mortbay.log: Started > selectchannelconnec...@p-worker35.alley.sara.nl:1006 > 12/03/07 17:08:36 DEBUG security.Krb5AndCertsSslSocketConnector: Creating new > KrbServerSocket for: 0.0.0.0 > 12/03/07 17:08:36 WARN mortbay.log: java.lang.NullPointerException > 12/03/07 17:08:36 WARN mortbay.log: failed > Krb5AndCertsSslSocketConnector@0.0.0.0:50475: java.io.IOException: > !JsseListener: java.lang.NullPointerException > 12/03/07 17:08:36 WARN mortbay.log: failed Server@604788d5: > java.io.IOException: !JsseListener: java.lang.NullPointerException > 12/03/07 17:08:36 INFO mortbay.log: Stopped > Krb5AndCertsSslSocketConnector@0.0.0.0:50475 > 12/03/07 17:08:36 INFO mortbay.log: Stopped > selectchannelconnec...@p-worker35.alley.sara.nl:1006 > 12/03/07 17:08:37 INFO datanode.DataNode: Waiting for threadgroup to exit, > active threads is 0{noformat} > The same happens if I set an absolute path to an existing > dfs.https.server.keystore.resource - in this case the file cannot be found > but not even a WARN is given. > Since in dfs.https.server.keystore.resource we know we need to have 4 > properties specified (ssl.server.truststore.location, > ssl.server.keystore.location, ssl.server.keystore.password, and > ssl.server.keystore.keypassword) we should check if they are set and throw an > IOException if they are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9253) Refactor tests of libhdfs into a directory
[ https://issues.apache.org/jira/browse/HDFS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-9253: - Attachment: HDFS-9253.001.patch > Refactor tests of libhdfs into a directory > -- > > Key: HDFS-9253 > URL: https://issues.apache.org/jira/browse/HDFS-9253 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9253.000.patch, HDFS-9253.001.patch > > > This jira proposes to refactor the current tests in libhdfs into a separate > directory. The refactor opens up the opportunity to reuse tests in libhdfs, > libwebhdfs and libhdfspp in HDFS-8707 and to also allow cross validation of > different implementation of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959988#comment-14959988 ] Yi Liu commented on HDFS-9053: -- Hi Nicholas, sorry, I may have skipped some of the description here. For 2047, I meant that it is just an example threshold for the small-elements size. Currently the small-elements size is an assumed value; we can actually set the degree of the B-Tree to any value we want. If we want 4K as the threshold for the small-elements size, we can set the degree of the B-Tree to 2K, and then the max degree is (4K - 1). (I should have made the description clearer.) Thanks.
> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch
>
>
> This is a long standing issue, we were trying to improve this in the past. Currently we use an ArrayList for the children under a directory, and the children are ordered in the list, for insert/delete, the time complexity is O\(n), (the search is O(log n), but insertion/deleting causes re-allocations and copies of arrays), for large directory, the operations are expensive. If the children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) continuous heap memory, it easily causes full GC in HDFS cluster where namenode heap memory is already highly used. I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to solve the problem suggested by [~shv].
> So the target of this JIRA is to implement a low memory footprint B-Tree and use it to replace ArrayList.
> If the elements size is not large (less than the maximum degree of B-Tree node), the B-Tree only has one root node which contains an array for the elements. And if the size grows large enough, it will split automatically, and if elements are removed, then B-Tree nodes can merge automatically (see more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
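To make the arithmetic above concrete, here is the usual B-Tree sizing rule, assuming the CLRS-style "minimum degree" definition (which appears to be the one meant in this thread):
{code}
// For a B-Tree with minimum degree t, a node holds at most 2*t - 1 elements.
// With t = 2048 (2K):
//   max elements per node = 2 * 2048 - 1 = 4095 = 4K - 1
// so a directory with fewer than ~4K children fits entirely in the root node.
{code}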
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959972#comment-14959972 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- > For small elements size (assume # < max degree which is 2047), ... Do I miss > something? According to [your comment|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14950498&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14950498] (also copied below), you were saying that B-Tree only increased 8 bytes when #children < 4K, i.e. when 2047 < #children < 4K. Is it still true? If not, how much memory is needed when 2047 < #children < 4K? {quote} I find a good approach to improve B-Tree memory overhead to make it only increase 8 bytes memory usage comparing with using ArrayList for small elements size. So we don't need to use ArrayList when #children is small (< 4K), and we can always use the BTree. {quote} > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959955#comment-14959955 ] Yi Liu edited comment on HDFS-9053 at 10/16/15 12:56 AM: - Thanks [~szetszwo] for the comments. {quote} >> 24 8 Object[] Node.elements N/A >> 32 8 Object[] Node.children N/A It only counts the reference but array objects are not counted. So the BTree overhead is still a lot more than ArrayList. {quote} For small elements size (assume # < max degree which is 2047), the {{children}} is null reference, so there is no array object of {{children}} here, just 8 bytes for null reference. And for {{elements}}, {{ArrayList}} also has it. So as described above, {{BTree}} increases 8 bytes compared with {{ArrayList}} for small size elements. Do I miss something? was (Author: hitliuyi): Thanks [~szetszwo] for the comments. {quote} >> 24 8 Object[] Node.elements N/A >> 32 8 Object[] Node.children N/A It only counts the reference but array objects are not counted. So the BTree overhead is still a lot more than ArrayList. {quote} For small elements size (assume # < max degree which is 2047), the {{children}} is null reference, so there is no array object of {{children}} here, and for {{elements}}, {{ArrayList}} also has it. So as described above, {{BTree}} increases 8 bytes compared with {{ArrayList}} for small size elements. Do I miss something? > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). 
It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959955#comment-14959955 ] Yi Liu commented on HDFS-9053: -- Thanks [~szetszwo] for the comments. {quote} >> 24 8 Object[] Node.elements N/A >> 32 8 Object[] Node.children N/A It only counts the reference but array objects are not counted. So the BTree overhead is still a lot more than ArrayList. {quote} For small elements size (assume # < max degree which is 2047), the {{children}} is null reference, so there is no array object of {{children}} here, and for {{elements}}, {{ArrayList}} also has it. So as described above, {{BTree}} increases 8 bytes compared with {{ArrayList}} for small size elements. Do I miss something? > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
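To illustrate the 8-byte claim in the comments above, here is a simplified sketch of the node layout under discussion; the field names follow the JOL-style dump quoted in the thread, and this is not the actual patch code.
{code}
class Node<E> {
  Object[] elements;  // plays the same role as ArrayList's backing array
  Object[] children;  // stays null while the tree is a single root node, so
                      // the only extra cost vs. ArrayList is this 8-byte
                      // null reference on a typical 64-bit JVM
}
{code}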
[jira] [Commented] (HDFS-9253) Refactor tests of libhdfs into a directory
[ https://issues.apache.org/jira/browse/HDFS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959950#comment-14959950 ] Hadoop QA commented on HDFS-9253: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 52s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:red}-1{color} | javac | 2m 22s | The patch appears to cause the build to fail. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766917/HDFS-9253.000.patch | | Optional Tests | javadoc javac unit | | git revision | trunk / cf23f2c | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13017/console | This message was automatically generated. > Refactor tests of libhdfs into a directory > -- > > Key: HDFS-9253 > URL: https://issues.apache.org/jira/browse/HDFS-9253 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9253.000.patch > > > This jira proposes to refactor the current tests in libhdfs into a separate > directory. The refactor opens up the opportunity to reuse tests in libhdfs, > libwebhdfs and libhdfspp in HDFS-8707 and to also allow cross validation of > different implementation of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9184) Logging HDFS operation's caller context into audit logs
[ https://issues.apache.org/jira/browse/HDFS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959929#comment-14959929 ] Hadoop QA commented on HDFS-9184: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 35m 12s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 15m 18s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 22m 22s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 1m 11s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 3m 24s | The applied patch generated 3 new checkstyle issues (total was 403, now 405). | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 3m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 1m 13s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 8m 16s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | common tests | 18m 12s | Tests failed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 62m 1s | Tests failed in hadoop-hdfs. | | | | 171m 10s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.fs.shell.find.TestIname | | | hadoop.fs.shell.find.TestFind | | | hadoop.ipc.TestIPC | | | hadoop.security.token.delegation.TestZKDelegationTokenSecretManager | | | hadoop.fs.shell.find.TestPrint0 | | | hadoop.fs.shell.find.TestPrint | | | hadoop.hdfs.tools.TestDFSZKFailoverController | | | hadoop.hdfs.server.namenode.TestFileTruncate | | Timed out tests | org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure | | | org.apache.hadoop.hdfs.TestConnCache | | | org.apache.hadoop.hdfs.TestSetrepDecreasing | | | org.apache.hadoop.hdfs.TestEncryptedTransfer | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766871/HDFS-9184.007.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8d2d3eb | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13012/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13012/artifact/patchprocess/diffcheckstylehadoop-common.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13012/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13012/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13012/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13012/console | This message was automatically generated. 
> Logging HDFS operation's caller context into audit logs > --- > > Key: HDFS-9184 > URL: https://issues.apache.org/jira/browse/HDFS-9184 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HDFS-9184.000.patch, HDFS-9184.001.patch, > HDFS-9184.002.patch, HDFS-9184.003.patch, HDFS-9184.004.patch, > HDFS-9184.005.patch, HDFS-9184.006.patch, HDFS-9184.007.patch > > > For a given HDFS operation (e.g. delete file), it's very helpful to track > which upper level job issues it. The upper level callers may be specific > Oozie tasks, MR jobs, and hive queries. One scenario is that the namenode > (NN) is abused/spammed, the operator may want to know immediately which MR > job should be blamed so that she can kill it. To this end, the caller context > contains at least the application-dependent "tracking id". > There are several existing techniques that may be related to this problem. > 1. Currently the HDFS audit log tracks the users of the the operation which > is obviously not enough. It's common that the same user issues multiple jobs > at the same time. Even for a single top level task, tracking back to a > specific caller in a chain of operations of the whole workflow (e.g.Oozie -> > Hive -> Yarn) is hard, if not impossible. > 2. HDFS integrated {{htrace}} support fo
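For a sense of how the feature is used from the caller side, here is a hedged sketch; the {{CallerContext}} API names follow the HDFS-9184 patches under review and may change before commit, and the tracking id is illustrative.
{code}
import org.apache.hadoop.ipc.CallerContext;

public class CallerContextSketch {
  public static void main(String[] args) {
    // An upper-level framework (e.g. a Hive query or MR job) tags its thread:
    CallerContext.setCurrent(
        new CallerContext.Builder("hive_query:q_20151015_001").build());
    // HDFS RPCs issued from this thread then carry the context, and the
    // NameNode can append it to the matching audit log entries, so an abusive
    // operation can be traced back to the job that issued it.
  }
}
{code}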
[jira] [Updated] (HDFS-9214) Reconfigure DN concurrent move on the fly
[ https://issues.apache.org/jira/browse/HDFS-9214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobing Zhou updated HDFS-9214: Attachment: HDFS-9214.002.patch Patch 002 with more tests added. > Reconfigure DN concurrent move on the fly > - > > Key: HDFS-9214 > URL: https://issues.apache.org/jira/browse/HDFS-9214 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 2.7.0 >Reporter: Xiaobing Zhou >Assignee: Xiaobing Zhou > Attachments: HDFS-9214.001.patch, HDFS-9214.002.patch > > > This is to reconfigure > {code} > dfs.datanode.balance.max.concurrent.moves > {code} without restarting DN. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9245) Fix findbugs warnings in hdfs-nfs/WriteCtx
[ https://issues.apache.org/jira/browse/HDFS-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-9245: Attachment: HDFS-9245.000.patch Hi [~yzhangal], please review the patch v0. Thanks. > Fix findbugs warnings in hdfs-nfs/WriteCtx > -- > > Key: HDFS-9245 > URL: https://issues.apache.org/jira/browse/HDFS-9245 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HDFS-9245.000.patch > > > There are findbugs warnings as follows, brought by [HDFS-9092]. > It seems fine to ignore them by write a filter rule in the > {{findbugsExcludeFile.xml}} file. > {code:xml} > instanceHash="592511935f7cb9e5f97ef4c99a6c46c2" instanceOccurrenceNum="0" > priority="2" abbrev="IS" type="IS2_INCONSISTENT_SYNC" cweid="366" > instanceOccurrenceMax="0"> > Inconsistent synchronization > > Inconsistent synchronization of > org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.offset; locked 75% of time > > > sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" > sourcefile="WriteCtx.java" end="314"> > At WriteCtx.java:[lines 40-314] > > In class org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx > > {code} > and > {code:xml} > instanceHash="4f3daa339eb819220f26c998369b02fe" instanceOccurrenceNum="0" > priority="2" abbrev="IS" type="IS2_INCONSISTENT_SYNC" cweid="366" > instanceOccurrenceMax="0"> > Inconsistent synchronization > > Inconsistent synchronization of > org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.originalCount; locked 50% of time > > > sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" > sourcefile="WriteCtx.java" end="314"> > At WriteCtx.java:[lines 40-314] > > In class org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx > > name="originalCount" primary="true" signature="I"> > sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" > sourcefile="WriteCtx.java"> > In WriteCtx.java > > > Field org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.originalCount > > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9245) Fix findbugs warnings in hdfs-nfs/WriteCtx
[ https://issues.apache.org/jira/browse/HDFS-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-9245: Status: Patch Available (was: Open) > Fix findbugs warnings in hdfs-nfs/WriteCtx > -- > > Key: HDFS-9245 > URL: https://issues.apache.org/jira/browse/HDFS-9245 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HDFS-9245.000.patch > > > There are findbugs warnings as follows, brought by [HDFS-9092]. > It seems fine to ignore them by write a filter rule in the > {{findbugsExcludeFile.xml}} file. > {code:xml} > instanceHash="592511935f7cb9e5f97ef4c99a6c46c2" instanceOccurrenceNum="0" > priority="2" abbrev="IS" type="IS2_INCONSISTENT_SYNC" cweid="366" > instanceOccurrenceMax="0"> > Inconsistent synchronization > > Inconsistent synchronization of > org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.offset; locked 75% of time > > > sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" > sourcefile="WriteCtx.java" end="314"> > At WriteCtx.java:[lines 40-314] > > In class org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx > > {code} > and > {code:xml} > instanceHash="4f3daa339eb819220f26c998369b02fe" instanceOccurrenceNum="0" > priority="2" abbrev="IS" type="IS2_INCONSISTENT_SYNC" cweid="366" > instanceOccurrenceMax="0"> > Inconsistent synchronization > > Inconsistent synchronization of > org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.originalCount; locked 50% of time > > > sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" > sourcefile="WriteCtx.java" end="314"> > At WriteCtx.java:[lines 40-314] > > In class org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx > > name="originalCount" primary="true" signature="I"> > sourcepath="org/apache/hadoop/hdfs/nfs/nfs3/WriteCtx.java" > sourcefile="WriteCtx.java"> > In WriteCtx.java > > > Field org.apache.hadoop.hdfs.nfs.nfs3.WriteCtx.originalCount > > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9220) Reading small file (< 512 bytes) that is open for append fails due to incorrect checksum
[ https://issues.apache.org/jira/browse/HDFS-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959884#comment-14959884 ] Hudson commented on HDFS-9220: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2440 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2440/]) HDFS-9220. Reading small file (< 512 bytes) that is open for append (kihwal: rev c7c36cbd6218f46c33d7fb2f60cd52cb29e6d720) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestFileAppend2.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > Reading small file (< 512 bytes) that is open for append fails due to > incorrect checksum > > > Key: HDFS-9220 > URL: https://issues.apache.org/jira/browse/HDFS-9220 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Bogdan Raducanu >Assignee: Jing Zhao >Priority: Blocker > Fix For: 3.0.0, 2.7.2 > > Attachments: HDFS-9220.000.patch, HDFS-9220.001.patch, > HDFS-9220.002.patch, test2.java > > > Exception: > 2015-10-09 14:59:40 WARN DFSClient:1150 - fetchBlockByteRange(). Got a > checksum exception for /tmp/file0.05355529331575182 at > BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882:0 from > DatanodeInfoWithStorage[10.10.10.10]:5001 > All 3 replicas cause this exception and the read fails entirely with: > BlockMissingException: Could not obtain block: > BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882 > file=/tmp/file0.05355529331575182 > Code to reproduce is attached. > Does not happen in 2.7.0. > Data is read correctly if checksum verification is disabled. > More generally, the failure happens when reading from the last block of a > file and the last block has <= 512 bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
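For reference, the failure mode can be exercised along the following lines; this is a hedged sketch in the spirit of the attached test2.java (not reproduced here), with an illustrative path and size, assuming a running 2.7.1 cluster reachable via the default {{Configuration}}.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallAppendReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/small-file");

    try (FSDataOutputStream out = fs.create(p)) {
      out.write(new byte[100]);  // last (only) block is well under 512 bytes
    }
    FSDataOutputStream appender = fs.append(p);  // file is now open for append

    byte[] buf = new byte[100];
    try (FSDataInputStream in = fs.open(p)) {
      in.readFully(0, buf);  // on 2.7.1 this fails with a checksum error
    } finally {
      appender.close();
    }
  }
}
{code}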
[jira] [Commented] (HDFS-9173) Erasure Coding: Lease recovery for striped file
[ https://issues.apache.org/jira/browse/HDFS-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959859#comment-14959859 ] Zhe Zhang commented on HDFS-9173: - Thanks Walter for the updates. Some additional comments:
# In the current patch {{StripedRecoveryTask}} only shares a few lines of trivial code with {{RecoveryTask}}. It's a little odd to subclass it.
# Fundamentally the 2 {{RecoveryTask}} types do share a lot of logic. They both go through steps 1, 2, 5 as described in the Javadoc. So here's an alternative way to structure the 2 classes:
{code}
public class RecoveryTaskContiguous {
  protected void recover() {
    // Step 1.1: callInitReplicaRecovery and get all block lengths from DataNodes, generate list of BlockRecord
    // Step 1.2: check if there's any FINALIZED replica
  }
  void syncBlockFinalized(List<BlockRecord> syncList) { }
  void syncBlockUnfinalized(List<BlockRecord> syncList) { }
}

public class RecoveryTaskStriped {
  protected void recover() {
    // Step 1.1: callInitReplicaRecovery and get all block lengths from DataNodes, generate list of BlockRecord
  }
  void syncBlockUnfinalized(List<BlockRecord> syncList) { }
}
{code}
Step 1.1 is identical in both classes so we can use a shared static method to do it. The logic of synchronizing striped internal blocks is very similar to handling {{RBW}} and {{RWR}} replicas. We can use a shared {{syncBlockUnfinalized}} method. This suggested consolidation can be done separately.
# In {{StripedRecoveryTask#recover}} we are calling {{callInitReplicaRecovery}} twice. Is the second call necessary?
# {{StripedRecoveryTask#ecPolicy}} is unnecessary.
Nits:
# We can take this chance to fix the {{if}} statements without parentheses
# Can also update "Convenience" / "convenient" to "helper" / "util" when describing classes and methods
# Steps 3 and 4 in the Javadoc should be marked as TODO or removed for now
> Erasure Coding: Lease recovery for striped file
> ---
>
> Key: HDFS-9173
> URL: https://issues.apache.org/jira/browse/HDFS-9173
> Project: Hadoop HDFS
> Issue Type: Sub-task
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9173.00.wip.patch, HDFS-9173.01.patch, HDFS-9173.02.step125.patch, HDFS-9173.03.patch, HDFS-9173.04.patch
>
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9129) Move the safemode block count into BlockManager
[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-9129: Attachment: HDFS-9129.004.patch The v4 patch is to address the failing tests and further refactoring is possible. Please hold on before reviewing this. > Move the safemode block count into BlockManager > --- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959826#comment-14959826 ] Daryn Sharp commented on HDFS-9239: --- It seems like a good idea at first, but I don't think the proposal solves the stated issues:
* This prevents the NameNode from spuriously marking healthy DataNodes as stale or dead.
* ... delayed DataNodes may be flagged as stale, and applications may erroneously choose to avoid accessing those nodes
* ... DataNodes may be flagged as dead. In extreme cases, this can cause a NameNode to schedule wasteful rereplication activity.
Let's say the NN can't service heartbeats fast enough to avoid false staleness (stale defaults to 30s). That means it definitely can't process IBRs either. Would a lifeline to prevent the stale flag matter at this point? At this level of congestion, nearly all of the nodes are going stale, and the staleness is probably the least of your worries. If nodes are marked dead from inability to keep up with heartbeats (defaults to ~10min), the cluster itself is already in serious trouble. Worrying about wasted replications is dubious, because the NN can't issue replications if it can't process the heartbeats. That is not a heavy-load scenario. From personal experience, it sounds like the fallout of a 120GB+ heap stop-the-world GC. The NN wakes up, and the heartbeat monitor starts marking everything dead. This sparks a replication storm, followed by an invalidation storm, which the NN recovers from... unless it goes into another full GC. The lifeline might help slow the rise of false-dead nodes. However, I recently patched the heartbeat monitor to detect long GCs and be very gracious before marking nodes dead. If I've misinterpreted anything, please describe the incident that prompted this approach so we can see if it would have helped.
> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline Protocol. This is an RPC protocol that is responsible for reporting liveness and basic health information about a DataNode to a NameNode. Compared to the existing heartbeat messages, it is lightweight and not prone to resource contention problems that can harm accurate tracking of DataNode liveness currently. The attached design document contains more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9253) Refactor tests of libhdfs into a directory
[ https://issues.apache.org/jira/browse/HDFS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959819#comment-14959819 ] Haohui Mai commented on HDFS-9253: -- The v0 patch moves the tests of libhdfs into the {{libhdfs-tests}} directory. It also moves {{hdfs.h}} (which is the only public include file) to {{libhdfs/include/hdfs}} so that other modules need only put {{hdfs.h}} on their include paths. > Refactor tests of libhdfs into a directory > -- > > Key: HDFS-9253 > URL: https://issues.apache.org/jira/browse/HDFS-9253 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9253.000.patch > > > This JIRA proposes to refactor the current tests in libhdfs into a separate > directory. The refactor opens up the opportunity to reuse tests in libhdfs, > libwebhdfs and libhdfspp in HDFS-8707 and to allow cross-validation of > different implementations of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9253) Refactor tests of libhdfs into a directory
[ https://issues.apache.org/jira/browse/HDFS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-9253: - Attachment: HDFS-9253.000.patch > Refactor tests of libhdfs into a directory > -- > > Key: HDFS-9253 > URL: https://issues.apache.org/jira/browse/HDFS-9253 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9253.000.patch > > > This JIRA proposes to refactor the current tests in libhdfs into a separate > directory. The refactor opens up the opportunity to reuse tests in libhdfs, > libwebhdfs and libhdfspp in HDFS-8707 and to allow cross-validation of > different implementations of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9253) Refactor tests of libhdfs into a directory
[ https://issues.apache.org/jira/browse/HDFS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-9253: - Status: Patch Available (was: Open) > Refactor tests of libhdfs into a directory > -- > > Key: HDFS-9253 > URL: https://issues.apache.org/jira/browse/HDFS-9253 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9253.000.patch > > > This JIRA proposes to refactor the current tests in libhdfs into a separate > directory. The refactor opens up the opportunity to reuse tests in libhdfs, > libwebhdfs and libhdfspp in HDFS-8707 and to allow cross-validation of > different implementations of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9253) Refactor tests of libhdfs into a directory
Haohui Mai created HDFS-9253: Summary: Refactor tests of libhdfs into a directory Key: HDFS-9253 URL: https://issues.apache.org/jira/browse/HDFS-9253 Project: Hadoop HDFS Issue Type: Improvement Reporter: Haohui Mai Assignee: Haohui Mai This JIRA proposes to refactor the current tests in libhdfs into a separate directory. The refactor opens up the opportunity to reuse tests in libhdfs, libwebhdfs and libhdfspp in HDFS-8707 and to allow cross-validation of different implementations of the libhdfs API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9251) Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in test code.
[ https://issues.apache.org/jira/browse/HDFS-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-9251: Attachment: HDFS-9251.01.patch Address whitespace warnings. The findbugs warnings and test failures are not relevant. > Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly > creating Files in test code. > --- > > Key: HDFS-9251 > URL: https://issues.apache.org/jira/browse/HDFS-9251 > Project: Hadoop HDFS > Issue Type: Improvement > Components: HDFS >Affects Versions: 2.7.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-9251.00.patch, HDFS-9251.01.patch > > > In {{TestWriteToReplica}} and {{TestFsDatasetImpl}}, tests directly create > block and metadata files: > {code} > replicaInfo.getBlockFile().createNewFile(); > replicaInfo.getMetaFile().createNewFile(); > {code} > It leaks the implementation details of {{FsDatasetImpl}}. This JIRA proposes > to use {{FsDatasetImplTestUtils}} (HDFS-9188) to create replicas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8831) Trash Support for files in HDFS encryption zone
[ https://issues.apache.org/jira/browse/HDFS-8831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959792#comment-14959792 ] Mingliang Liu commented on HDFS-8831: - Thanks for working on this. The design doc is very helpful. > Trash Support for files in HDFS encryption zone > --- > > Key: HDFS-8831 > URL: https://issues.apache.org/jira/browse/HDFS-8831 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: encryption >Reporter: Xiaoyu Yao >Assignee: Xiaoyu Yao > Attachments: HDFS-8831-10152015.pdf > > > Currently, "Soft Delete" is only supported if the whole encryption zone is > deleted. If you delete files within the zone with the trash feature enabled, > you will get an error similar to the following > {code} > rm: Failed to move to trash: hdfs://HW11217.local:9000/z1_1/startnn.sh: > /z1_1/startnn.sh can't be moved from an encryption zone. > {code} > With HDFS-8830, we can support "Soft Delete" by adding the .Trash folder of > the file being deleted appropriately to the same encryption zone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9251) Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in test code.
[ https://issues.apache.org/jira/browse/HDFS-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959786#comment-14959786 ] Hadoop QA commented on HDFS-9251: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 7m 58s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 4 new or modified test files. | | {color:green}+1{color} | javac | 7m 54s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 25s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 28s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 28s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 1m 2s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 50m 16s | Tests failed in hadoop-hdfs. | | | | 73m 28s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766872/HDFS-9251.00.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / 8d2d3eb | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13011/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/13011/artifact/patchprocess/whitespace.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13011/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13011/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13011/console | This message was automatically generated. > Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly > creating Files in test code. > --- > > Key: HDFS-9251 > URL: https://issues.apache.org/jira/browse/HDFS-9251 > Project: Hadoop HDFS > Issue Type: Improvement > Components: HDFS >Affects Versions: 2.7.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-9251.00.patch > > > In {{TestWriteToReplica}} and {{TestFsDatasetImpl}}, tests directly create > block and metadata files: > {code} > replicaInfo.getBlockFile().createNewFile(); > replicaInfo.getMetaFile().createNewFile(); > {code} > It leaks the implementation details of {{FsDatasetImpl}}. This JIRA proposes > to use {{FsDatasetImplTestUtils}} (HDFS-9188) to create replicas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8766) Implement a libhdfs(3) compatible API
[ https://issues.apache.org/jira/browse/HDFS-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959780#comment-14959780 ] Hadoop QA commented on HDFS-8766: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766903/HDFS-8766.HDFS-8707.006.patch | | Optional Tests | javac unit | | git revision | HDFS-8707 / 4cd3b99 | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13014/console | This message was automatically generated. > Implement a libhdfs(3) compatible API > - > > Key: HDFS-8766 > URL: https://issues.apache.org/jira/browse/HDFS-8766 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Reporter: James Clampffer >Assignee: James Clampffer > Attachments: HDFS-8766.HDFS-8707.000.patch, > HDFS-8766.HDFS-8707.001.patch, HDFS-8766.HDFS-8707.002.patch, > HDFS-8766.HDFS-8707.003.patch, HDFS-8766.HDFS-8707.004.patch, > HDFS-8766.HDFS-8707.005.patch, HDFS-8766.HDFS-8707.006.patch > > > Add a synchronous API that is compatible with the hdfs.h header used in > libhdfs and libhdfs3. This will make it possible for projects using > libhdfs/libhdfs3 to relink against libhdfspp with minimal changes. > This also provides a pure C interface that can be linked against projects > that aren't built in C++11 mode for various reasons but use the same > compiler. It also allows many other programming languages to access > libhdfspp through builtin FFI interfaces. > The libhdfs API is very similar to the posix file API which makes it easier > for programs built using posix filesystem calls to be modified to access HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8831) Trash Support for files in HDFS encryption zone
[ https://issues.apache.org/jira/browse/HDFS-8831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoyu Yao updated HDFS-8831: - Attachment: HDFS-8831-10152015.pdf Attaching a design document for review and discussion. > Trash Support for files in HDFS encryption zone > --- > > Key: HDFS-8831 > URL: https://issues.apache.org/jira/browse/HDFS-8831 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: encryption >Reporter: Xiaoyu Yao >Assignee: Xiaoyu Yao > Attachments: HDFS-8831-10152015.pdf > > > Currently, "Soft Delete" is only supported if the whole encryption zone is > deleted. If you delete files within the zone with the trash feature enabled, > you will get an error similar to the following > {code} > rm: Failed to move to trash: hdfs://HW11217.local:9000/z1_1/startnn.sh: > /z1_1/startnn.sh can't be moved from an encryption zone. > {code} > With HDFS-8830, we can support "Soft Delete" by adding the .Trash folder of > the file being deleted appropriately to the same encryption zone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
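One detail from the description is worth making concrete: a rename out of an encryption zone is forbidden, so the trash directory has to live inside the zone itself. A hedged sketch of that path mapping (class and method names invented here, not quoted from the design doc):
{code:java}
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch: resolve a trash root inside the encryption zone. */
class EzTrashPaths {
  /**
   * For a file inside an encryption zone, the trash root is kept within the
   * zone (e.g. /zone/.Trash/$USER) so the move never crosses the zone boundary.
   */
  static Path trashRootFor(Path encryptionZoneRoot, String user) {
    return new Path(encryptionZoneRoot, ".Trash/" + user);
  }

  public static void main(String[] args) {
    Path zone = new Path("/z1_1");
    // A deleted /z1_1/startnn.sh would move under /z1_1/.Trash/alice/Current/...
    System.out.println(trashRootFor(zone, "alice"));
  }
}
{code}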
[jira] [Updated] (HDFS-8766) Implement a libhdfs(3) compatible API
[ https://issues.apache.org/jira/browse/HDFS-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Clampffer updated HDFS-8766: -- Attachment: HDFS-8766.HDFS-8707.006.patch Stripped stuff out. > Implement a libhdfs(3) compatible API > - > > Key: HDFS-8766 > URL: https://issues.apache.org/jira/browse/HDFS-8766 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Reporter: James Clampffer >Assignee: James Clampffer > Attachments: HDFS-8766.HDFS-8707.000.patch, > HDFS-8766.HDFS-8707.001.patch, HDFS-8766.HDFS-8707.002.patch, > HDFS-8766.HDFS-8707.003.patch, HDFS-8766.HDFS-8707.004.patch, > HDFS-8766.HDFS-8707.005.patch, HDFS-8766.HDFS-8707.006.patch > > > Add a synchronous API that is compatible with the hdfs.h header used in > libhdfs and libhdfs3. This will make it possible for projects using > libhdfs/libhdfs3 to relink against libhdfspp with minimal changes. > This also provides a pure C interface that can be linked against projects > that aren't built in C++11 mode for various reasons but use the same > compiler. It also allows many other programming languages to access > libhdfspp through builtin FFI interfaces. > The libhdfs API is very similar to the posix file API which makes it easier > for programs built using posix filesystem calls to be modified to access HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9198) Coalesce IBR processing in the NN
[ https://issues.apache.org/jira/browse/HDFS-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959752#comment-14959752 ] Hadoop QA commented on HDFS-9198: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766881/HDFS-9198-trunk.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8d2d3eb | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13013/console | This message was automatically generated. > Coalesce IBR processing in the NN > - > > Key: HDFS-9198 > URL: https://issues.apache.org/jira/browse/HDFS-9198 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: HDFS-9198-branch2.patch, HDFS-9198-trunk.patch, > HDFS-9198-trunk.patch > > > IBRs from thousands of DNs under load will degrade NN performance due to > excessive write-lock contention from multiple IPC handler threads. The IBR > processing is quick, so the lock contention may be reduced by coalescing > multiple IBRs into a single write-lock transaction. The handlers will also > be freed up faster for other operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9252) Change TestFileTruncate to use FsDatasetTestUtils to get block file size and genstamp.
[ https://issues.apache.org/jira/browse/HDFS-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-9252: Attachment: HDFS-9252.00.patch Add {{getDataLength()}} and {{getPersistentGenStamp()}} to {{FsDatasetTestUtils}}. > Change TestFileTruncate to use FsDatasetTestUtils to get block file size and > genstamp. > -- > > Key: HDFS-9252 > URL: https://issues.apache.org/jira/browse/HDFS-9252 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.7.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-9252.00.patch > > > {{TestFileTruncate}} verifies block size and genstamp by directly accessing > the local filesystem, e.g.: > {code} > assertTrue(cluster.getBlockMetadataFile(dn0, >newBlock.getBlock()).getName().endsWith( >newBlock.getBlock().getGenerationStamp() + ".meta")); > {code} > Let's abstract the fsdataset-specific logic behind FsDatasetTestUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
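A rough sketch of what the refactored assertions might look like once those getters exist; the exact signatures below are assumptions based on the comment above, not the committed API.
{code:java}
// Hypothetical refactored assertions in TestFileTruncate, assuming the new
// FsDatasetTestUtils getters take an ExtendedBlock and hide the on-disk layout:
FsDatasetTestUtils utils = cluster.getFsDatasetTestUtils(dn0);

// Instead of parsing the meta file name for the generation stamp ...
assertEquals(newBlock.getBlock().getGenerationStamp(),
    utils.getPersistentGenStamp(newBlock.getBlock()));

// ... and instead of stat'ing the block file for its on-disk length:
assertEquals(newBlock.getBlockSize(),
    utils.getDataLength(newBlock.getBlock()));
{code}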
[jira] [Created] (HDFS-9252) Change TestFileTruncate to use FsDatasetTestUtils to get block file size and genstamp.
Lei (Eddy) Xu created HDFS-9252: --- Summary: Change TestFileTruncate to use FsDatasetTestUtils to get block file size and genstamp. Key: HDFS-9252 URL: https://issues.apache.org/jira/browse/HDFS-9252 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu {{TestFileTruncate}} verifies block size and genstamp by directly accessing the local filesystem, e.g.: {code} assertTrue(cluster.getBlockMetadataFile(dn0, newBlock.getBlock()).getName().endsWith( newBlock.getBlock().getGenerationStamp() + ".meta")); {code} Let's abstract the fsdataset-specific logic behind FsDatasetTestUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9079) Erasure coding: preallocate multiple generation stamps and serialize updates from data streamers
[ https://issues.apache.org/jira/browse/HDFS-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959726#comment-14959726 ] Hadoop QA commented on HDFS-9079: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 22m 43s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 49s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 27s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 3m 7s | The applied patch generated 96 new checkstyle issues (total was 91, now 181). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 46s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 5m 5s | The patch appears to introduce 3 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 3m 37s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 68m 15s | Tests failed in hadoop-hdfs. | | {color:green}+1{color} | hdfs tests | 0m 34s | Tests passed in hadoop-hdfs-client. | | | | 126m 40s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs-client | | Failed unit tests | hadoop.hdfs.TestReplaceDatanodeOnFailure | | | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure | | | hadoop.hdfs.server.namenode.TestSecureNameNode | | | hadoop.hdfs.TestRollingUpgrade | | | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes | | | hadoop.hdfs.server.namenode.TestFileTruncate | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766857/HDFS-9079.03.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8d2d3eb | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13010/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13010/artifact/patchprocess/diffcheckstylehadoop-hdfs-client.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13010/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs-client.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13010/artifact/patchprocess/testrun_hadoop-hdfs.txt | | hadoop-hdfs-client test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13010/artifact/patchprocess/testrun_hadoop-hdfs-client.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13010/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13010/console | This message was automatically generated. 
> Erasure coding: preallocate multiple generation stamps and serialize updates > from data streamers > > > Key: HDFS-9079 > URL: https://issues.apache.org/jira/browse/HDFS-9079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Affects Versions: HDFS-7285 >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-9079-HDFS-7285.00.patch, HDFS-9079.01.patch, > HDFS-9079.02.patch, HDFS-9079.03.patch > > > A non-striped DataStreamer goes through the following steps in error handling: > {code} > 1) Finds error => 2) Asks NN for new GS => 3) Gets new GS from NN => 4) > Applies new GS to DN (createBlockOutputStream) => 5) Ack from DN => 6) > Updates block on NN > {code} > To simplify the above we can preallocate GS when NN creates a new striped > block group ({{FSN#createNewBlock}}). For each new striped block group we can > reserve {{NUM_PARITY_BLOCKS}} GS's. Then steps 1~3 in the above sequence can > be saved. If more than {{NUM_PARITY_BLOCKS}} errors have happened we > shouldn't try to further recover anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9249) NPE thrown if an IOException is thrown in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959715#comment-14959715 ] Hadoop QA commented on HDFS-9249: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 22m 50s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 8m 49s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 32s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 41s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 1s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 4m 10s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 62m 8s | Tests failed in hadoop-hdfs. | | | | 116m 59s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.server.namenode.ha.TestEditLogTailer | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766847/HDFS-9249.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8d2d3eb | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13009/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13009/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13009/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13009/console | This message was automatically generated. > NPE thrown if an IOException is thrown in NameNode. > - > > Key: HDFS-9249 > URL: https://issues.apache.org/jira/browse/HDFS-9249 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Minor > Labels: supportability > Attachments: HDFS-9249.001.patch > > > This issue was found when running test case > TestBackupNode.testCheckpointNode, but upon closer look, the problem is not > due to the test case. > Looks like an IOException was thrown in > try { > initializeGenericKeys(conf, nsId, namenodeId); > initialize(conf); > try { > haContext.writeLock(); > state.prepareToEnterState(haContext); > state.enterState(haContext); > } finally { > haContext.writeUnlock(); > } > causing the namenode to stop, but the namesystem was not yet properly > instantiated, causing NPE. 
> I tried to reproduce locally, but to no avail. > Because I could not reproduce the bug, and the log does not indicate what > caused the IOException, I suggest making this a supportability JIRA to log the > exception for future improvement. > Stacktrace > java.lang.NullPointerException: null > at > org.apache.hadoop.hdfs.server.namenode.NameNode.getFSImage(NameNode.java:906) > at org.apache.hadoop.hdfs.server.namenode.BackupNode.stop(BackupNode.java:210) > at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:827) > at > org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1474) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.startBackupNode(TestBackupNode.java:102) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint(TestBackupNode.java:298) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpointNode(TestBackupNode.java:130) > The last few lines of log: > 2015-10-14 19:45:07,807 INFO namenode.NameNode > (NameNode.java:createNameNode(1422)) - createNameNo
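The root-cause IOException was never captured, but the NPE itself is clear from the stack: {{BackupNode.stop()}} touches the FSImage before the namesystem is constructed. A hedged sketch of the kind of defensive fix being suggested (illustrative only, not the committed patch):
{code:java}
// Hypothetical defensive fix: BackupNode.stop() can run before the namesystem
// exists if NameNode's constructor threw an IOException mid-initialization.
@Override // BackupNode
public void stop() {
  if (namesystem == null) {
    // Initialization failed before FSNamesystem was created; log the
    // condition (the supportability ask in this JIRA) and skip FSImage
    // cleanup rather than dereferencing null.
    LOG.warn("Stopping BackupNode that never finished initializing");
  } else {
    // ... existing checkpoint/registration cleanup that calls getFSImage() ...
  }
  super.stop();
}
{code}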
[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery
[ https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959676#comment-14959676 ] Hadoop QA commented on HDFS-9236: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 43s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 12s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 28s | The applied patch generated 2 new checkstyle issues (total was 142, now 142). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 31s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 3m 15s | Pre-build of native portion | | {color:green}+1{color} | hdfs tests | 49m 37s | Tests passed in hadoop-hdfs. | | | | 96m 54s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766845/HDFS-9236.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8d2d3eb | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13008/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13008/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13008/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13008/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13008/console | This message was automatically generated. > Missing sanity check for block size during block recovery > - > > Key: HDFS-9236 > URL: https://issues.apache.org/jira/browse/HDFS-9236 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Affects Versions: 2.7.1 >Reporter: Tony Wu >Assignee: Tony Wu > Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, > HDFS-9236.003.patch > > > Ran into an issue while running tests against faulty DataNode code. > Currently in DataNode.java: > {code:java} > /** Block synchronization */ > void syncBlock(RecoveringBlock rBlock, > List<BlockRecord> syncList) throws IOException { > … > // Calculate the best available replica state. 
> ReplicaState bestState = ReplicaState.RWR; > … > // Calculate list of nodes that will participate in the recovery > // and the new block size > List<BlockRecord> participatingList = new ArrayList<BlockRecord>(); > final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId, > -1, recoveryId); > switch(bestState) { > … > case RBW: > case RWR: > long minLength = Long.MAX_VALUE; > for(BlockRecord r : syncList) { > ReplicaState rState = r.rInfo.getOriginalReplicaState(); > if(rState == bestState) { > minLength = Math.min(minLength, r.rInfo.getNumBytes()); > participatingList.add(r); > } > } > newBlock.setNumBytes(minLength); > break; > … > } > … > nn.commitBlockSynchronization(block, > newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false, > datanodes, storages); > } > {code} > This code is called by the DN coordinating the block recovery. In the above > case, it is possible for none of the rStates (reported by DNs with copies of > the replica being recovered) to match the bestState. This can either be > caused by faulty DN code or stale/modified/corrupted files on the DN. When this > happens the DN will end up reporting a minLength of Long.MAX_VALUE. > Unfortunately there is no check on the NN for replica length. See > FSNamesystem.java: > {code:java} > void commitBlockSynchronization(ExtendedBlock oldBlock,
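Although the description is truncated above, the missing check it argues for is easy to sketch against the quoted code: if no replica's original state matched {{bestState}}, {{minLength}} keeps its {{Long.MAX_VALUE}} sentinel and must never reach {{commitBlockSynchronization}}. A minimal guard, purely as an illustration (not the attached patch):
{code:java}
// Hypothetical guard in DataNode#syncBlock, placed after the RBW/RWR loop
// shown above. An empty participatingList means no replica matched bestState,
// so the sentinel length would otherwise be committed as the block size.
if (participatingList.isEmpty() || minLength == Long.MAX_VALUE) {
  throw new IOException("No replica in state " + bestState
      + " found while recovering " + rBlock.getBlock()
      + "; refusing to commit block length " + minLength);
}
{code}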
[jira] [Updated] (HDFS-9198) Coalesce IBR processing in the NN
[ https://issues.apache.org/jira/browse/HDFS-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated HDFS-9198: -- Attachment: HDFS-9198-trunk.patch Took care of the minor findbugs warning and cleaned up most of the silly style stuff. Some of the complaints about metrics I don't think are valid due to the annotation magic that occurs. Updated the tests to flush the block ops queue to prevent races. Changed the queue offer/add to offer/put. > Coalesce IBR processing in the NN > - > > Key: HDFS-9198 > URL: https://issues.apache.org/jira/browse/HDFS-9198 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: HDFS-9198-branch2.patch, HDFS-9198-trunk.patch, > HDFS-9198-trunk.patch > > > IBRs from thousands of DNs under load will degrade NN performance due to > excessive write-lock contention from multiple IPC handler threads. The IBR > processing is quick, so the lock contention may be reduced by coalescing > multiple IBRs into a single write-lock transaction. The handlers will also > be freed up faster for other operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
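For readers following the approach: the heart of it is moving IBR application off the IPC handler threads onto a queue that a single thread drains in batches, so many reports share one write-lock acquisition; the offer-to-put change above means a full queue applies back-pressure instead of dropping reports. A standalone sketch of the pattern (all names invented; not the patch itself):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of coalesced IBR processing; not the actual patch. */
class IbrProcessor implements Runnable {
  private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1024);

  /** Handler threads enqueue instead of applying the IBR themselves;
   *  put() blocks when the queue is full rather than dropping the report. */
  void enqueue(Runnable ibrOp) throws InterruptedException {
    queue.put(ibrOp);
  }

  @Override
  public void run() {
    List<Runnable> batch = new ArrayList<>();
    while (!Thread.currentThread().isInterrupted()) {
      try {
        batch.add(queue.take());       // wait for at least one report
      } catch (InterruptedException e) {
        return;
      }
      queue.drainTo(batch);            // grab whatever else has piled up
      // namesystem.writeLock();       // one lock acquisition for the batch
      try {
        for (Runnable op : batch) {
          op.run();                    // apply each coalesced IBR
        }
      } finally {
        // namesystem.writeUnlock();
        batch.clear();
      }
    }
  }
}
{code}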
[jira] [Updated] (HDFS-9083) Replication violates block placement policy.
[ https://issues.apache.org/jira/browse/HDFS-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jitendra Nath Pandey updated HDFS-9083: --- Priority: Blocker (was: Major) > Replication violates block placement policy. > > > Key: HDFS-9083 > URL: https://issues.apache.org/jira/browse/HDFS-9083 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS, namenode >Affects Versions: 2.6.0 >Reporter: Rushabh S Shah >Assignee: Rushabh S Shah >Priority: Blocker > > Recently we are noticing many cases in which all the replicas of a block > reside on the same rack. > During block creation, the block placement policy was honored. > But after node failure events in some specific manner, the block ends up in > such a state. > On investigating more, I found out that BlockManager#blockHasEnoughRacks > depends on the config (net.topology.script.file.name) > {noformat} > if (!this.shouldCheckForEnoughRacks) { > return true; > } > {noformat} > We specify a DNSToSwitchMapping implementation (our own custom implementation) > via net.topology.node.switch.mapping.impl and no longer use the > net.topology.script.file.name config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
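The {{noformat}} snippet shows the essence of the bug: {{shouldCheckForEnoughRacks}} is derived only from {{net.topology.script.file.name}}, so a cluster whose topology comes from {{net.topology.node.switch.mapping.impl}} silently skips the rack check. A hedged sketch of one direction a fix could take (illustrative only; not a quoted patch):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.net.DNSToSwitchMapping;
import org.apache.hadoop.net.ScriptBasedMapping;

/** Hypothetical direction for a fix; names and placement are assumptions. */
class RackCheckConfig {
  // Consider rack-awareness configured if either a topology script or a
  // non-default DNSToSwitchMapping implementation is present, instead of
  // keying shouldCheckForEnoughRacks off the script alone.
  static boolean shouldCheckForEnoughRacks(Configuration conf) {
    boolean hasScript = conf.get("net.topology.script.file.name") != null;
    Class<? extends DNSToSwitchMapping> impl = conf.getClass(
        "net.topology.node.switch.mapping.impl",
        ScriptBasedMapping.class, DNSToSwitchMapping.class);
    boolean hasCustomMapping = !ScriptBasedMapping.class.equals(impl);
    return hasScript || hasCustomMapping;
  }
}
{code}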
[jira] [Commented] (HDFS-9220) Reading small file (< 512 bytes) that is open for append fails due to incorrect checksum
[ https://issues.apache.org/jira/browse/HDFS-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959577#comment-14959577 ] Hudson commented on HDFS-9220: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #502 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/502/]) HDFS-9220. Reading small file (< 512 bytes) that is open for append (kihwal: rev c7c36cbd6218f46c33d7fb2f60cd52cb29e6d720) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestFileAppend2.java > Reading small file (< 512 bytes) that is open for append fails due to > incorrect checksum > > > Key: HDFS-9220 > URL: https://issues.apache.org/jira/browse/HDFS-9220 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Bogdan Raducanu >Assignee: Jing Zhao >Priority: Blocker > Fix For: 3.0.0, 2.7.2 > > Attachments: HDFS-9220.000.patch, HDFS-9220.001.patch, > HDFS-9220.002.patch, test2.java > > > Exception: > 2015-10-09 14:59:40 WARN DFSClient:1150 - fetchBlockByteRange(). Got a > checksum exception for /tmp/file0.05355529331575182 at > BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882:0 from > DatanodeInfoWithStorage[10.10.10.10]:5001 > All 3 replicas cause this exception and the read fails entirely with: > BlockMissingException: Could not obtain block: > BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882 > file=/tmp/file0.05355529331575182 > Code to reproduce is attached. > Does not happen in 2.7.0. > Data is read correctly if checksum verification is disabled. > More generally, the failure happens when reading from the last block of a > file and the last block has <= 512 bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
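A hedged, condensed approximation of the reproducer (the attached test2.java is the authoritative version; the snippet below is reconstructed from the description, not the attachment, and assumes the hadoop-hdfs test dependencies for MiniDFSCluster):
{code:java}
// Reconstruction from the description: a file whose last (only) block holds
// fewer than 512 bytes is opened for append and then read back.
Configuration conf = new HdfsConfiguration();
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
try {
  FileSystem fs = cluster.getFileSystem();
  Path p = new Path("/tmp/smallfile");
  try (FSDataOutputStream out = fs.create(p)) {
    out.write(new byte[100]);          // block ends with a partial 512-byte chunk
  }
  FSDataOutputStream append = fs.append(p);
  append.write(new byte[10]);
  append.hflush();                     // file now open for append, still < 512 bytes
  try (FSDataInputStream in = fs.open(p)) {
    in.readFully(new byte[110]);       // on 2.7.1: ChecksumException on every replica
  } finally {
    append.close();
  }
} finally {
  cluster.shutdown();
}
{code}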
[jira] [Commented] (HDFS-9250) LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty
[ https://issues.apache.org/jira/browse/HDFS-9250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959574#comment-14959574 ] Xiao Chen commented on HDFS-9250: - The Findbugs warning and test failure are not relevant. Please review. Thanks. > LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty > --- > > Key: HDFS-9250 > URL: https://issues.apache.org/jira/browse/HDFS-9250 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Reporter: Xiao Chen >Assignee: Xiao Chen > Attachments: HDFS-9250.001.patch > > > We may see the following exception: > {noformat} > java.lang.ArrayStoreException > at java.util.ArrayList.toArray(ArrayList.java:389) > at > org.apache.hadoop.hdfs.protocol.LocatedBlock.addCachedLoc(LocatedBlock.java:205) > at > org.apache.hadoop.hdfs.server.namenode.CacheManager.setCachedLocations(CacheManager.java:907) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1974) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873) > {noformat} > The cause is that in LocatedBlock.java, in {{addCachedLoc}}: > - The passed-in parameter {{loc}}, which is of type {{DatanodeDescriptor}}, is > added to {{cachedList}} > - {{cachedList}} was assigned to {{EMPTY_LOCS}}, which is of type > {{DatanodeInfoWithStorage}}. > Both {{DatanodeDescriptor}} and {{DatanodeInfoWithStorage}} are subclasses of > {{DatanodeInfo}} but do not inherit from each other, resulting in the > ArrayStoreException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
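The covariance trap described in this report is easy to demonstrate outside HDFS. A self-contained illustration with stand-in classes (the real types are {{DatanodeInfo}}, {{DatanodeDescriptor}} and {{DatanodeInfoWithStorage}}; the demo only reproduces the mechanism, not the HDFS code paths):
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Standalone demo of the ArrayStoreException described above. */
public class ArrayStoreDemo {
  static class DatanodeInfo {}                                 // common parent
  static class DatanodeDescriptor extends DatanodeInfo {}      // sibling 1
  static class DatanodeInfoWithStorage extends DatanodeInfo {} // sibling 2

  private static final DatanodeInfoWithStorage[] EMPTY_LOCS =
      new DatanodeInfoWithStorage[0];

  public static void main(String[] args) {
    List<DatanodeInfo> cachedList = new ArrayList<>();
    cachedList.add(new DatanodeDescriptor()); // fine: it is a DatanodeInfo

    // toArray(T[]) allocates the result with the runtime type of the argument
    // array when the list does not fit in it, so it tries to store a
    // DatanodeDescriptor into a DatanodeInfoWithStorage[]. Siblings: this
    // compiles cleanly but throws ArrayStoreException at runtime.
    DatanodeInfo[] locs = cachedList.toArray(EMPTY_LOCS);
  }
}
{code}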
[jira] [Updated] (HDFS-9251) Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in test code.
[ https://issues.apache.org/jira/browse/HDFS-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-9251: Status: Patch Available (was: Open) > Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly > creating Files in test code. > --- > > Key: HDFS-9251 > URL: https://issues.apache.org/jira/browse/HDFS-9251 > Project: Hadoop HDFS > Issue Type: Improvement > Components: HDFS >Affects Versions: 2.7.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-9251.00.patch > > > In {{TestWriteToReplica}} and {{TestFsDatasetImpl}}, tests directly create > block and metadata files: > {code} > replicaInfo.getBlockFile().createNewFile(); > replicaInfo.getMetaFile().createNewFile(); > {code} > It leaks the implementation details of {{FsDatasetImpl}}. This JIRA proposes > to use {{FsDatasetImplTestUtils}} (HDFS-9188) to create replicas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9251) Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in test code.
[ https://issues.apache.org/jira/browse/HDFS-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-9251: Attachment: HDFS-9251.00.patch This patch: * Add {{createFinalizedReplica}}, {{createRbw}}, {{createReplicaInPipeline}} and {{createReplicaWaitingToBeRecovered}} to {{FsDatasetTestUtils}}. * Refactor {{TestFsDatasetImpl}} and {{TestWriteToReplica}} to use them. > Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly > creating Files in test code. > --- > > Key: HDFS-9251 > URL: https://issues.apache.org/jira/browse/HDFS-9251 > Project: Hadoop HDFS > Issue Type: Improvement > Components: HDFS >Affects Versions: 2.7.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-9251.00.patch > > > In {{TestWriteToReplica}} and {{TestFsDatasetImpl}}, tests directly create > block and metadata files: > {code} > replicaInfo.getBlockFile().createNewFile(); > replicaInfo.getMetaFile().createNewFile(); > {code} > It leaks the implementation details of {{FsDatasetImpl}}. This JIRA proposes > to use {{FsDatasetImplTestUtils}} (HDFS-9188) to create replicas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9251) Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in test code.
Lei (Eddy) Xu created HDFS-9251: --- Summary: Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in test code. Key: HDFS-9251 URL: https://issues.apache.org/jira/browse/HDFS-9251 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu In {{TestWriteToReplica}} and {{TestFsDatasetImpl}}, tests directly create block and metadata files: {code} replicaInfo.getBlockFile().createNewFile(); replicaInfo.getMetaFile().createNewFile(); {code} It leaks the implementation details of {{FsDatasetImpl}}. This JIRA proposes to use {{FsDatasetImplTestUtils}} (HDFS-9188) to create replicas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
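A hedged before/after sketch of what the refactored tests might look like; the helper signatures are assumptions based on the patch summary above, not quoted code:
{code:java}
// Hypothetical usage after the refactor: the test asks the dataset
// implementation for replicas in a given state instead of creating
// block/meta files itself.
FsDatasetTestUtils utils = cluster.getFsDatasetTestUtils(datanode);

// Before (leaks FsDatasetImpl's on-disk layout into the test):
//   replicaInfo.getBlockFile().createNewFile();
//   replicaInfo.getMetaFile().createNewFile();

// After (the dataset decides how each replica state materializes on disk):
utils.createFinalizedReplica(block);
utils.createRbw(block);
utils.createReplicaInPipeline(block);
utils.createReplicaWaitingToBeRecovered(block);
{code}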
[jira] [Updated] (HDFS-9184) Logging HDFS operation's caller context into audit logs
[ https://issues.apache.org/jira/browse/HDFS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-9184: Attachment: HDFS-9184.007.patch Thanks for your review, [~jnp]. The v7 patch addresses the latest comments. > Logging HDFS operation's caller context into audit logs > --- > > Key: HDFS-9184 > URL: https://issues.apache.org/jira/browse/HDFS-9184 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HDFS-9184.000.patch, HDFS-9184.001.patch, > HDFS-9184.002.patch, HDFS-9184.003.patch, HDFS-9184.004.patch, > HDFS-9184.005.patch, HDFS-9184.006.patch, HDFS-9184.007.patch > > > For a given HDFS operation (e.g. delete file), it's very helpful to track > which upper-level job issued it. The upper-level callers may be specific > Oozie tasks, MR jobs, and Hive queries. One scenario is that the namenode > (NN) is abused/spammed; the operator may want to know immediately which MR > job should be blamed so that she can kill it. To this end, the caller context > contains at least the application-dependent "tracking id". > There are several existing techniques that may be related to this problem. > 1. Currently the HDFS audit log tracks the user of the operation, which > is obviously not enough. It's common that the same user issues multiple jobs > at the same time. Even for a single top-level task, tracking back to a > specific caller in a chain of operations of the whole workflow (e.g. Oozie -> > Hive -> Yarn) is hard, if not impossible. > 2. HDFS integrated {{htrace}} support for providing tracing information > across multiple layers. The span is created in many places interconnected > like a tree structure, which relies on offline analysis across RPC boundaries. > For this use case, {{htrace}} has to be enabled at a 100% sampling rate, which > introduces significant overhead. Moreover, passing additional information > (via annotations) other than the span id from the root of the tree to a leaf > is significant additional work. > 3. In [HDFS-4680 | https://issues.apache.org/jira/browse/HDFS-4680], there > is some related discussion on this topic. The final patch implemented the > tracking id as a part of the delegation token. This protects the tracking > information from being changed or impersonated. However, Kerberos- > authenticated connections or insecure connections don't have tokens. > [HADOOP-8779] proposes to use tokens in all the scenarios, but that might > mean changes to several upstream projects and is a major change in their > security implementation. > We propose another approach to address this problem. We also treat the HDFS > audit log as a good place for after-the-fact root cause analysis. We propose > to put the caller id (e.g. Hive query id) in threadlocals. Specifically, on > the client side the threadlocal object is passed to the NN as a part of the > RPC header (optional), while on the server side the NN retrieves it from the > header and puts it into the {{Handler}}'s threadlocals. Finally in > {{FSNamesystem}}, the HDFS audit logger will record the caller context for > each operation. In this way, the existing code is not affected. > It is still challenging to keep a "lying" client from abusing the caller > context. Our proposal is to add a {{signature}} field to the caller context. > The client may choose to provide its signature along with the caller id. The > operator may need to validate the signature at the time of offline analysis. > The NN is not responsible for validating the signature online. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
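The client-side half of the HDFS-9184 proposal is small enough to sketch. A simplified, hypothetical version of the threadlocal context (names illustrative; not the committed class):
{code:java}
/** Simplified sketch of a per-thread caller context, as proposed above. */
public final class CallerContextSketch {
  private static final ThreadLocal<CallerContextSketch> CURRENT =
      new ThreadLocal<>();

  private final String context;    // e.g. a Hive query id or MR job id
  private final byte[] signature;  // optional; validated offline, not by the NN

  private CallerContextSketch(String context, byte[] signature) {
    this.context = context;
    this.signature = signature;
  }

  /** A client (Hive, Oozie, ...) sets this before issuing HDFS calls; the
   *  RPC layer copies it into an optional header field, and the NN's handler
   *  restores it into its own threadlocal for the audit logger to read. */
  public static void set(String context, byte[] signature) {
    CURRENT.set(new CallerContextSketch(context, signature));
  }

  public static CallerContextSketch get() {
    return CURRENT.get();
  }

  public String getContext() {
    return context;
  }
}
{code}
Keeping the context in a threadlocal is what lets the existing call paths stay untouched: only the RPC header marshalling and the audit logger need to know the field exists.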
[jira] [Commented] (HDFS-9250) LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty
[ https://issues.apache.org/jira/browse/HDFS-9250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959561#comment-14959561 ] Hadoop QA commented on HDFS-9250: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 20m 24s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 3s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 27s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 57s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 31s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 36s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 3m 15s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 50m 26s | Tests failed in hadoop-hdfs. | | {color:green}+1{color} | hdfs tests | 0m 32s | Tests passed in hadoop-hdfs-client. | | | | 103m 17s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.TestBlockStoragePolicy | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766835/HDFS-9250.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8d2d3eb | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13006/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13006/artifact/patchprocess/testrun_hadoop-hdfs.txt | | hadoop-hdfs-client test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13006/artifact/patchprocess/testrun_hadoop-hdfs-client.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13006/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13006/console | This message was automatically generated. 
> LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty > --- > > Key: HDFS-9250 > URL: https://issues.apache.org/jira/browse/HDFS-9250 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Reporter: Xiao Chen >Assignee: Xiao Chen > Attachments: HDFS-9250.001.patch > > > We may see the following exception: > {noformat} > java.lang.ArrayStoreException > at java.util.ArrayList.toArray(ArrayList.java:389) > at > org.apache.hadoop.hdfs.protocol.LocatedBlock.addCachedLoc(LocatedBlock.java:205) > at > org.apache.hadoop.hdfs.server.namenode.CacheManager.setCachedLocations(CacheManager.java:907) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1974) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873) > {noformat} > The cause is that in LocatedBlock.java, in {{addCachedLoc}}: > - The passed-in parameter {{loc}}, which is of type {{DatanodeDescriptor}}, is > added to {{cachedList}} > - {{cachedList}} was assigned to {{EMPTY_LOCS}}, which is of type > {{DatanodeInfoWithStorage}}. > Both {{DatanodeDescriptor}} and {{DatanodeInfoWithStorage}} are subclasses of > {{DatanodeInfo}} but do not inherit from each other, resulting in the > ArrayStoreException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9129) Move the safemode block count into BlockManager
[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959556#comment-14959556 ] Hadoop QA commented on HDFS-9129: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 20m 35s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 9m 0s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 32s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 21s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 39s | The applied patch generated 30 new checkstyle issues (total was 626, now 602). | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 39s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 43s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 0m 24s | Post-patch findbugs hadoop-hdfs-project/hadoop-hdfs compilation is broken. | | {color:green}+1{color} | findbugs | 0m 24s | The patch does not introduce any new Findbugs (version ) warnings. | | {color:green}+1{color} | native | 0m 31s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 53m 29s | Tests failed in hadoop-hdfs. | | | | 100m 1s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.tools.TestDFSHAAdminMiniCluster | | | hadoop.fs.TestHdfsNativeCodeLoader | | | hadoop.hdfs.security.TestDelegationToken | | | hadoop.hdfs.tools.TestDFSZKFailoverController | | | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12766673/HDFS-9129.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8d2d3eb | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13007/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | Release Audit | https://builds.apache.org/job/PreCommit-HDFS-Build/13007/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13007/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13007/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13007/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13007/console | This message was automatically generated. 
> Move the safemode block count into BlockManager > --- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can be moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9208) Disabling atime may fail clients like distCp
[ https://issues.apache.org/jira/browse/HDFS-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959496#comment-14959496 ] Mingliang Liu commented on HDFS-9208: - Hi [~kihwal], this is on my list this week but I'm not working on it actively. Sorry for the delay. I need more context for the options. I'm assigning it back to you in case you're blocked. > Disabling atime may fail clients like distCp > > > Key: HDFS-9208 > URL: https://issues.apache.org/jira/browse/HDFS-9208 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Mingliang Liu > > When atime is disabled, {{setTimes()}} throws an exception if the passed-in > atime is not -1. But since atime is not -1, distCp fails when it tries to > set the mtime and atime. > There are several options: > 1) make distCp check for 0 atime and call {{setTimes()}} with -1. I am not > very enthusiastic about it. > 2) make NN also accept 0 atime in addition to -1, when the atime support is > disabled. > 3) support setting mtime & atime regardless of the atime support. The main > reason why atime is disabled is to avoid edit logging/syncing during > {{getBlockLocations()}} read calls. Explicit setting can be allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9208) Disabling atime may fail clients like distCp
[ https://issues.apache.org/jira/browse/HDFS-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingliang Liu updated HDFS-9208: Assignee: Kihwal Lee (was: Mingliang Liu) > Disabling atime may fail clients like distCp > > > Key: HDFS-9208 > URL: https://issues.apache.org/jira/browse/HDFS-9208 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Kihwal Lee > > When atime is disabled, {{setTimes()}} throws an exception if the passed-in > atime is not -1. But since atime is not -1, distCp fails when it tries to > set the mtime and atime. > There are several options: > 1) make distCp check for 0 atime and call {{setTimes()}} with -1. I am not > very enthusiastic about it. > 2) make NN also accept 0 atime in addition to -1, when the atime support is > disabled. > 3) support setting mtime & atime regardless of the atime support. The main > reason why atime is disabled is to avoid edit logging/syncing during > {{getBlockLocations()}} read calls. Explicit setting can be allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
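Of the three options, option 2 amounts to a one-line relaxation of the NN-side guard. A hypothetical sketch (variable names assumed; not a patch):
{code:java}
// Hypothetical NN-side check for option 2: with atime support disabled,
// accept 0 as well as -1, since distCp forwards the source file's atime of 0
// when access times were never recorded.
if (atimePrecision == 0 && atime != -1 && atime != 0) {
  throw new IOException("Access time for HDFS is not configured. "
      + "Set dfs.namenode.accesstime.precision to enable it.");
}
{code}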
[jira] [Commented] (HDFS-9220) Reading small file (< 512 bytes) that is open for append fails due to incorrect checksum
[ https://issues.apache.org/jira/browse/HDFS-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959475#comment-14959475 ] Hudson commented on HDFS-9220: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #538 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/538/]) HDFS-9220. Reading small file (< 512 bytes) that is open for append (kihwal: rev c7c36cbd6218f46c33d7fb2f60cd52cb29e6d720) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestFileAppend2.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java > Reading small file (< 512 bytes) that is open for append fails due to > incorrect checksum > > > Key: HDFS-9220 > URL: https://issues.apache.org/jira/browse/HDFS-9220 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Bogdan Raducanu >Assignee: Jing Zhao >Priority: Blocker > Fix For: 3.0.0, 2.7.2 > > Attachments: HDFS-9220.000.patch, HDFS-9220.001.patch, > HDFS-9220.002.patch, test2.java > > > Exception: > 2015-10-09 14:59:40 WARN DFSClient:1150 - fetchBlockByteRange(). Got a > checksum exception for /tmp/file0.05355529331575182 at > BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882:0 from > DatanodeInfoWithStorage[10.10.10.10]:5001 > All 3 replicas cause this exception and the read fails entirely with: > BlockMissingException: Could not obtain block: > BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882 > file=/tmp/file0.05355529331575182 > Code to reproduce is attached. > Does not happen in 2.7.0. > Data is read correctly if checksum verification is disabled. > More generally, the failure happens when reading from the last block of a > file and the last block has <= 512 bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9079) Erasure coding: preallocate multiple generation stamps and serialize updates from data streamers
[ https://issues.apache.org/jira/browse/HDFS-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HDFS-9079: Attachment: HDFS-9079.03.patch Updating the patch to fix all reported test failures. The main change is adding logic to get a new block token when the current one expires. I'm currently working on adding Javadocs and fixing exception handling. > Erasure coding: preallocate multiple generation stamps and serialize updates > from data streamers > > > Key: HDFS-9079 > URL: https://issues.apache.org/jira/browse/HDFS-9079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: erasure-coding >Affects Versions: HDFS-7285 >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-9079-HDFS-7285.00.patch, HDFS-9079.01.patch, > HDFS-9079.02.patch, HDFS-9079.03.patch > > > A non-striped DataStreamer goes through the following steps in error handling: > {code} > 1) Finds error => 2) Asks NN for new GS => 3) Gets new GS from NN => 4) > Applies new GS to DN (createBlockOutputStream) => 5) Ack from DN => 6) > Updates block on NN > {code} > To simplify the above, we can preallocate GS when the NN creates a new striped > block group ({{FSN#createNewBlock}}). For each new striped block group we can > reserve {{NUM_PARITY_BLOCKS}} GS's. Then steps 1~3 in the above sequence can > be saved. If more than {{NUM_PARITY_BLOCKS}} errors have happened we > shouldn't try to further recover anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
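The reservation idea in the description lends itself to a small sketch. This is an illustrative data structure only; the class and method names are invented, not from the patch. The NN hands the streamer a contiguous range of {{NUM_PARITY_BLOCKS}} generation stamps, and the streamer consumes them locally on failure instead of performing steps 1~3.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of preallocated generation stamps for one striped
// block group: consume locally on error, no NN round trip needed.
class ReservedGenerationStamps {
  private final long maxStamp;         // last stamp in the reserved range
  private final AtomicLong nextStamp;  // next stamp to hand out

  ReservedGenerationStamps(long firstStamp, int numParityBlocks) {
    this.nextStamp = new AtomicLong(firstStamp);
    this.maxStamp = firstStamp + numParityBlocks - 1;
  }

  /** Returns the next reserved GS, or -1 when the range is exhausted;
   *  more failures than parity blocks are unrecoverable anyway. */
  long nextGenerationStamp() {
    long gs = nextStamp.getAndIncrement();
    return gs <= maxStamp ? gs : -1;
  }
}
{code}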
[jira] [Commented] (HDFS-9198) Coalesce IBR processing in the NN
[ https://issues.apache.org/jira/browse/HDFS-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959448#comment-14959448 ] Daryn Sharp commented on HDFS-9198: --- Most of the test failures are due to a race condition from IBRs no longer being synchronous. Will update shortly. All you watchers, any comments on the approach? > Coalesce IBR processing in the NN > - > > Key: HDFS-9198 > URL: https://issues.apache.org/jira/browse/HDFS-9198 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: HDFS-9198-branch2.patch, HDFS-9198-trunk.patch > > > IBRs from thousands of DNs under load will degrade NN performance due to > excessive write-lock contention from multiple IPC handler threads. The IBR > processing is quick, so the lock contention may be reduced by coalescing > multiple IBRs into a single write-lock transaction. The handlers will also > be freed up faster for other operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
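For readers following along, the coalescing described above reduces write-lock acquisitions from one per IBR to one per batch. A rough sketch of the pattern; all names here are invented, and the actual patch wires this into the NN's existing handler and lock machinery rather than a standalone class:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of coalescing: IPC handlers enqueue incremental block
// reports (IBRs) and return immediately; one thread drains the queue and
// applies the whole batch under a single write-lock acquisition.
class IbrCoalescer implements Runnable {
  interface Report { void apply(); }      // stand-in for an IBR

  private final BlockingQueue<Report> queue = new LinkedBlockingQueue<>();
  private final ReentrantReadWriteLock namesystemLock;

  IbrCoalescer(ReentrantReadWriteLock lock) { this.namesystemLock = lock; }

  void enqueue(Report r) { queue.add(r); }  // called by IPC handlers

  @Override
  public void run() {
    List<Report> batch = new ArrayList<>();
    while (!Thread.currentThread().isInterrupted()) {
      try {
        batch.add(queue.take());            // block for at least one IBR
      } catch (InterruptedException e) {
        return;
      }
      queue.drainTo(batch);                 // grab whatever else is pending
      namesystemLock.writeLock().lock();    // one lock round trip per batch
      try {
        for (Report r : batch) {
          r.apply();
        }
      } finally {
        namesystemLock.writeLock().unlock();
      }
      batch.clear();
    }
  }
}
{code}
Handlers return as soon as the report is queued, which is also what frees them up faster for other operations.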
[jira] [Updated] (HDFS-7964) Add support for async edit logging
[ https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated HDFS-7964: -- Attachment: HDFS-7964.patch Updated, simplified. > Add support for async edit logging > -- > > Key: HDFS-7964 > URL: https://issues.apache.org/jira/browse/HDFS-7964 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 2.0.2-alpha >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Attachments: HDFS-7964.patch, HDFS-7964.patch > > > Edit logging is a major source of contention within the NN. LogEdit is > called within the namespace write lock, while logSync is called outside of the > lock to allow greater concurrency. The handler thread remains busy until > logSync returns to provide the client with a durability guarantee for the > response. > Write-heavy RPC load and/or slow IO causes handlers to stall in logSync. > Although the write lock is not held, readers are limited/starved and the call > queue fills. Combining an edit log thread with postponed RPC responses from > HADOOP-10300 will provide the same durability guarantee but immediately free > up the handlers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
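A compact sketch of the async pattern the description outlines, under the assumption (from HADOOP-10300) that RPC responses can be postponed; all class and method names here are illustrative, not the patch's actual API:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: logEdit enqueues and returns immediately; a single
// background thread batches edits, syncs once per batch, then completes the
// futures so the postponed RPC responses can be sent.
class AsyncEditLog implements Runnable {
  static final class PendingEdit {
    final byte[] op;
    final CompletableFuture<Void> durable = new CompletableFuture<>();
    PendingEdit(byte[] op) { this.op = op; }
  }

  private final BlockingQueue<PendingEdit> queue = new LinkedBlockingQueue<>();

  /** Called by a handler under the namespace lock; never blocks on IO. */
  CompletableFuture<Void> logEdit(byte[] op) {
    PendingEdit e = new PendingEdit(op);
    queue.add(e);
    return e.durable;
  }

  @Override
  public void run() {
    List<PendingEdit> batch = new ArrayList<>();
    try {
      while (true) {
        batch.add(queue.take());
        queue.drainTo(batch);
        // ... write all ops in the batch to the journal, then fsync once ...
        for (PendingEdit e : batch) {
          e.durable.complete(null);  // durable: safe to answer the client
        }
        batch.clear();
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}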
[jira] [Commented] (HDFS-9208) Disabling atime may fail clients like distCp
[ https://issues.apache.org/jira/browse/HDFS-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959431#comment-14959431 ] Kihwal Lee commented on HDFS-9208: -- [~liuml07] Are you actively working on this? > Disabling atime may fail clients like distCp > > > Key: HDFS-9208 > URL: https://issues.apache.org/jira/browse/HDFS-9208 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Mingliang Liu > > When atime is disabled, {{setTimes()}} throws an exception if the passed-in > atime is not -1. But since atime is not -1, distCp fails when it tries to > set the mtime and atime. > There are several options: > 1) make distCp check for 0 atime and call {{setTimes()}} with -1. I am not > very enthusiastic about it. > 2) make NN also accept 0 atime in addition to -1, when the atime support is > disabled. > 3) support setting mtime & atime regardless of the atime support. The main > reason why atime is disabled is to avoid edit logging/syncing during > {{getBlockLocations()}} read calls. Explicit setting can be allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9249) NPE thrown if an IOException is thrown in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-9249: -- Status: Patch Available (was: Open) > NPE thrown if an IOException is thrown in NameNode. > - > > Key: HDFS-9249 > URL: https://issues.apache.org/jira/browse/HDFS-9249 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Minor > Labels: supportability > Attachments: HDFS-9249.001.patch > > > This issue was found when running test case > TestBackupNode.testCheckpointNode, but upon closer look, the problem is not > due to the test case. > Looks like an IOException was thrown in > try { > initializeGenericKeys(conf, nsId, namenodeId); > initialize(conf); > try { > haContext.writeLock(); > state.prepareToEnterState(haContext); > state.enterState(haContext); > } finally { > haContext.writeUnlock(); > } > causing the namenode to stop, but the namesystem was not yet properly > instantiated, causing an NPE. > I tried to reproduce locally, but to no avail. > Because I could not reproduce the bug, and the log does not indicate what > caused the IOException, I suggest making this a supportability JIRA to log the > exception for future improvement. > Stacktrace > java.lang.NullPointerException: null > at > org.apache.hadoop.hdfs.server.namenode.NameNode.getFSImage(NameNode.java:906) > at org.apache.hadoop.hdfs.server.namenode.BackupNode.stop(BackupNode.java:210) > at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:827) > at > org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1474) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.startBackupNode(TestBackupNode.java:102) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint(TestBackupNode.java:298) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpointNode(TestBackupNode.java:130) > The last few lines of log: > 2015-10-14 19:45:07,807 INFO namenode.NameNode > (NameNode.java:createNameNode(1422)) - createNameNode [-checkpoint] > 2015-10-14 19:45:07,807 INFO impl.MetricsSystemImpl > (MetricsSystemImpl.java:init(158)) - CheckpointNode metrics system started > (again) > 2015-10-14 19:45:07,808 INFO namenode.NameNode > (NameNode.java:setClientNamenodeAddress(402)) - fs.defaultFS is > hdfs://localhost:37835 > 2015-10-14 19:45:07,808 INFO namenode.NameNode > (NameNode.java:setClientNamenodeAddress(422)) - Clients are to use > localhost:37835 to access this namenode/service.
> 2015-10-14 19:45:07,810 INFO hdfs.MiniDFSCluster > (MiniDFSCluster.java:shutdown(1708)) - Shutting down the Mini HDFS Cluster > 2015-10-14 19:45:07,810 INFO namenode.FSNamesystem > (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for > active state > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:endCurrentLogSegment(1228)) - Ending log segment 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5306)) - NameNodeEditLogRoller was interrupted, exiting > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:printStatistics(703)) - Number of transactions: 3 Total time > for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of > syncs: 4 SyncTimes(ms): 2 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5373)) - LazyPersistFileScrubber was interrupted, > exiting > 2015-10-14 19:45:07,822 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_001-003 > 2015-10-14 19:45:07,835 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_001-003 > 2015-10-14 19:45:07,836 INFO blockmanagement.CacheReplicationMonitor > (CacheReplicationMonitor.java:run(169)) - Shutting down > CacheReplicationMonitor > 2015-10-14 19:45:07,836 INFO ipc.Server (Server.java:stop(2485)) - Stopp
[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery
[ https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959417#comment-14959417 ] Yongjun Zhang commented on HDFS-9236: - Hi [~twu], thanks for the updated rev 3, which looks reasonable to me. Hi [~kihwal], would you please help take a look? Really appreciate it. Thanks. > Missing sanity check for block size during block recovery > - > > Key: HDFS-9236 > URL: https://issues.apache.org/jira/browse/HDFS-9236 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Affects Versions: 2.7.1 >Reporter: Tony Wu >Assignee: Tony Wu > Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, > HDFS-9236.003.patch > > > Ran into an issue while running a test against faulty DataNode code. > Currently in DataNode.java: > {code:java} > /** Block synchronization */ > void syncBlock(RecoveringBlock rBlock, > List<BlockRecord> syncList) throws IOException { > … > // Calculate the best available replica state. > ReplicaState bestState = ReplicaState.RWR; > … > // Calculate list of nodes that will participate in the recovery > // and the new block size > List<BlockRecord> participatingList = new ArrayList<BlockRecord>(); > final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId, > -1, recoveryId); > switch(bestState) { > … > case RBW: > case RWR: > long minLength = Long.MAX_VALUE; > for(BlockRecord r : syncList) { > ReplicaState rState = r.rInfo.getOriginalReplicaState(); > if(rState == bestState) { > minLength = Math.min(minLength, r.rInfo.getNumBytes()); > participatingList.add(r); > } > } > newBlock.setNumBytes(minLength); > break; > … > } > … > nn.commitBlockSynchronization(block, > newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false, > datanodes, storages); > } > {code} > This code is called by the DN coordinating the block recovery. In the above > case, it is possible for none of the rState (reported by DNs with copies of > the replica being recovered) to match the bestState. This can either be > caused by faulty DN code or stale/modified/corrupted files on the DN. When this > happens the DN will end up reporting a minLength of Long.MAX_VALUE. > Unfortunately there is no check on the NN for replica length. See > FSNamesystem.java: > {code:java} > void commitBlockSynchronization(ExtendedBlock oldBlock, > long newgenerationstamp, long newlength, > boolean closeFile, boolean deleteblock, DatanodeID[] newtargets, > String[] newtargetstorages) throws IOException { > … > if (deleteblock) { > Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock); > boolean remove = iFile.removeLastBlock(blockToDel) != null; > if (remove) { > blockManager.removeBlock(storedBlock); > } > } else { > // update last block > if(!copyTruncate) { > storedBlock.setGenerationStamp(newgenerationstamp); > > // XXX block length is updated without any check <<< storedBlock.setNumBytes(newlength); > } > … > if (closeFile) { > LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock > + ", file=" + src > + (copyTruncate ? ", newBlock=" + truncatedBlock > : ", newgenerationstamp=" + newgenerationstamp) > + ", newlength=" + newlength > + ", newtargets=" + Arrays.asList(newtargets) + ") successful"); > } else { > LOG.info("commitBlockSynchronization(" + oldBlock + ") successful"); > } > } > {code} > After this point the block length becomes Long.MAX_VALUE. Any subsequent > block report (even with the correct length) will cause the block to be marked as > corrupted. This block could be the last block of the file.
If this > happens and the client goes away, the NN won’t be able to recover the lease and > close the file because the last block is under-replicated. > I believe we need to have a sanity check for block size on both the DN and NN to > prevent such a case from happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
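The proposed fix is a length sanity check on both sides. A minimal sketch of what such a check could look like, assuming a shared helper; the class and method names are invented for illustration, and the actual patch integrates the checks into DataNode and FSNamesystem directly:
{code:java}
import java.io.IOException;

// Hypothetical validation usable from both the DN recovery coordinator
// (before calling commitBlockSynchronization) and the NN (before updating
// the stored block): reject lengths that can only come from a failed
// length computation, e.g. a minLength never updated from Long.MAX_VALUE.
final class BlockRecoverySanityCheck {
  private BlockRecoverySanityCheck() {}

  static void checkRecoveredLength(String block, long newLength)
      throws IOException {
    if (newLength < 0 || newLength == Long.MAX_VALUE) {
      throw new IOException("Rejecting block recovery for " + block
          + ": implausible recovered length " + newLength);
    }
  }
}
{code}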
[jira] [Updated] (HDFS-9249) NPE thrown if an IOException is thrown in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-9249: -- Attachment: HDFS-9249.001.patch Check for the null pointer, and add more verbose log info when an IOException is thrown. > NPE thrown if an IOException is thrown in NameNode. > - > > Key: HDFS-9249 > URL: https://issues.apache.org/jira/browse/HDFS-9249 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Minor > Labels: supportability > Attachments: HDFS-9249.001.patch > > > This issue was found when running test case > TestBackupNode.testCheckpointNode, but upon closer look, the problem is not > due to the test case. > Looks like an IOException was thrown in > try { > initializeGenericKeys(conf, nsId, namenodeId); > initialize(conf); > try { > haContext.writeLock(); > state.prepareToEnterState(haContext); > state.enterState(haContext); > } finally { > haContext.writeUnlock(); > } > causing the namenode to stop, but the namesystem was not yet properly > instantiated, causing an NPE. > I tried to reproduce locally, but to no avail. > Because I could not reproduce the bug, and the log does not indicate what > caused the IOException, I suggest making this a supportability JIRA to log the > exception for future improvement. > Stacktrace > java.lang.NullPointerException: null > at > org.apache.hadoop.hdfs.server.namenode.NameNode.getFSImage(NameNode.java:906) > at org.apache.hadoop.hdfs.server.namenode.BackupNode.stop(BackupNode.java:210) > at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:827) > at > org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1474) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.startBackupNode(TestBackupNode.java:102) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint(TestBackupNode.java:298) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpointNode(TestBackupNode.java:130) > The last few lines of log: > 2015-10-14 19:45:07,807 INFO namenode.NameNode > (NameNode.java:createNameNode(1422)) - createNameNode [-checkpoint] > 2015-10-14 19:45:07,807 INFO impl.MetricsSystemImpl > (MetricsSystemImpl.java:init(158)) - CheckpointNode metrics system started > (again) > 2015-10-14 19:45:07,808 INFO namenode.NameNode > (NameNode.java:setClientNamenodeAddress(402)) - fs.defaultFS is > hdfs://localhost:37835 > 2015-10-14 19:45:07,808 INFO namenode.NameNode > (NameNode.java:setClientNamenodeAddress(422)) - Clients are to use > localhost:37835 to access this namenode/service.
> 2015-10-14 19:45:07,810 INFO hdfs.MiniDFSCluster > (MiniDFSCluster.java:shutdown(1708)) - Shutting down the Mini HDFS Cluster > 2015-10-14 19:45:07,810 INFO namenode.FSNamesystem > (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for > active state > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:endCurrentLogSegment(1228)) - Ending log segment 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5306)) - NameNodeEditLogRoller was interrupted, exiting > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:printStatistics(703)) - Number of transactions: 3 Total time > for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of > syncs: 4 SyncTimes(ms): 2 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5373)) - LazyPersistFileScrubber was interrupted, > exiting > 2015-10-14 19:45:07,822 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_001-003 > 2015-10-14 19:45:07,835 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_001-003 > 2015-10-14 19:45:07,836 INFO blockmanagement.CacheReplicationMonitor > (CacheReplicationMonitor.java:run(169)) - Shutting down > CacheReplicationMon
[jira] [Commented] (HDFS-9249) NPE thrown if an IOException is thrown in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959412#comment-14959412 ] Wei-Chiu Chuang commented on HDFS-9249: --- Hadoop JIRA does not allow me to edit comments. But what I am saying is that instead of wildly guessing that Kerberos is to blame, we should add more log info to expose the problem when it happens. After all, this is a bug that rarely happens. > NPE thrown if an IOException is thrown in NameNode. > - > > Key: HDFS-9249 > URL: https://issues.apache.org/jira/browse/HDFS-9249 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Minor > Labels: supportability > > This issue was found when running test case > TestBackupNode.testCheckpointNode, but upon closer look, the problem is not > due to the test case. > Looks like an IOException was thrown in > try { > initializeGenericKeys(conf, nsId, namenodeId); > initialize(conf); > try { > haContext.writeLock(); > state.prepareToEnterState(haContext); > state.enterState(haContext); > } finally { > haContext.writeUnlock(); > } > causing the namenode to stop, but the namesystem was not yet properly > instantiated, causing an NPE. > I tried to reproduce locally, but to no avail. > Because I could not reproduce the bug, and the log does not indicate what > caused the IOException, I suggest making this a supportability JIRA to log the > exception for future improvement. > Stacktrace > java.lang.NullPointerException: null > at > org.apache.hadoop.hdfs.server.namenode.NameNode.getFSImage(NameNode.java:906) > at org.apache.hadoop.hdfs.server.namenode.BackupNode.stop(BackupNode.java:210) > at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:827) > at > org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1474) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.startBackupNode(TestBackupNode.java:102) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint(TestBackupNode.java:298) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpointNode(TestBackupNode.java:130) > The last few lines of log: > 2015-10-14 19:45:07,807 INFO namenode.NameNode > (NameNode.java:createNameNode(1422)) - createNameNode [-checkpoint] > 2015-10-14 19:45:07,807 INFO impl.MetricsSystemImpl > (MetricsSystemImpl.java:init(158)) - CheckpointNode metrics system started > (again) > 2015-10-14 19:45:07,808 INFO namenode.NameNode > (NameNode.java:setClientNamenodeAddress(402)) - fs.defaultFS is > hdfs://localhost:37835 > 2015-10-14 19:45:07,808 INFO namenode.NameNode > (NameNode.java:setClientNamenodeAddress(422)) - Clients are to use > localhost:37835 to access this namenode/service.
> 2015-10-14 19:45:07,810 INFO hdfs.MiniDFSCluster > (MiniDFSCluster.java:shutdown(1708)) - Shutting down the Mini HDFS Cluster > 2015-10-14 19:45:07,810 INFO namenode.FSNamesystem > (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for > active state > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:endCurrentLogSegment(1228)) - Ending log segment 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5306)) - NameNodeEditLogRoller was interrupted, exiting > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:printStatistics(703)) - Number of transactions: 3 Total time > for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of > syncs: 4 SyncTimes(ms): 2 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5373)) - LazyPersistFileScrubber was interrupted, > exiting > 2015-10-14 19:45:07,822 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_001-003 > 2015-10-14 19:45:07,835 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_001-003 > 2015-10-14 19:45:07,
[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery
[ https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Wu updated HDFS-9236: -- Attachment: HDFS-9236.003.patch Addressed [~yzhangal]'s review comments. > Missing sanity check for block size during block recovery > - > > Key: HDFS-9236 > URL: https://issues.apache.org/jira/browse/HDFS-9236 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Affects Versions: 2.7.1 >Reporter: Tony Wu >Assignee: Tony Wu > Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, > HDFS-9236.003.patch > > > Ran into an issue while running a test against faulty DataNode code. > Currently in DataNode.java: > {code:java} > /** Block synchronization */ > void syncBlock(RecoveringBlock rBlock, > List<BlockRecord> syncList) throws IOException { > … > // Calculate the best available replica state. > ReplicaState bestState = ReplicaState.RWR; > … > // Calculate list of nodes that will participate in the recovery > // and the new block size > List<BlockRecord> participatingList = new ArrayList<BlockRecord>(); > final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId, > -1, recoveryId); > switch(bestState) { > … > case RBW: > case RWR: > long minLength = Long.MAX_VALUE; > for(BlockRecord r : syncList) { > ReplicaState rState = r.rInfo.getOriginalReplicaState(); > if(rState == bestState) { > minLength = Math.min(minLength, r.rInfo.getNumBytes()); > participatingList.add(r); > } > } > newBlock.setNumBytes(minLength); > break; > … > } > … > nn.commitBlockSynchronization(block, > newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false, > datanodes, storages); > } > {code} > This code is called by the DN coordinating the block recovery. In the above > case, it is possible for none of the rState (reported by DNs with copies of > the replica being recovered) to match the bestState. This can either be > caused by faulty DN code or stale/modified/corrupted files on the DN. When this > happens the DN will end up reporting a minLength of Long.MAX_VALUE. > Unfortunately there is no check on the NN for replica length. See > FSNamesystem.java: > {code:java} > void commitBlockSynchronization(ExtendedBlock oldBlock, > long newgenerationstamp, long newlength, > boolean closeFile, boolean deleteblock, DatanodeID[] newtargets, > String[] newtargetstorages) throws IOException { > … > if (deleteblock) { > Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock); > boolean remove = iFile.removeLastBlock(blockToDel) != null; > if (remove) { > blockManager.removeBlock(storedBlock); > } > } else { > // update last block > if(!copyTruncate) { > storedBlock.setGenerationStamp(newgenerationstamp); > > // XXX block length is updated without any check <<< storedBlock.setNumBytes(newlength); > } > … > if (closeFile) { > LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock > + ", file=" + src > + (copyTruncate ? ", newBlock=" + truncatedBlock > : ", newgenerationstamp=" + newgenerationstamp) > + ", newlength=" + newlength > + ", newtargets=" + Arrays.asList(newtargets) + ") successful"); > } else { > LOG.info("commitBlockSynchronization(" + oldBlock + ") successful"); > } > } > {code} > After this point the block length becomes Long.MAX_VALUE. Any subsequent > block report (even with the correct length) will cause the block to be marked as > corrupted. This block could be the last block of the file. If this > happens and the client goes away, the NN won’t be able to recover the lease and > close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both the DN and NN to > prevent such a case from happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9249) NPE thrown if an IOException is thrown in NameNode.
[ https://issues.apache.org/jira/browse/HDFS-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959400#comment-14959400 ] Wei-Chiu Chuang commented on HDFS-9249: --- [~ste...@apache.org] Thanks for the suggestion. The exception was thrown when auth is the default (i.e. SIMPLE). I did what you suggested, and instead of an NPE at BackupNode, an IOException is thrown by NameNode; but unlike BackupNode.stop(), NameNode.stop() checks whether namesystem is null. Additionally, I looked further and found there are other IOException possibilities in other places. So I think in addition to logging the exception, BackupNode should also check for the null pointer. > NPE thrown if an IOException is thrown in NameNode. > - > > Key: HDFS-9249 > URL: https://issues.apache.org/jira/browse/HDFS-9249 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Minor > Labels: supportability > > This issue was found when running test case > TestBackupNode.testCheckpointNode, but upon closer look, the problem is not > due to the test case. > Looks like an IOException was thrown in > try { > initializeGenericKeys(conf, nsId, namenodeId); > initialize(conf); > try { > haContext.writeLock(); > state.prepareToEnterState(haContext); > state.enterState(haContext); > } finally { > haContext.writeUnlock(); > } > causing the namenode to stop, but the namesystem was not yet properly > instantiated, causing an NPE. > I tried to reproduce locally, but to no avail. > Because I could not reproduce the bug, and the log does not indicate what > caused the IOException, I suggest making this a supportability JIRA to log the > exception for future improvement. > Stacktrace > java.lang.NullPointerException: null > at > org.apache.hadoop.hdfs.server.namenode.NameNode.getFSImage(NameNode.java:906) > at org.apache.hadoop.hdfs.server.namenode.BackupNode.stop(BackupNode.java:210) > at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:827) > at > org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1474) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.startBackupNode(TestBackupNode.java:102) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint(TestBackupNode.java:298) > at > org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpointNode(TestBackupNode.java:130) > The last few lines of log: > 2015-10-14 19:45:07,807 INFO namenode.NameNode > (NameNode.java:createNameNode(1422)) - createNameNode [-checkpoint] > 2015-10-14 19:45:07,807 INFO impl.MetricsSystemImpl > (MetricsSystemImpl.java:init(158)) - CheckpointNode metrics system started > (again) > 2015-10-14 19:45:07,808 INFO namenode.NameNode > (NameNode.java:setClientNamenodeAddress(402)) - fs.defaultFS is > hdfs://localhost:37835 > 2015-10-14 19:45:07,808 INFO namenode.NameNode > (NameNode.java:setClientNamenodeAddress(422)) - Clients are to use > localhost:37835 to access this namenode/service.
> 2015-10-14 19:45:07,810 INFO hdfs.MiniDFSCluster > (MiniDFSCluster.java:shutdown(1708)) - Shutting down the Mini HDFS Cluster > 2015-10-14 19:45:07,810 INFO namenode.FSNamesystem > (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for > active state > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:endCurrentLogSegment(1228)) - Ending log segment 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5306)) - NameNodeEditLogRoller was interrupted, exiting > 2015-10-14 19:45:07,811 INFO namenode.FSEditLog > (FSEditLog.java:printStatistics(703)) - Number of transactions: 3 Total time > for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of > syncs: 4 SyncTimes(ms): 2 1 > 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem > (FSNamesystem.java:run(5373)) - LazyPersistFileScrubber was interrupted, > exiting > 2015-10-14 19:45:07,822 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_001 > -> > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_001-003 > 2015-10-14 19:45:07,835 INFO namenode.FileJournalManager > (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file > /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/dat
[jira] [Commented] (HDFS-9173) Erasure Coding: Lease recovery for striped file
[ https://issues.apache.org/jira/browse/HDFS-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959398#comment-14959398 ] Rakesh R commented on HDFS-9173: Awesome work [~walter.k.su], it's really interesting. I have a few comments: # {{getSafeLength}} -> Could you please tell me the behavior for a file that doesn't have any block with the full cell size of 64k. For example, all the blocks have fewer than CELL_SIZE bytes. Say CELL_SIZE=64 * 1024; now the blocks are blk1=1024 blk2=2*1024 blk3=3*1024 blk4=4*1024 blk5=5*1024 blk6=6*1024 bytes, etc. Please add a test if not included. # The patch is quite large; just a suggestion to simplify the review & patch rework effort. I could see you have created the {{BlockRecoveryWorker}} and {{RecoveryTask}} classes to segregate the logic and have done a few improvements. It's a nice idea. If everyone agrees, it would be good to create another jira to do these pre-requisite changes separately, which can safely be pushed in no time. Later in this jira, we will do a focused review/test of the {{StripedRecoveryTask#recover()}} related logic. Any thoughts? > Erasure Coding: Lease recovery for striped file > --- > > Key: HDFS-9173 > URL: https://issues.apache.org/jira/browse/HDFS-9173 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Walter Su >Assignee: Walter Su > Attachments: HDFS-9173.00.wip.patch, HDFS-9173.01.patch, > HDFS-9173.02.step125.patch, HDFS-9173.03.patch, HDFS-9173.04.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9184) Logging HDFS operation's caller context into audit logs
[ https://issues.apache.org/jira/browse/HDFS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959385#comment-14959385 ] Jitendra Nath Pandey commented on HDFS-9184: [~liuml07], I think it will be a good idea to move the new configurations to common instead of having them in hdfs, because CallerContext is defined in common. > Logging HDFS operation's caller context into audit logs > --- > > Key: HDFS-9184 > URL: https://issues.apache.org/jira/browse/HDFS-9184 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HDFS-9184.000.patch, HDFS-9184.001.patch, > HDFS-9184.002.patch, HDFS-9184.003.patch, HDFS-9184.004.patch, > HDFS-9184.005.patch, HDFS-9184.006.patch > > > For a given HDFS operation (e.g. delete file), it's very helpful to track > which upper level job issues it. The upper level callers may be specific > Oozie tasks, MR jobs, and hive queries. One scenario is that the namenode > (NN) is abused/spammed; the operator may want to know immediately which MR > job should be blamed so that she can kill it. To this end, the caller context > contains at least the application-dependent "tracking id". > There are several existing techniques that may be related to this problem. > 1. Currently the HDFS audit log tracks the user of the operation, which > is obviously not enough. It's common that the same user issues multiple jobs > at the same time. Even for a single top level task, tracking back to a > specific caller in a chain of operations of the whole workflow (e.g. Oozie -> > Hive -> Yarn) is hard, if not impossible. > 2. HDFS integrated {{htrace}} support for providing tracing information > across multiple layers. The span is created in many places, interconnected > like a tree structure, which relies on offline analysis across the RPC boundary. > For this use case, {{htrace}} has to be enabled at a 100% sampling rate, which > introduces significant overhead. Moreover, passing additional information > (via annotations) other than the span id from the root of the tree to a leaf is > significant additional work. > 3. In [HDFS-4680 | https://issues.apache.org/jira/browse/HDFS-4680], there > is some related discussion on this topic. The final patch implemented the > tracking id as a part of the delegation token. This protects the tracking > information from being changed or impersonated. However, kerberos > authenticated connections or insecure connections don't have tokens. > [HADOOP-8779] proposes to use tokens in all the scenarios, but that might > mean changes to several upstream projects and is a major change in their > security implementation. > We propose another approach to address this problem. We also treat the HDFS audit > log as a good place for after-the-fact root cause analysis. We propose to put > the caller id (e.g. Hive query id) in threadlocals. Specifically, on the client side > the threadlocal object is passed to the NN as a part of the RPC header (optional), > while on the server side the NN retrieves it from the header and puts it into the {{Handler}}'s > threadlocals. Finally in {{FSNamesystem}}, the HDFS audit logger will record the > caller context for each operation. In this way, the existing code is not > affected. > It is still challenging to keep a "lying" client from abusing the caller > context. Our proposal is to add a {{signature}} field to the caller context. > The client may choose to provide its signature along with the caller id. The > operator may need to validate the signature at the time of offline analysis.
> The NN is not responsible for validating the signature online. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
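The threadlocal mechanism in the proposal is easy to sketch. The following is a toy illustration, not the CallerContext class the patch adds; the RPC-header plumbing between client and server is elided, and the query id and audit fields are made up:
{code:java}
// Toy sketch of the proposal: the caller id rides in a thread-local on the
// client, is copied into the (optional) RPC header, and is restored into
// the handler's thread on the server before the audit log line is written.
final class SimpleCallerContext {
  private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

  static void set(String context) { CURRENT.set(context); }
  static String get() { return CURRENT.get(); }
  static void clear() { CURRENT.remove(); }

  public static void main(String[] args) {
    // Client side, e.g. a Hive driver, before issuing HDFS calls:
    SimpleCallerContext.set("hive_query_id:abc-123");
    // Server side, in the audit logger, after the RPC layer restores it:
    System.out.println("cmd=delete ugi=alice callerContext="
        + SimpleCallerContext.get());
  }
}
{code}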
[jira] [Resolved] (HDFS-9207) Move the implementation to the hdfs-native-client module
[ https://issues.apache.org/jira/browse/HDFS-9207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai resolved HDFS-9207. -- Resolution: Fixed Committed to the HDFS-8707 branch. Thanks James and Bob for the reviews! > Move the implementation to the hdfs-native-client module > > > Key: HDFS-9207 > URL: https://issues.apache.org/jira/browse/HDFS-9207 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-9207.000.patch > > > The implementation of libhdfspp should be moved to the new hdfs-native-client > module as HDFS-9170 has landed in trunk and branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9223) Code cleanup for DatanodeDescriptor and HeartbeatManager
[ https://issues.apache.org/jira/browse/HDFS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-9223: - Issue Type: Sub-task (was: Bug) Parent: HDFS-8966 > Code cleanup for DatanodeDescriptor and HeartbeatManager > > > Key: HDFS-9223 > URL: https://issues.apache.org/jira/browse/HDFS-9223 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Reporter: Jing Zhao >Assignee: Jing Zhao >Priority: Minor > Fix For: 2.8.0 > > Attachments: HDFS-9223.000.patch, HDFS-9223.001.patch, > HDFS-9223.002.patch, HDFS-9223.003.patch > > > Some code cleanup for {{DatanodeDescriptor}} and {{HeartbeatManager}}. The > changes include: > # Change {{DataDescriptor#isAlive}} and {{DatanodeDescriptor#needKeyUpdate}} > from public to private > # Use EnumMap for {{HeartbeatManager#storageTypeStatesMap}} > # Move the {{isInStartupSafeMode}} out of the namesystem lock in > {{heartbeatCheck}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4015) Safemode should count and report orphaned blocks
[ https://issues.apache.org/jira/browse/HDFS-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959367#comment-14959367 ] Anu Engineer commented on HDFS-4015: [~liuml07] Thanks for looking at the Hadoop QA results. I did look at the test results just to double-check. Two of them are failures related to the globbing change that has already been reverted. The other failures are mostly timing-related and not related to this patch. > Safemode should count and report orphaned blocks > > > Key: HDFS-4015 > URL: https://issues.apache.org/jira/browse/HDFS-4015 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.0.0 >Reporter: Todd Lipcon >Assignee: Anu Engineer > Attachments: HDFS-4015.001.patch, HDFS-4015.002.patch, > HDFS-4015.003.patch, HDFS-4015.004.patch, HDFS-4015.005.patch > > > The safemode status currently reports the number of unique reported blocks > compared to the total number of blocks referenced by the namespace. However, > it does not report the inverse: blocks which are reported by datanodes but > not referenced by the namespace. > In the case that an admin accidentally starts up from an old image, this can > be confusing: safemode and fsck will show "corrupt files", which are the > files which actually have been deleted but got resurrected by restarting from > the old image. This will convince them that they can safely force leave > safemode and remove these files -- after all, they know that those files > should really have been deleted. However, they're not aware that leaving > safemode will also unrecoverably delete a bunch of other block files which > have been orphaned due to the namespace rollback. > I'd like to consider reporting something like: "90 of expected 100 > blocks have been reported. Additionally, 1 blocks have been reported > which do not correspond to any file in the namespace. Forcing exit of > safemode will unrecoverably remove those data blocks" > Whether this statistic is also used for some kind of "inverse safe mode" is > the logical next step, but just reporting it as a warning seems easy enough > to accomplish and worth doing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9220) Reading small file (< 512 bytes) that is open for append fails due to incorrect checksum
[ https://issues.apache.org/jira/browse/HDFS-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959361#comment-14959361 ] Hudson commented on HDFS-9220: -- FAILURE: Integrated in Hadoop-Yarn-trunk #1274 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1274/]) HDFS-9220. Reading small file (< 512 bytes) that is open for append (kihwal: rev c7c36cbd6218f46c33d7fb2f60cd52cb29e6d720) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestFileAppend2.java > Reading small file (< 512 bytes) that is open for append fails due to > incorrect checksum > > > Key: HDFS-9220 > URL: https://issues.apache.org/jira/browse/HDFS-9220 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Bogdan Raducanu >Assignee: Jing Zhao >Priority: Blocker > Fix For: 3.0.0, 2.7.2 > > Attachments: HDFS-9220.000.patch, HDFS-9220.001.patch, > HDFS-9220.002.patch, test2.java > > > Exception: > 2015-10-09 14:59:40 WARN DFSClient:1150 - fetchBlockByteRange(). Got a > checksum exception for /tmp/file0.05355529331575182 at > BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882:0 from > DatanodeInfoWithStorage[10.10.10.10]:5001 > All 3 replicas cause this exception and the read fails entirely with: > BlockMissingException: Could not obtain block: > BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882 > file=/tmp/file0.05355529331575182 > Code to reproduce is attached. > Does not happen in 2.7.0. > Data is read correctly if checksum verification is disabled. > More generally, the failure happens when reading from the last block of a > file and the last block has <= 512 bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9092) Nfs silently drops overlapping write requests and causes data copying to fail
[ https://issues.apache.org/jira/browse/HDFS-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959355#comment-14959355 ] Mingliang Liu commented on HDFS-9092: - s/Do/Does/ > Nfs silently drops overlapping write requests and causes data copying to fail > - > > Key: HDFS-9092 > URL: https://issues.apache.org/jira/browse/HDFS-9092 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.7.1 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Fix For: 2.8.0 > > Attachments: HDFS-9092.001.patch, HDFS-9092.002.patch > > > When NOT using the 'sync' option, NFS writes may issue the following warning: > org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Got an overlapping write > (1248751616, 1249677312), nextOffset=1248752400. Silently drop it now > and the size of data copied via NFS will stay at 1248752400. > What we found happened is: > 1. The write requests from the client are sent asynchronously. > 2. The NFS gateway has a handler that handles the incoming requests by creating an > internal write request structure and putting it into a cache; > 3. In parallel, a separate thread in the NFS gateway takes requests out of the > cache and writes the data to HDFS. > The current offset is how much data has been written by the write thread in > step 3. The detection of overlapping write requests happens in step 2, but it only > checks the write request against the current offset, and trims the request if > necessary. Because the write requests are sent asynchronously, if two > requests are beyond the current offset, and they overlap, the overlap is not detected > and both are put into the cache. This causes the symptom reported in this case > at step 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-3059) ssl-server.xml causes NullPointer
[ https://issues.apache.org/jira/browse/HDFS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959332#comment-14959332 ] Yongjun Zhang commented on HDFS-3059: - Hi [~xiaochen], thanks for working on this issue. I browsed it and have some comments/questions: 1. Change: {code} LOG.warn("IOException caught when getting password, setting password " + "to null. Exception:\"" + ioe.getMessage() + "\"."); {code} to: {code} LOG.warn("Setting password to null since IOException is caught when getting password", ioe); {code} 2. Add a comma to "is specified make sure it is a relative path", making it "is specified, make sure it is a relative path". 3. Would you please explain the reason for the following comment? Maybe add the explanation as an addition to the comment. {code} // This is only needed when starting SNN as a daemon, // and no need to run it if called from shell command. {code} Thanks. > ssl-server.xml causes NullPointer > - > > Key: HDFS-3059 > URL: https://issues.apache.org/jira/browse/HDFS-3059 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, security >Affects Versions: 2.7.1 > Environment: in core-site.xml: > {code:xml} > > hadoop.security.authentication > kerberos > > > hadoop.security.authorization > true > > {code} > in hdfs-site.xml: > {code:xml} > > dfs.https.server.keystore.resource > /etc/hadoop/conf/ssl-server.xml > > > dfs.https.enable > true > > > ...other security props > > {code} >Reporter: Evert Lammerts >Assignee: Xiao Chen >Priority: Minor > Labels: BB2015-05-TBR > Attachments: HDFS-3059.02.patch, HDFS-3059.03.patch, > HDFS-3059.04.patch, HDFS-3059.05.patch, HDFS-3059.patch, HDFS-3059.patch.2 > > > If ssl is enabled (dfs.https.enable) but ssl-server.xml is not available, a > DN will crash during startup while setting up an SSL socket with a > NullPointerException: > {noformat}12/03/07 17:08:36 DEBUG security.Krb5AndCertsSslSocketConnector: > useKerb = false, useCerts = true > jetty.ssl.password : jetty.ssl.keypassword : 12/03/07 17:08:36 INFO > mortbay.log: jetty-6.1.26.cloudera.1 > 12/03/07 17:08:36 INFO mortbay.log: Started > selectchannelconnec...@p-worker35.alley.sara.nl:1006 > 12/03/07 17:08:36 DEBUG security.Krb5AndCertsSslSocketConnector: Creating new > KrbServerSocket for: 0.0.0.0 > 12/03/07 17:08:36 WARN mortbay.log: java.lang.NullPointerException > 12/03/07 17:08:36 WARN mortbay.log: failed > Krb5AndCertsSslSocketConnector@0.0.0.0:50475: java.io.IOException: > !JsseListener: java.lang.NullPointerException > 12/03/07 17:08:36 WARN mortbay.log: failed Server@604788d5: > java.io.IOException: !JsseListener: java.lang.NullPointerException > 12/03/07 17:08:36 INFO mortbay.log: Stopped > Krb5AndCertsSslSocketConnector@0.0.0.0:50475 > 12/03/07 17:08:36 INFO mortbay.log: Stopped > selectchannelconnec...@p-worker35.alley.sara.nl:1006 > 12/03/07 17:08:37 INFO datanode.DataNode: Waiting for threadgroup to exit, > active threads is 0{noformat} > The same happens if I set an absolute path to an existing > dfs.https.server.keystore.resource - in this case the file cannot be found, > but not even a WARN is given. > Since in dfs.https.server.keystore.resource we know we need to have 4 > properties specified (ssl.server.truststore.location, > ssl.server.keystore.location, ssl.server.keystore.password, and > ssl.server.keystore.keypassword), we should check if they are set and throw an > IOException if they are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
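The check the last paragraph of the description asks for is straightforward to sketch. The class and method below are invented for illustration; the four key names are the ones the description lists:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch of the requested validation: fail fast with a clear
// message instead of an NPE when the SSL config resource is missing or
// lacks one of the four required keys.
final class SslConfigValidator {
  private static final String[] REQUIRED_KEYS = {
      "ssl.server.truststore.location",
      "ssl.server.keystore.location",
      "ssl.server.keystore.password",
      "ssl.server.keystore.keypassword"
  };

  private SslConfigValidator() {}

  static void validate(Configuration sslConf) throws IOException {
    for (String key : REQUIRED_KEYS) {
      if (sslConf.get(key) == null) {
        throw new IOException("Required SSL property '" + key
            + "' is missing from the resource named by "
            + "dfs.https.server.keystore.resource");
      }
    }
  }
}
{code}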
[jira] [Commented] (HDFS-9092) Nfs silently drops overlapping write requests and causes data copying to fail
[ https://issues.apache.org/jira/browse/HDFS-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959329#comment-14959329 ] Mingliang Liu commented on HDFS-9092: - It sometimes happens to me that Jenkins does not report findbugs, in which case I have to run it locally to double-check. It would be nice if we could find out why. As to the warning itself, I think the unsynchronized access is read-only, and for LOG/toString purposes. Do it make sense to make the {{offset}} and {{originalCount}} volatile? > Nfs silently drops overlapping write requests and causes data copying to fail > - > > Key: HDFS-9092 > URL: https://issues.apache.org/jira/browse/HDFS-9092 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.7.1 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Fix For: 2.8.0 > > Attachments: HDFS-9092.001.patch, HDFS-9092.002.patch > > > When NOT using the 'sync' option, NFS writes may issue the following warning: > org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Got an overlapping write > (1248751616, 1249677312), nextOffset=1248752400. Silently drop it now > and the size of data copied via NFS will stay at 1248752400. > What we found happened is: > 1. The write requests from the client are sent asynchronously. > 2. The NFS gateway has a handler that handles the incoming requests by creating an > internal write request structure and putting it into a cache; > 3. In parallel, a separate thread in the NFS gateway takes requests out of the > cache and writes the data to HDFS. > The current offset is how much data has been written by the write thread in > step 3. The detection of overlapping write requests happens in step 2, but it only > checks the write request against the current offset, and trims the request if > necessary. Because the write requests are sent asynchronously, if two > requests are beyond the current offset, and they overlap, the overlap is not detected > and both are put into the cache. This causes the symptom reported in this case > at step 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
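The fix implied by the description is to test a new request not only against the committed offset but also against the requests already sitting in the cache. A sketch of that check using an interval map; this is illustrative only, and the real OpenFileCtx keeps richer per-request state:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical overlap check: overlapping requests can both be beyond the
// current offset, so compare against cached, not-yet-written requests too.
class PendingWrites {
  // start offset -> length of each cached, not-yet-written request
  private final ConcurrentNavigableMap<Long, Integer> pending =
      new ConcurrentSkipListMap<>();

  /** Returns true if [offset, offset+len) overlaps a cached request. */
  boolean overlapsPending(long offset, int len) {
    // Closest cached request starting at or before this one
    Map.Entry<Long, Integer> floor = pending.floorEntry(offset);
    if (floor != null && floor.getKey() + floor.getValue() > offset) {
      return true;
    }
    // Closest cached request starting after this one
    Long ceiling = pending.ceilingKey(offset);
    return ceiling != null && ceiling < offset + len;
  }

  void add(long offset, int len) { pending.put(offset, len); }
  void remove(long offset) { pending.remove(offset); }
}
{code}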
[jira] [Updated] (HDFS-9250) LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty
[ https://issues.apache.org/jira/browse/HDFS-9250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Chen updated HDFS-9250: Status: Patch Available (was: Open) > LocatedBlock#addCachedLoc may throw ArrayStoreException when cache is empty > --- > > Key: HDFS-9250 > URL: https://issues.apache.org/jira/browse/HDFS-9250 > Project: Hadoop HDFS > Issue Type: Bug > Components: HDFS >Reporter: Xiao Chen >Assignee: Xiao Chen > Attachments: HDFS-9250.001.patch > > > We may see the following exception: > {noformat} > java.lang.ArrayStoreException > at java.util.ArrayList.toArray(ArrayList.java:389) > at > org.apache.hadoop.hdfs.protocol.LocatedBlock.addCachedLoc(LocatedBlock.java:205) > at > org.apache.hadoop.hdfs.server.namenode.CacheManager.setCachedLocations(CacheManager.java:907) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1974) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873) > {noformat} > The cause is that in LocatedBlock.java, when {{addCachedLoc}} is called: > - The passed-in parameter {{loc}}, which is of type {{DatanodeDescriptor}}, is added > to {{cachedList}} > - {{cachedList}} was assigned from {{EMPTY_LOCS}}, which is of type > {{DatanodeInfoWithStorage}}. > Both {{DatanodeDescriptor}} and {{DatanodeInfoWithStorage}} are subclasses of > {{DatanodeInfo}} but do not inherit from each other, resulting in the > ArrayStoreException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
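The exception is a property of {{ArrayList#toArray(T[])}}: when the passed array is too small, the result array takes the runtime component type of the argument. A self-contained demonstration of the same failure mode with stand-in classes (not the real HDFS types):
{code:java}
import java.util.ArrayList;
import java.util.List;

// Minimal demonstration: toArray() allocates an array whose runtime type
// comes from the argument, and storing a sibling subclass into it throws
// ArrayStoreException, just like the stack trace above.
public class ArrayStoreDemo {
  static class DatanodeInfo {}
  static class DatanodeDescriptor extends DatanodeInfo {}        // NN-side type
  static class DatanodeInfoWithStorage extends DatanodeInfo {}   // client-side type

  static final DatanodeInfoWithStorage[] EMPTY_LOCS =
      new DatanodeInfoWithStorage[0];

  public static void main(String[] args) {
    List<DatanodeInfo> cachedList = new ArrayList<>();
    cachedList.add(new DatanodeDescriptor());
    // toArray builds a DatanodeInfoWithStorage[1] because EMPTY_LOCS is too
    // small, then fails to store the DatanodeDescriptor element into it.
    DatanodeInfo[] locs = cachedList.toArray(EMPTY_LOCS); // ArrayStoreException
    System.out.println(locs.length);                      // never reached
  }
}
{code}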