[jira] [Created] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
Chengbing Liu created HDFS-16872: Summary: Fix log throttling by declaring LogThrottlingHelper as static members Key: HDFS-16872 URL: https://issues.apache.org/jira/browse/HDFS-16872 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.3.4 Reporter: Chengbing Liu In our production cluster with Observer NameNode enabled, we have plenty of logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The {{LogThrottlingHelper}} doesn't seem to work. {noformat} 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 1.0, total load time 0.0 ms 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], 
ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 5.0, total load time 1.0 ms {noformat} After some digging, I found that the cause is that the {{LogThrottlingHelper}}s are declared as instance variables of the enclosing classes, including {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. Therefore the logging frequency is not limited across different instances. For classes with only a limited number of instances, such as {{FSImage}}, this is fine. For others whose instances are created frequently, such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it results in plenty of logs. This can be fixed by declaring the {{LogThrottlingHelper}}s as static members. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
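The proposed fix, in outline: move the throttle state from the object to the class. Below is a minimal, self-contained Java sketch (toy {{Throttle}} and {{Loader}} classes, not Hadoop's actual {{LogThrottlingHelper}} API) of why a per-instance throttle cannot limit logging across frequently created instances, while a static one can:

```java
// Toy illustration (hypothetical classes, not Hadoop's LogThrottlingHelper):
// a per-instance throttle is reset whenever a new enclosing object is
// constructed, so frequently created classes like FSEditLogLoader log on
// every construction; a static throttle is shared across all instances.
class Throttle {
    private final long periodMs;
    private long lastAllowedMs = Long.MIN_VALUE;

    Throttle(long periodMs) { this.periodMs = periodMs; }

    // Returns true at most once per periodMs, judged against a caller-supplied clock.
    synchronized boolean shouldLog(long nowMs) {
        if (lastAllowedMs == Long.MIN_VALUE || nowMs - lastAllowedMs >= periodMs) {
            lastAllowedMs = nowMs;
            return true;
        }
        return false;
    }
}

class Loader {
    private static final Throttle SHARED = new Throttle(5_000);  // one window for all Loaders
    private final Throttle perInstance = new Throttle(5_000);    // fresh window per Loader

    boolean logViaStatic(long nowMs)   { return SHARED.shouldLog(nowMs); }
    boolean logViaInstance(long nowMs) { return perInstance.shouldLog(nowMs); }
}

public class Demo {
    static int[] run() {
        int allowedStatic = 0, allowedInstance = 0;
        // Ten short-lived loaders created at the same instant, as when many
        // edit batches are loaded in quick succession.
        for (int i = 0; i < 10; i++) {
            Loader l = new Loader();
            if (l.logViaStatic(0)) allowedStatic++;
            if (l.logViaInstance(0)) allowedInstance++;
        }
        return new int[] { allowedStatic, allowedInstance };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("static allowed: " + r[0] + ", per-instance allowed: " + r[1]);
    }
}
```

Because {{FSEditLogLoader}} and {{RedundantEditLogInputStream}} objects are constructed for every edit-loading round, only the static member yields one shared suppression window.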
[jira] [Commented] (HDFS-13791) Limit logging frequency of edit tail related statements
[ https://issues.apache.org/jira/browse/HDFS-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624119#comment-17624119 ] Chengbing Liu commented on HDFS-13791: -- In our production cluster with Observer NameNode enabled, we have plenty of logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The {{LogThrottlingHelper}} doesn't seem to work. {noformat} 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 1.0, total load time 0.0 ms 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
17686250693]' to transaction ID 17686250689 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 5.0, total load time 1.0 ms {noformat} After some digging, I found that the cause is that the {{LogThrottlingHelper}}s are declared as instance variables of the enclosing classes, including {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. Therefore the logging frequency is not limited across different instances. For classes with only a limited number of instances, such as {{FSImage}}, this is fine. For others whose instances are created continuously, such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it results in plenty of logs. [~xkrogen] How about making them static variables? > Limit logging frequency of edit tail related statements > --- > > Key: HDFS-13791 > URL: https://issues.apache.org/jira/browse/HDFS-13791 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, qjm >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: HDFS-12943, 3.3.0 > > Attachments: HDFS-13791-HDFS-12943.000.patch, > HDFS-13791-HDFS-12943.001.patch, HDFS-13791-HDFS-12943.002.patch, > HDFS-13791-HDFS-12943.003.patch, HDFS-13791-HDFS-12943.004.patch, > HDFS-13791-HDFS-12943.005.patch, HDFS-13791-HDFS-12943.006.patch > > > There are a number of log statements that occur every time new edits are > tailed by a Standby NameNode. When edits are tailed only on the order of > every tens of seconds, this is fine. 
With the work in HDFS-13150, however, > edits may be tailed every few milliseconds, which can flood the logs with > tailing-related statements. We should throttle it to limit it to printing at > most, say, once per 5 seconds. > We can implement logic similar to that used in HDFS-10713. This may be > slightly more tricky since the log statements are distributed across a few > classes.
[jira] [Commented] (HDFS-8708) DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies
[ https://issues.apache.org/jira/browse/HDFS-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900527#comment-16900527 ] Chengbing Liu commented on HDFS-8708: - [~ayushtkn] [~shv] Could you please review the change if you have time? Thanks! > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies > -- > > Key: HDFS-8708 > URL: https://issues.apache.org/jira/browse/HDFS-8708 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0, 3.1.2 >Reporter: Jitendra Nath Pandey >Assignee: Chengbing Liu >Priority: Critical > Attachments: HDFS-8708.001.patch, HDFS-8708.002.patch > > > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies to > ensure fast failover. Otherwise, dfsclient retries the NN which is no longer > active and delays the failover.
[jira] [Commented] (HDFS-8708) DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies
[ https://issues.apache.org/jira/browse/HDFS-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896946#comment-16896946 ] Chengbing Liu commented on HDFS-8708: - Uploaded HDFS-8708.002.patch to fix a checkstyle issue. > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies > -- > > Key: HDFS-8708 > URL: https://issues.apache.org/jira/browse/HDFS-8708 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0, 3.1.2 >Reporter: Jitendra Nath Pandey >Assignee: Chengbing Liu >Priority: Critical > Attachments: HDFS-8708.001.patch, HDFS-8708.002.patch > > > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies to > ensure fast failover. Otherwise, dfsclient retries the NN which is no longer > active and delays the failover.
[jira] [Updated] (HDFS-8708) DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies
[ https://issues.apache.org/jira/browse/HDFS-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-8708: Attachment: HDFS-8708.002.patch > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies > -- > > Key: HDFS-8708 > URL: https://issues.apache.org/jira/browse/HDFS-8708 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0, 3.1.2 >Reporter: Jitendra Nath Pandey >Assignee: Chengbing Liu >Priority: Critical > Attachments: HDFS-8708.001.patch, HDFS-8708.002.patch > > > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies to > ensure fast failover. Otherwise, dfsclient retries the NN which is no longer > active and delays the failover.
[jira] [Updated] (HDFS-8708) DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies
[ https://issues.apache.org/jira/browse/HDFS-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-8708: Target Version/s: (was: 2.8.0) > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies > -- > > Key: HDFS-8708 > URL: https://issues.apache.org/jira/browse/HDFS-8708 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0, 3.1.2 >Reporter: Jitendra Nath Pandey >Assignee: Chengbing Liu >Priority: Critical > Attachments: HDFS-8708.001.patch > > > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies to > ensure fast failover. Otherwise, dfsclient retries the NN which is no longer > active and delays the failover.
[jira] [Updated] (HDFS-8708) DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies
[ https://issues.apache.org/jira/browse/HDFS-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-8708: Attachment: HDFS-8708.001.patch > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies > -- > > Key: HDFS-8708 > URL: https://issues.apache.org/jira/browse/HDFS-8708 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jitendra Nath Pandey >Assignee: Brahma Reddy Battula >Priority: Critical > Attachments: HDFS-8708.001.patch > > > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies to > ensure fast failover. Otherwise, dfsclient retries the NN which is no longer > active and delays the failover.
[jira] [Updated] (HDFS-8708) DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies
[ https://issues.apache.org/jira/browse/HDFS-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-8708: Assignee: Chengbing Liu (was: Brahma Reddy Battula) Affects Version/s: 3.2.0 3.1.2 Status: Patch Available (was: Reopened) > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies > -- > > Key: HDFS-8708 > URL: https://issues.apache.org/jira/browse/HDFS-8708 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.2, 3.2.0 >Reporter: Jitendra Nath Pandey >Assignee: Chengbing Liu >Priority: Critical > Attachments: HDFS-8708.001.patch > > > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies to > ensure fast failover. Otherwise, dfsclient retries the NN which is no longer > active and delays the failover.
[jira] [Reopened] (HDFS-8708) DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies
[ https://issues.apache.org/jira/browse/HDFS-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu reopened HDFS-8708: - I have a different opinion, so I'm reopening this issue. In our production environment, we have both HA and non-HA clusters, and a client should be able to access both kinds of clusters. This is our dilemma. With dfs.client.retry.policy.enabled = true, we currently see: 1) HA nameservice: if nn1 shuts down, the client still attempts to connect to nn1 many times (11 min by default) before failover, which is undesired 2) non-HA namenode: the client keeps retrying the connection for 11 min by default With dfs.client.retry.policy.enabled = false, we currently see: 1) HA nameservice: fast failover, everything works fine 2) non-HA namenode: no retry is made on connection failure, which is undesired We would like fast failover in HA mode as well as multiple retries in non-HA mode, and we cannot achieve this with the current implementation. Proposed code change: in {{NameNodeProxiesClient.createProxyWithAlignmentContext}}, {{defaultPolicy}} should not be passed to {{ClientProtocol}} when {{withRetries}} is false (HA mode). Instead, {{TRY_ONCE_THEN_FAIL}} can be used to ensure fast failover. > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies > -- > > Key: HDFS-8708 > URL: https://issues.apache.org/jira/browse/HDFS-8708 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jitendra Nath Pandey >Assignee: Brahma Reddy Battula >Priority: Critical > > DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies to > ensure fast failover. Otherwise, dfsclient retries the NN which is no longer > active and delays the failover.
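The proposed selection logic can be sketched as follows (hypothetical names and a toy enum; the real patch would live in {{NameNodeProxiesClient}} and use {{RetryPolicies.TRY_ONCE_THEN_FAIL}} from Hadoop's {{org.apache.hadoop.io.retry}} package):

```java
// Hypothetical sketch of the proposed policy selection, not the actual
// NameNodeProxiesClient code. With HA, the failover proxy provider already
// retries across NameNodes, so the per-NameNode proxy should fail fast.
enum RetryDecision { TRY_ONCE_THEN_FAIL, RETRY_UP_TO_MAX }

public class PolicyChooser {
    static RetryDecision choose(boolean haEnabled, boolean clientRetryPolicyEnabled) {
        if (haEnabled) {
            // Ignore dfs.client.retry.policy.enabled: retrying the same
            // (possibly dead) NameNode only delays failover.
            return RetryDecision.TRY_ONCE_THEN_FAIL;
        }
        // Non-HA: honor the configured retry policy so transient connection
        // failures are retried instead of failing immediately.
        return clientRetryPolicyEnabled ? RetryDecision.RETRY_UP_TO_MAX
                                        : RetryDecision.TRY_ONCE_THEN_FAIL;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true));   // HA: fail fast regardless of the flag
        System.out.println(choose(false, true));  // non-HA: keep retrying
    }
}
```

This resolves the dilemma above: HA clients fail over quickly, while non-HA clients still retry.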
[jira] [Commented] (HDFS-7048) Incorrect Dispatcher#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15991943#comment-15991943 ] Chengbing Liu commented on HDFS-7048: - Thanks [~shv] > Incorrect Dispatcher#Source wait/notify leads to early termination > -- > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover >Affects Versions: 2.6.0, 2.7.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Commented] (HDFS-7048) Incorrect Dispatcher#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972254#comment-15972254 ] Chengbing Liu commented on HDFS-7048: - Somehow I cannot unassign myself, can someone help? > Incorrect Dispatcher#Source wait/notify leads to early termination > -- > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover >Affects Versions: 2.6.0, 2.7.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Commented] (HDFS-7048) Incorrect Dispatcher#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972246#comment-15972246 ] Chengbing Liu commented on HDFS-7048: - [~Weizhan Zeng], I currently have no test environment with the latest code, sorry about this. Feel free to take over. > Incorrect Dispatcher#Source wait/notify leads to early termination > -- > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover >Affects Versions: 2.6.0, 2.7.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Updated] (HDFS-7048) Incorrect Dispatcher#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7048: Status: Open (was: Patch Available) > Incorrect Dispatcher#Source wait/notify leads to early termination > -- > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover >Affects Versions: 2.7.0, 2.6.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Commented] (HDFS-8825) Enhancements to Balancer
[ https://issues.apache.org/jira/browse/HDFS-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133600#comment-15133600 ] Chengbing Liu commented on HDFS-8825: - [~szetszwo], I just added HDFS-7048 as a sub-task, since the dispatcher's wait/notify issue has not been addressed in the above tasks. The attached patch in HDFS-7048 will of course need rebasing, but the idea is still useful in my opinion. Please correct me if I missed something. > Enhancements to Balancer > > > Key: HDFS-8825 > URL: https://issues.apache.org/jira/browse/HDFS-8825 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > > This is an umbrella JIRA to enhance Balancer. The goal is to make it run > faster, more efficiently and improve its usability.
[jira] [Updated] (HDFS-7048) Incorrect Dispatcher#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7048: Summary: Incorrect Dispatcher#Source wait/notify leads to early termination (was: Incorrect Balancer#Source wait/notify leads to early termination) > Incorrect Dispatcher#Source wait/notify leads to early termination > -- > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover >Affects Versions: 2.6.0, 2.7.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Updated] (HDFS-7048) Incorrect Balancer#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7048: Issue Type: Sub-task (was: Bug) Parent: HDFS-8825 > Incorrect Balancer#Source wait/notify leads to early termination > > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover >Affects Versions: 2.6.0, 2.7.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Updated] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode
[ https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-9276: Fix Version/s: (was: 3.0.0) Affects Version/s: 2.7.1 Status: Patch Available (was: Open) > Failed to Update HDFS Delegation Token for long running application in HA mode > -- > > Key: HDFS-9276 > URL: https://issues.apache.org/jira/browse/HDFS-9276 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs, ha, security >Affects Versions: 2.7.1 >Reporter: Liangliang Gu >Assignee: Liangliang Gu > Attachments: HDFS-9276.01.patch, debug1.PNG, debug2.PNG > > > The Scenario is as follows: > 1. NameNode HA is enabled. > 2. Kerberos is enabled. > 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with > NameNode. > 4. We want to update the HDFS Delegation Token for long running applications. > HDFS Client will generate private tokens for each NameNode. When we update > the HDFS Delegation Token, these private tokens will not be updated, which > will cause the tokens to expire. 
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
>
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
>     val keytab = "/path/to/keytab/xxx.keytab"
>     val principal = "x...@abc.com"
>     val creds1 = new org.apache.hadoop.security.Credentials()
>     val ugi1 = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>     ugi1.doAs(new PrivilegedExceptionAction[Void] {
>       // Get a copy of the credentials
>       override def run(): Void = {
>         val fs = FileSystem.get(new Configuration())
>         fs.addDelegationTokens("test", creds1)
>         null
>       }
>     })
>     val ugi = UserGroupInformation.createRemoteUser("test")
>     ugi.addCredentials(creds1)
>     ugi.doAs(new PrivilegedExceptionAction[Void] {
>       // Get a copy of the credentials
>       override def run(): Void = {
>         var i = 0
>         while (true) {
>           val creds1 = new org.apache.hadoop.security.Credentials()
>           val ugi1 = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>           ugi1.doAs(new PrivilegedExceptionAction[Void] {
>             // Get a copy of the credentials
>             override def run(): Void = {
>               val fs = FileSystem.get(new Configuration())
>               fs.addDelegationTokens("test", creds1)
>               null
>             }
>           })
>           UserGroupInformation.getCurrentUser.addCredentials(creds1)
>           val fs = FileSystem.get(new Configuration())
>           i += 1
>           println()
>           println(i)
>           println(fs.listFiles(new Path("/user"), false))
>           Thread.sleep(60 * 1000)
>         }
>         null
>       }
>     })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration on the NameNode:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occur after 3 minutes. 
> The stacktrace is:
> {code}
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
> at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
> at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
> at org.a
[jira] [Commented] (HDFS-7785) Improve diagnostics information for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968329#comment-14968329 ] Chengbing Liu commented on HDFS-7785: - [~yzhangal], please refer to HDFS-7798, where the standby namenode failed to perform a checkpoint. > Improve diagnostics information for HttpPutFailedException > -- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Fix For: 2.7.0 > > Attachments: HDFS-7785.01.patch, HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging.
[jira] [Commented] (HDFS-7048) Incorrect Balancer#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587803#comment-14587803 ] Chengbing Liu commented on HDFS-7048: - The failed test is unrelated. Perhaps [~andrew.wang] can take a look at the patch? Thanks. > Incorrect Balancer#Source wait/notify leads to early termination > > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer & mover >Affects Versions: 2.6.0, 2.7.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Updated] (HDFS-7048) Incorrect Balancer#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7048: Affects Version/s: 2.7.0 > Incorrect Balancer#Source wait/notify leads to early termination > > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer & mover >Affects Versions: 2.6.0, 2.7.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Updated] (HDFS-7048) Incorrect Balancer#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7048: Target Version/s: (was: 2.6.0) > Incorrect Balancer#Source wait/notify leads to early termination > > > Key: HDFS-7048 > URL: https://issues.apache.org/jira/browse/HDFS-7048 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer & mover >Affects Versions: 2.6.0, 2.7.0 >Reporter: Andrew Wang >Assignee: Chengbing Liu > Attachments: HDFS-7048.01.patch > > > Split off from HDFS-6621. The Balancer attempts to wake up scheduler threads > early as sources finish, but the synchronization with wait and notify is > incorrect. This ticks the failure count, which can lead to early termination.
[jira] [Commented] (HDFS-7048) Incorrect Balancer#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582869#comment-14582869 ]

Chengbing Liu commented on HDFS-7048:
-------------------------------------

Here is a brief explanation of the patch.

On our production cluster, the balancer worked slowly. For an iteration planning to move ~500GB of data, the actual moved data would be ~5GB. After some digging, I found that {{Source#dispatchBlocks()}} always exits prematurely at the following code, where I added logging to inform the user of the anomaly.

{code}
// jump out of the while-loop after 5 iterations with no pending moves
if (noPendingMoveIteration >= MAX_NO_PENDING_MOVE_ITERATIONS) {
  resetScheduledSize();
}
{code}

This happens because we use a global {{Dispatcher.this}} for wait and notify, which wakes up all the unrelated {{Source}}s, even those that did not have any {{PendingMove}} finished. The correct approach is to wait and notify on the {{StorageGroup}}, both source and target, since the DataXceiver shares its threads for sending and receiving.

As for the wait timeout, I think we might increase it a little to prevent timing out too often. We are actually using 60 seconds on our production cluster now without problems. However, as I increase the timeout, some test cases fail slowly or even time out. These test cases include some obviously unmovable cases, which in my opinion should exit immediately. But we can fix that later.
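The per-group wait/notify idea discussed in the comment above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Balancer code; the names {{StorageGroup}}, {{waitForPendingMove}} and {{completeMove}} are hypothetical stand-ins. Each group carries its own monitor, so completing a move wakes only the threads waiting on that group, instead of every source waiting on a shared global object.

```java
public class PerGroupNotify {
    // Hypothetical stand-in for Dispatcher's StorageGroup: each instance
    // is its own monitor, so notification is scoped to one group.
    static class StorageGroup {
        private boolean moveDone = false;

        synchronized void waitForPendingMove(long timeoutMs) throws InterruptedException {
            if (!moveDone) {
                wait(timeoutMs); // waits on THIS group only, not a global lock
            }
        }

        synchronized void completeMove() {
            moveDone = true;
            notifyAll(); // wakes only threads waiting on THIS group
        }

        synchronized boolean isDone() {
            return moveDone;
        }
    }

    public static void main(String[] args) throws Exception {
        StorageGroup g1 = new StorageGroup();
        StorageGroup g2 = new StorageGroup();

        Thread scheduler = new Thread(() -> {
            try {
                g1.waitForPendingMove(5000);
            } catch (InterruptedException ignored) {
            }
        });
        scheduler.start();

        g1.completeMove(); // only g1's waiter is released; g2 is untouched
        scheduler.join();

        System.out.println(g1.isDone() + " " + g2.isDone()); // true false
    }
}
```

With a single shared monitor (the pre-patch behavior), the {{notifyAll}} would instead wake every source, including those whose own moves had not finished, ticking their no-pending-move counters.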
[jira] [Updated] (HDFS-7048) Incorrect Balancer#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-7048:
--------------------------------
    Attachment: HDFS-7048.01.patch

The uploaded patch waits/notifies on the source and target, instead of on {{Dispatcher.this}}.
[jira] [Updated] (HDFS-7048) Incorrect Balancer#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-7048:
--------------------------------
    Status: Patch Available  (was: Open)
[jira] [Assigned] (HDFS-7048) Incorrect Balancer#Source wait/notify leads to early termination
[ https://issues.apache.org/jira/browse/HDFS-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu reassigned HDFS-7048:
-----------------------------------
    Assignee: Chengbing Liu
[jira] [Commented] (HDFS-8113) Add check for null BlockCollection pointers in BlockInfoContiguous structures
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543749#comment-14543749 ]

Chengbing Liu commented on HDFS-8113:
-------------------------------------

Just an update: I have done a NameNode failover and the NPE never appeared again. So I think it is an issue with the active NN's in-memory data structure. The fsimage is OK.

> Add check for null BlockCollection pointers in BlockInfoContiguous structures
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-8113
>                 URL: https://issues.apache.org/jira/browse/HDFS-8113
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0, 2.7.0
>            Reporter: Chengbing Liu
>            Assignee: Chengbing Liu
>              Labels: BB2015-05-TBR
>             Fix For: 2.8.0
>
>         Attachments: HDFS-8113.02.patch, HDFS-8113.patch
>
> The following copy constructor can throw a NullPointerException if {{bc}} is null.
> {code}
> protected BlockInfoContiguous(BlockInfoContiguous from) {
>   this(from, from.bc.getBlockReplication());
>   this.bc = from.bc;
> }
> {code}
> We have observed that some DataNodes keep failing to send block reports to the
> NameNode. The stacktrace is as follows. Though we are not using the latest
> version, the problem still exists.
> {quote}
> 2015-03-08 19:28:13,442 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.<init>(BlockInfo.java:80)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockToMarkCorrupt.<init>(BlockManager.java:1696)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.checkReplicaCorrupt(BlockManager.java:2185)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReportedBlock(BlockManager.java:2047)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiff(BlockManager.java:1950)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1823)
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1750)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1069)
>   at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152)
>   at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26382)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1623)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> {quote}
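The failure mode in the quoted copy constructor, and the defensive null check the issue title proposes, can be modeled with a small self-contained sketch. The stub classes below ({{BlockInfoStub}}, {{BlockCollectionStub}}) are hypothetical stand-ins, not the real {{BlockInfoContiguous}}/{{BlockCollection}} classes.

```java
// Hypothetical stand-in for BlockCollection (the file owning a block).
class BlockCollectionStub {
    short getBlockReplication() {
        return 3;
    }
}

// Hypothetical stand-in for BlockInfoContiguous.
class BlockInfoStub {
    BlockCollectionStub bc;
    short replication;

    BlockInfoStub(BlockCollectionStub bc, short replication) {
        this.bc = bc;
        this.replication = replication;
    }

    // The original form, this(from, from.bc.getBlockReplication()),
    // dereferences from.bc and throws an NPE for an orphan block whose
    // bc is null. Reading the replication defensively avoids that.
    BlockInfoStub(BlockInfoStub from) {
        this(from.bc, from.bc != null ? from.bc.getBlockReplication() : 0);
    }
}

public class CopyCtorDemo {
    public static void main(String[] args) {
        // An orphan block: bc == null, as observed in the blocksMap.
        BlockInfoStub orphan = new BlockInfoStub(null, (short) 0);
        BlockInfoStub copy = new BlockInfoStub(orphan); // no NPE with the check
        System.out.println(copy.bc == null);
    }
}
```

The sketch only shows why the null check prevents the crash in the block-report path; the deeper question of how an orphan block got into the NameNode's in-memory map is tracked separately (HDFS-8330).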
[jira] [Commented] (HDFS-8113) Add check for null BlockCollection pointers in BlockInfoContiguous structures
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536283#comment-14536283 ]

Chengbing Liu commented on HDFS-8113:
-------------------------------------

Thanks Colin.
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530684#comment-14530684 ]

Chengbing Liu commented on HDFS-8113:
-------------------------------------

Hi [~walter.k.su], I haven't tried restarting or failing over the NN yet. I have analyzed the fsimage with the oiv tool, and there are no orphan blocks, so the fsimage looks fine. The only possibility I can think of is that the active NN has a problem with its in-memory data structure. I will do a NN failover shortly and see if the problem vanishes.
[jira] [Updated] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated HDFS-8113:
--------------------------------
    Affects Version/s: 2.7.0
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529849#comment-14529849 ]

Chengbing Liu commented on HDFS-8113:
-------------------------------------

Created HDFS-8330 for further tracking. [~cmccabe] Would you mind committing this?
[jira] [Created] (HDFS-8330) BlockInfoContiguous in blocksMap can have null BlockCollection
Chengbing Liu created HDFS-8330:
-----------------------------------

             Summary: BlockInfoContiguous in blocksMap can have null BlockCollection
                 Key: HDFS-8330
                 URL: https://issues.apache.org/jira/browse/HDFS-8330
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.7.0
            Reporter: Chengbing Liu

In the blocksMap, we have seen situations where some {{BlockInfoContiguous}} instances have {{BlockCollection == null}}. This indicates orphan blocks which do not belong to any file. See HDFS-8113 for more discussion.
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501431#comment-14501431 ]

Chengbing Liu commented on HDFS-8113:
-------------------------------------

Yes, indeed. It is too hard to analyze the issue without the stacktrace. Maybe we can fix the copy constructor first and leave further investigation of the root cause for later?
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499804#comment-14499804 ]

Chengbing Liu commented on HDFS-8113:
-------------------------------------

[~vinayrpet] Yes, the genstamp on all other nodes is 76017688. The stacktrace I gave in the description was wrong, I believe. The current stacktrace is missing due to a JVM optimization.
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499591#comment-14499591 ]

Chengbing Liu commented on HDFS-8113:
-------------------------------------

Thanks [~vinayrpet] for your advice! I got the following debug logs.

{quote}
2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1143745403_70011665 on 10.153.80.84:1004 size 2631763 replicaState = FINALIZED
2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: In memory blockUCState = COMPLETE
2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1185006557_111278782 on 10.153.80.84:1004 size 19005434 replicaState = FINALIZED
2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1189413471_115690616 on 10.153.80.84:1004 size 99678737 replicaState = FINALIZED
2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1171261663_97530254 on 10.153.80.84:1004 size 13847 replicaState = FINALIZED
2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1149751102_76017688 on 10.153.80.84:1004 size 6702 replicaState = FINALIZED
2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: In memory blockUCState = COMPLETE
2015-04-17 15:38:54,801 WARN org.apache.hadoop.ipc.Server: IPC Server handler 109 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReport from 10.153.80.84:38504 Call#4258262 Retry#0
java.lang.NullPointerException
{quote}

The stacktrace is missing due to the JVM's default optimization: {{OmitStackTraceInFastThrow}} is on by default, and I didn't unset it. The JIT recompiles a method after it has thrown the same exception too many times, omitting the stacktrace thereafter. The stacktrace in the issue description was taken from a DN a month ago.
From the above logs, it is a FINALIZED block in a report that caused the NPE, so the stacktrace in the description is incorrect. Really sorry for that.

Then I checked the last block blk_1149751102_76017688 with oiv against the fsimage. The file is OK; I can download it through the FS shell. I also checked all three DNs containing this block, and they all have the same file, genstamp and meta. It seems the active NameNode is holding incorrect information on this block.
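For anyone hitting the same missing-stacktrace symptom: HotSpot's {{-XX:-OmitStackTraceInFastThrow}} flag disables the fast-throw optimization described above, so frequently thrown implicit exceptions keep their full traces. A hedged sketch of how it might be applied to the NameNode via hadoop-env.sh (the exact env var and file location can vary across Hadoop versions and deployments):

```shell
# Illustrative hadoop-env.sh fragment: keep full stack traces for hot
# implicit exceptions (e.g. the NPE in the block report path) by
# disabling the JIT's fast-throw optimization on the NameNode JVM.
export HADOOP_NAMENODE_OPTS="-XX:-OmitStackTraceInFastThrow ${HADOOP_NAMENODE_OPTS}"
```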
[jira] [Updated] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-8113: Attachment: HDFS-8113.02.patch Added a unit test for the copy constructor. I suggest dealing with null-checks in another JIRA, since there might be some discussions on how to handle these "null" situations.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497606#comment-14497606 ] Chengbing Liu commented on HDFS-8113: - Hi [~qwertymaniac] and [~atm], this is one of the test sequences I tried yesterday, but I was still unable to reproduce the issue. The problem is that if you delete the file, the block will not be in {{blocksMap}}, so we won't be able to reproduce it. To reproduce this, we must make sure that the {{blockInfo}} is in {{blocksMap}} and {{blockInfo.bc == null}}. I tried several test sequences with no luck. I also tried cleaning the rbw directory and restarting the DataNode; however, the problem still exists. Maybe you have ideas about this? And [~cmccabe], are you suggesting that the patch here is OK, or should we additionally check nullity for each {{storedBlock.getBlockCollection()}}?
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496334#comment-14496334 ] Chengbing Liu commented on HDFS-8113: - [~vinayrpet] Actually, whenever I start the problematic DataNode, an NPE happens in every block report. That doesn't seem to be a transient problem as you mentioned. Is it possible that the file was deleted without removal of its blocks?
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496183#comment-14496183 ] Chengbing Liu commented on HDFS-8113: - Hi [~atm] and [~cmccabe], from the stacktrace we know that the {{reportedState}} is RBW or RWR, and the condition {{storedBlock.getGenerationStamp() != reported.getGenerationStamp()}} is satisfied. Since {{storedBlock}} is an entry in {{blocksMap}}, the file/block should not have been deleted. I did some tests using MiniDFSCluster. The result is as follows: - If a file is deleted, then its {{BlockInfo}} is removed from {{blocksMap}}. - If a file is not deleted, then {{BlockInfo.bc}} is the file, which cannot be null. I'm wondering whether a block can still exist in {{blocksMap}} without belonging to any file. Could you kindly explain this? Thanks!
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491960#comment-14491960 ] Chengbing Liu commented on HDFS-8113: - The following code in {{BlockManager#processReportedBlock}} returns a {{BlockInfoContiguous}} whose {{BlockCollection}} is {{null}}: {code} BlockInfoContiguous storedBlock = blocksMap.getStoredBlock(block); {code} There are two methods that can add entries to {{blocksMap}}: - In {{BlocksMap#addBlockCollection(BlockInfoContiguous b, BlockCollection bc)}}, we should check whether {{bc}} is {{null}}. - In {{BlocksMap#replaceBlock(BlockInfoContiguous newBlock)}}, we should check whether {{newBlock.getBlockCollection()}} is {{null}}. Both methods are called from many places. To get more debug information, I think we should at least log a WARN or ERROR if the {{BlockCollection}} happens to be {{null}}.
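The defensive WARN logging suggested in the comment can be sketched as follows. This is a hypothetical, simplified analogue (not the real {{BlocksMap}}, which is backed by a GSet and uses Hadoop's logging); a plain list stands in for a WARN-level logger:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified analogue of BlocksMap#addBlockCollection; the
// warnings list stands in for a WARN-level logger.
class BlocksMapSketch {
    private final Map<Long, String> blockToCollection = new HashMap<>();
    final List<String> warnings = new ArrayList<>();

    void addBlockCollection(long blockId, String collection) {
        if (collection == null) {
            // Surface the broken invariant at insertion time, while the
            // offending call site is still on the stack, instead of letting
            // it resurface later as an NPE during a block report.
            warnings.add("Block " + blockId + " added with null BlockCollection");
        }
        blockToCollection.put(blockId, collection);
    }
}
```

Logging at insertion time would narrow down which of the many call sites of {{addBlockCollection}} / {{replaceBlock}} leaves the orphaned entry behind.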
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490776#comment-14490776 ] Chengbing Liu commented on HDFS-8113: - Aaron, thanks for the clarification. I agree with you that we should find out what causes the {{BlockCollection}} to be {{null}}. I will look into this shortly. In my opinion, we should divide the issue into two: the problem with {{BlockInfoContiguous}} itself, and the probable misuse of it. As for the former, {{BlockInfoContiguous}} cannot guarantee that callers of the copy constructor have updated the {{BlockCollection}} beforehand; the constructor appears in the earliest commit I can see on GitHub, HADOOP-7560 on Aug 25, 2011. The second problem, the misuse of {{BlockInfoContiguous}}, might have been introduced recently. Should we deal with it in another JIRA?
[jira] [Updated] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-8113: Affects Version/s: 2.7.0
[jira] [Updated] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-8113: Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-8113: Attachment: HDFS-8113.patch Uploaded a patch to fix this.
[jira] [Created] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
Chengbing Liu created HDFS-8113: --- Summary: NullPointerException in BlockInfoContiguous causes block report failure Key: HDFS-8113 URL: https://issues.apache.org/jira/browse/HDFS-8113 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu The following copy constructor can throw NullPointerException if {{bc}} is null. {code} protected BlockInfoContiguous(BlockInfoContiguous from) { this(from, from.bc.getBlockReplication()); this.bc = from.bc; } {code} We have observed that some DataNodes keeps failing doing block reports with NameNode. The stacktrace is as follows. Though we are not using the latest version, the problem still exists. {quote} 2015-03-08 19:28:13,442 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.(BlockInfo.java:80) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockToMarkCorrupt.(BlockManager.java:1696) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.checkReplicaCorrupt(BlockManager.java:2185) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReportedBlock(BlockManager.java:2047) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiff(BlockManager.java:1950) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1823) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1750) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1069) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152) at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26382) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1623) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
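The NPE pattern above can be shown in isolation. The sketch below is a minimal, self-contained model whose class and field names are simplified stand-ins, not the actual HDFS source or the committed fix; it reproduces the unconditional {{from.bc}} dereference and illustrates one defensive alternative that falls back to a replication value stored on the block itself.

```java
// Minimal model of the NPE (names simplified; not the HDFS source or patch).
public class BlockCopyDemo {
    static class BlockCollection {
        final int replication;
        BlockCollection(int replication) { this.replication = replication; }
        int getBlockReplication() { return replication; }
    }

    static class BlockInfo {
        final BlockCollection bc;   // may legitimately be null
        final int replication;

        BlockInfo(int replication, BlockCollection bc) {
            this.replication = replication;
            this.bc = bc;
        }

        // Buggy pattern: dereferences from.bc unconditionally, as in the
        // copy constructor quoted above.
        static BlockInfo copyBuggy(BlockInfo from) {
            return new BlockInfo(from.bc.getBlockReplication(), from.bc); // NPE if bc == null
        }

        // Defensive pattern: use the replication kept on the block when the
        // owning collection is absent.
        static BlockInfo copySafe(BlockInfo from) {
            int r = (from.bc != null) ? from.bc.getBlockReplication() : from.replication;
            return new BlockInfo(r, from.bc);
        }
    }

    public static void main(String[] args) {
        BlockInfo orphan = new BlockInfo(3, null); // block with no collection
        System.out.println(BlockInfo.copySafe(orphan).replication); // prints 3
        try {
            BlockInfo.copyBuggy(orphan);
        } catch (NullPointerException e) {
            System.out.println("NPE"); // mirrors the block-report failure above
        }
    }
}
```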
[jira] [Commented] (HDFS-7785) Improve diagnostics information for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344247#comment-14344247 ] Chengbing Liu commented on HDFS-7785: - Thanks [~wheat9] for committing. > Improve diagnostics information for HttpPutFailedException > -- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Fix For: 2.7.0 > > Attachments: HDFS-7785.01.patch, HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7785) Improve diagnostics for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7785: Attachment: HDFS-7785.01.patch Re-upload to trigger Jenkins. (Cancelling and submitting does not work) > Improve diagnostics for HttpPutFailedException > -- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7785.01.patch, HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7785) Improve diagnostics for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7785: Status: Patch Available (was: Open) > Improve diagnostics for HttpPutFailedException > -- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7785) Improve diagnostics for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7785: Status: Open (was: Patch Available) > Improve diagnostics for HttpPutFailedException > -- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7785) Improve diagnostics for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340024#comment-14340024 ] Chengbing Liu commented on HDFS-7785: - The Jenkins message seems incorrect, since this patch does not include any tests. [~ste...@apache.org] Can you show me how to retrigger the build? Thanks. > Improve diagnostics for HttpPutFailedException > -- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7785) Improve diagnostics for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7785: Summary: Improve diagnostics for HttpPutFailedException (was: Add detailed message for HttpPutFailedException) > Improve diagnostics for HttpPutFailedException > -- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324161#comment-14324161 ] Chengbing Liu commented on HDFS-7798: - Thanks [~hitliuyi] for review and committing! > Checkpointing failure caused by shared KerberosAuthenticator > > > Key: HDFS-7798 > URL: https://issues.apache.org/jira/browse/HDFS-7798 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu >Priority: Critical > Fix For: 2.7.0 > > Attachments: HDFS-7798.01.patch > > > We have observed in our real cluster occasional checkpointing failure. The > standby NameNode was not able to upload image to the active NameNode. > After some digging, the root cause appears to be a shared > {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is > designed as a use-once instance, and is not stateless. It has attributes such > as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling > {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is > going to have race condition, resulting in a failed image uploading. > Therefore for the first step, without breaking the current API, I propose we > create a new {{KerberosAuthenticator}} instance for each connection, to make > checkpointing work. We may consider making {{Authenticator}} design and > implementation stateless afterwards, as {{ConnectionConfigurator}} does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7798: Affects Version/s: 2.6.0 > Checkpointing failure caused by shared KerberosAuthenticator > > > Key: HDFS-7798 > URL: https://issues.apache.org/jira/browse/HDFS-7798 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu >Priority: Critical > Attachments: HDFS-7798.01.patch > > > We have observed in our real cluster occasional checkpointing failure. The > standby NameNode was not able to upload image to the active NameNode. > After some digging, the root cause appears to be a shared > {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is > designed as a use-once instance, and is not stateless. It has attributes such > as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling > {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is > going to have race condition, resulting in a failed image uploading. > Therefore for the first step, without breaking the current API, I propose we > create a new {{KerberosAuthenticator}} instance for each connection, to make > checkpointing work. We may consider making {{Authenticator}} design and > implementation stateless afterwards, as {{ConnectionConfigurator}} does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321793#comment-14321793 ] Chengbing Liu commented on HDFS-7798: - The checkpointing failure happens when image uploading and edit log fetching occur at the same time. > Checkpointing failure caused by shared KerberosAuthenticator > > > Key: HDFS-7798 > URL: https://issues.apache.org/jira/browse/HDFS-7798 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Reporter: Chengbing Liu >Assignee: Chengbing Liu >Priority: Critical > Attachments: HDFS-7798.01.patch > > > We have observed in our real cluster occasional checkpointing failure. The > standby NameNode was not able to upload image to the active NameNode. > After some digging, the root cause appears to be a shared > {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is > designed as a use-once instance, and is not stateless. It has attributes such > as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling > {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is > going to have race condition, resulting in a failed image uploading. > Therefore for the first step, without breaking the current API, I propose we > create a new {{KerberosAuthenticator}} instance for each connection, to make > checkpointing work. We may consider making {{Authenticator}} design and > implementation stateless afterwards, as {{ConnectionConfigurator}} does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
Chengbing Liu created HDFS-7798: --- Summary: Checkpointing failure caused by shared KerberosAuthenticator Key: HDFS-7798 URL: https://issues.apache.org/jira/browse/HDFS-7798 Project: Hadoop HDFS Issue Type: Bug Components: security Reporter: Chengbing Liu Priority: Critical We have observed in our real cluster occasionally checkpointing failure. The standby NameNode was not able to upload image to the active NameNode. After some digging, the root cause appears to be a shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance, and is not stateless. It has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is going to have race condition, resulting in a failed image uploading. Therefore for the first step, without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
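The proposed first step (a fresh authenticator per connection) can be sketched as follows. This is an illustrative model only, using hypothetical names, not Hadoop's real {{KerberosAuthenticator}} API or the attached patch; it shows why per-request mutable state must not be shared and what per-connection construction looks like.

```java
// Illustrative sketch only. Authenticator here is a stand-in for a use-once,
// stateful authenticator; openConnection models the proposed fix of creating
// a fresh instance per connection instead of sharing one across threads.
public class PerConnectionAuth {
    static class Authenticator {
        String boundUrl; // mutable per-request state; racy if the instance is shared

        void authenticate(String url) {
            // With a shared instance, another thread could overwrite boundUrl
            // between this write and any later read, breaking the handshake.
            this.boundUrl = url;
        }
    }

    // Proposed pattern: one Authenticator per connection, so a concurrent
    // image upload and edit-log fetch never touch the same state.
    static Authenticator openConnection(String url) {
        Authenticator a = new Authenticator();
        a.authenticate(url);
        return a;
    }
}
```

Because each call constructs its own instance, no cross-thread interleaving can corrupt the URL or connection fields, which is the race the issue describes.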
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7798: Description: We have observed in our real cluster occasional checkpointing failure. The standby NameNode was not able to upload image to the active NameNode. After some digging, the root cause appears to be a shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance, and is not stateless. It has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is going to have race condition, resulting in a failed image uploading. Therefore for the first step, without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} does. was: We have observed in our real cluster occasionally checkpointing failure. The standby NameNode was not able to upload image to the active NameNode. After some digging, the root cause appears to be a shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance, and is not stateless. It has attributes such as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is going to have race condition, resulting in a failed image uploading. Therefore for the first step, without breaking the current API, I propose we create a new {{KerberosAuthenticator}} instance for each connection, to make checkpointing work. We may consider making {{Authenticator}} design and implementation stateless afterwards, as {{ConnectionConfigurator}} does. 
> Checkpointing failure caused by shared KerberosAuthenticator > > > Key: HDFS-7798 > URL: https://issues.apache.org/jira/browse/HDFS-7798 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Reporter: Chengbing Liu >Priority: Critical > > We have observed in our real cluster occasional checkpointing failure. The > standby NameNode was not able to upload image to the active NameNode. > After some digging, the root cause appears to be a shared > {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is > designed as a use-once instance, and is not stateless. It has attributes such > as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling > {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is > going to have race condition, resulting in a failed image uploading. > Therefore for the first step, without breaking the current API, I propose we > create a new {{KerberosAuthenticator}} instance for each connection, to make > checkpointing work. We may consider making {{Authenticator}} design and > implementation stateless afterwards, as {{ConnectionConfigurator}} does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7798: Attachment: HDFS-7798.01.patch > Checkpointing failure caused by shared KerberosAuthenticator > > > Key: HDFS-7798 > URL: https://issues.apache.org/jira/browse/HDFS-7798 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Reporter: Chengbing Liu >Priority: Critical > Attachments: HDFS-7798.01.patch > > > We have observed in our real cluster occasional checkpointing failure. The > standby NameNode was not able to upload image to the active NameNode. > After some digging, the root cause appears to be a shared > {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is > designed as a use-once instance, and is not stateless. It has attributes such > as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling > {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is > going to have race condition, resulting in a failed image uploading. > Therefore for the first step, without breaking the current API, I propose we > create a new {{KerberosAuthenticator}} instance for each connection, to make > checkpointing work. We may consider making {{Authenticator}} design and > implementation stateless afterwards, as {{ConnectionConfigurator}} does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7798) Checkpointing failure caused by shared KerberosAuthenticator
[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7798: Assignee: Chengbing Liu Status: Patch Available (was: Open) > Checkpointing failure caused by shared KerberosAuthenticator > > > Key: HDFS-7798 > URL: https://issues.apache.org/jira/browse/HDFS-7798 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Reporter: Chengbing Liu >Assignee: Chengbing Liu >Priority: Critical > Attachments: HDFS-7798.01.patch > > > We have observed in our real cluster occasional checkpointing failure. The > standby NameNode was not able to upload image to the active NameNode. > After some digging, the root cause appears to be a shared > {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is > designed as a use-once instance, and is not stateless. It has attributes such > as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling > {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is > going to have race condition, resulting in a failed image uploading. > Therefore for the first step, without breaking the current API, I propose we > create a new {{KerberosAuthenticator}} instance for each connection, to make > checkpointing work. We may consider making {{Authenticator}} design and > implementation stateless afterwards, as {{ConnectionConfigurator}} does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7785) Add detailed message for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7785: Attachment: HDFS-7785.01.patch > Add detailed message for HttpPutFailedException > --- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu > Attachments: HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7785) Add detailed message for HttpPutFailedException
[ https://issues.apache.org/jira/browse/HDFS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7785: Assignee: Chengbing Liu Status: Patch Available (was: Open) > Add detailed message for HttpPutFailedException > --- > > Key: HDFS-7785 > URL: https://issues.apache.org/jira/browse/HDFS-7785 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7785.01.patch > > > One of our namenode logs shows the following exception message. > ... > Caused by: > org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > ... > {{HttpPutFailedException}} should have its detailed information, such as > status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7785) Add detailed message for HttpPutFailedException
Chengbing Liu created HDFS-7785: --- Summary: Add detailed message for HttpPutFailedException Key: HDFS-7785 URL: https://issues.apache.org/jira/browse/HDFS-7785 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Chengbing Liu One of our namenode logs shows the following exception message. ... Caused by: org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) ... {{HttpPutFailedException}} should have its detailed information, such as status code and url, shown in the log to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
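One way the requested diagnostics could look is sketched below. This is a hypothetical illustration, not the committed patch; the class is a simplified stand-in for the real {{HttpPutFailedException}}, and the message format is an assumption.

```java
// Illustrative sketch: build an exception message that carries the HTTP
// status code and URL, so a log line like the one above is actionable.
// Names and message format are hypothetical, not the actual HDFS patch.
public class PutFailureDemo {
    static class HttpPutFailedException extends RuntimeException {
        HttpPutFailedException(String msg) { super(msg); }
    }

    static HttpPutFailedException putFailed(int statusCode, String url, String serverReply) {
        return new HttpPutFailedException(
            "Image uploading failed, status: " + statusCode
            + ", url: " + url + ", message: " + serverReply);
    }
}
```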
[jira] [Commented] (HDFS-7162) Wrong path when deleting through fuse-dfs a file which already exists in trash
[ https://issues.apache.org/jira/browse/HDFS-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156180#comment-14156180 ] Chengbing Liu commented on HDFS-7162: - I think there are some misunderstandings, probably because the title is not quite clear. So let me clarify what the patch actually does. Two problems are fixed in HDFS-7162.2.patch: - Say we want to delete the file {{/path/to/file}}, and somehow the file {{/user/yourname/.Trash/Current/path/to/file}} exists; we expect the file to be moved to {{/user/yourname/.Trash/Current/path/to/file.1}}. What the code actually did was move the file to {{/user/yourname/.Trash/Current/path/tofile.1}}, where a slash is missing. - When judging whether the file to be deleted ({{abs_path}}) is already in the trash, we compare {{trash_base}} with {{abs_path}}. The problem is exactly as Colin has pointed out. But I don't think we could just add a slash to the end of {{trash_base}}, since the given {{abs_path}} can be {{/user/yourname/.Trash/Current}} itself, with no trailing slash. In that case, adding a slash to the end of {{trash_base}} would not delete the whole {{/user/yourname/.Trash/Current}} directory. > Wrong path when deleting through fuse-dfs a file which already exists in trash > -- > > Key: HDFS-7162 > URL: https://issues.apache.org/jira/browse/HDFS-7162 > Project: Hadoop HDFS > Issue Type: Bug > Components: fuse-dfs >Affects Versions: 3.0.0, 2.5.1 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7162.2.patch, HDFS-7162.patch > > > HDFS-4913 lacks a slash in renaming existing trash file. Very small fix for > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7162) Wrong path when deleting through fuse-dfs a file which already exists in trash
[ https://issues.apache.org/jira/browse/HDFS-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152832#comment-14152832 ] Chengbing Liu commented on HDFS-7162: - [~cmccabe] Simply adding a slash at the end of {{trash_base}} won't work since the abs_path could be {{/user/yourname/.Trash/Current}}, which should and will not be deleted then. I have added another check for this in the second patch. And the previous fix was about the missing slash between {{target_dir}} and {{pcomp}}, which has nothing to do with the slash after {{Current}}. Please help review the new patch, thanks! > Wrong path when deleting through fuse-dfs a file which already exists in trash > -- > > Key: HDFS-7162 > URL: https://issues.apache.org/jira/browse/HDFS-7162 > Project: Hadoop HDFS > Issue Type: Bug > Components: fuse-dfs >Affects Versions: 3.0.0, 2.5.1 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7162.2.patch, HDFS-7162.patch > > > HDFS-4913 lacks a slash in renaming existing trash file. Very small fix for > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7162) Wrong path when deleting through fuse-dfs a file which already exists in trash
[ https://issues.apache.org/jira/browse/HDFS-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7162: Attachment: HDFS-7162.2.patch Updated patch. Now it handles the following abs_path: - /user/yourname/.Trash/Current - /user/yourname/.Trash/Current/ - /user/yourname/.Trash/Currently - /user/yourname/.Trash/Current/path/to/file > Wrong path when deleting through fuse-dfs a file which already exists in trash > -- > > Key: HDFS-7162 > URL: https://issues.apache.org/jira/browse/HDFS-7162 > Project: Hadoop HDFS > Issue Type: Bug > Components: fuse-dfs >Affects Versions: 3.0.0, 2.5.1 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7162.2.patch, HDFS-7162.patch > > > HDFS-4913 lacks a slash in renaming existing trash file. Very small fix for > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
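The two fixes discussed in this thread can be sketched as logic only. The real change is C code in fuse-dfs; the snippet below is a hypothetical Java illustration (helper names invented here) of the slash-aware join and the prefix check that distinguishes the four {{abs_path}} cases listed above.

```java
// Sketch of the two fixes (logic only; the actual patch is C in fuse-dfs,
// and these helper names are invented for illustration).
public class TrashPathDemo {
    static final String TRASH = "/user/yourname/.Trash/Current";

    // Fix 1: join path components with an explicit slash, so
    // ".../path/to" + "file" does not collapse into ".../path/tofile".
    static String join(String dir, String name) {
        return dir.endsWith("/") ? dir + name : dir + "/" + name;
    }

    // Fix 2: a path is "already in trash" if it IS the trash root or lives
    // under it. A bare startsWith(TRASH) would wrongly match ".../Currently";
    // requiring a trailing slash alone would wrongly exclude TRASH itself.
    static boolean inTrash(String absPath) {
        return absPath.equals(TRASH)
            || absPath.equals(TRASH + "/")
            || absPath.startsWith(TRASH + "/");
    }
}
```

Against the four cases above: {{/user/yourname/.Trash/Current}} and {{/user/yourname/.Trash/Current/}} are recognized as in trash, {{/user/yourname/.Trash/Currently}} is not, and {{/user/yourname/.Trash/Current/path/to/file}} is.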
[jira] [Updated] (HDFS-7162) Wrong path when deleting through fuse-dfs a file which already exists in trash
[ https://issues.apache.org/jira/browse/HDFS-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7162: Attachment: HDFS-7162.patch > Wrong path when deleting through fuse-dfs a file which already exists in trash > -- > > Key: HDFS-7162 > URL: https://issues.apache.org/jira/browse/HDFS-7162 > Project: Hadoop HDFS > Issue Type: Bug > Components: fuse-dfs >Affects Versions: 3.0.0, 2.5.1 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > Attachments: HDFS-7162.patch > > > HDFS-4913 lacks a slash in renaming existing trash file. Very small fix for > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7162) Wrong path when deleting through fuse-dfs a file which already exists in trash
[ https://issues.apache.org/jira/browse/HDFS-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated HDFS-7162: Assignee: Chengbing Liu Status: Patch Available (was: Open) Fix wrong path and remove a debug statement. > Wrong path when deleting through fuse-dfs a file which already exists in trash > -- > > Key: HDFS-7162 > URL: https://issues.apache.org/jira/browse/HDFS-7162 > Project: Hadoop HDFS > Issue Type: Bug > Components: fuse-dfs >Affects Versions: 2.5.1, 3.0.0 >Reporter: Chengbing Liu >Assignee: Chengbing Liu > > HDFS-4913 lacks a slash in renaming existing trash file. Very small fix for > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7162) Wrong path when deleting through fuse-dfs a file which already exists in trash
Chengbing Liu created HDFS-7162: --- Summary: Wrong path when deleting through fuse-dfs a file which already exists in trash Key: HDFS-7162 URL: https://issues.apache.org/jira/browse/HDFS-7162 Project: Hadoop HDFS Issue Type: Bug Components: fuse-dfs Affects Versions: 2.5.1, 3.0.0 Reporter: Chengbing Liu HDFS-4913 lacks a slash in renaming existing trash file. Very small fix for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)