[jira] [Created] (HDFS-15069) DecommissionMonitor thread will block forever while it encountered an unchecked exception.
Xudong Cao created HDFS-15069:
Summary: DecommissionMonitor thread will block forever while it encountered an unchecked exception.
Key: HDFS-15069
URL: https://issues.apache.org/jira/browse/HDFS-15069
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.1.3
Reporter: Xudong Cao
Assignee: Xudong Cao
Attachments: stack_on_16_12.png, stack_on_16_42.png

More than once, we have observed that while decommissioning a large number of DNs, the DecommissionMonitor-0 thread stops being scheduled and blocks for a long time, with no exception logs or notifications at all. For example, we recently decommissioned 65 DNs at the same time, each DN holding about 10 TB, and the DecommissionMonitor-0 thread blocked for about 15 days.

The stack of DecommissionMonitor-0 looks like this:
1. stack on 2019.12.17 16:12 !stack_on_16_12.png!
2. stack on 2019.12.17 16:42 !stack_on_16_42.png!

Over that half hour the thread was not scheduled at all; its Waited count did not change.

We think the cause of the problem is:
1. The DecommissionMonitor task submitted by the NameNode encounters an unchecked exception while running, after which the task is never executed again.
2. The NameNode never inspects the task's ScheduledFuture and never calls ScheduledFuture.get(), so the unchecked exception thrown by the task sits there unobserved.

The resulting symptoms are:
1. The ScheduledExecutorService thread DecommissionMonitor-0 blocks forever in ThreadPoolExecutor.getTask().
2. The previously submitted DecommissionMonitor task is never executed again.
3. No logs or notifications tell us what happened.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
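The behavior reported above can be reproduced outside HDFS with a small sketch (the class name and the simulated task are ours, not HDFS code): a periodic task submitted via scheduleAtFixedRate that throws an unchecked exception is silently cancelled, nothing is logged, and the exception only becomes visible if someone calls ScheduledFuture.get().

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SwallowedExceptionDemo {

    // Runs a 50 ms periodic task that throws on its 3rd execution, waits
    // long enough for ~12 periods, and returns how often it actually ran.
    static int runDemo() throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();

        ScheduledFuture<?> future = pool.scheduleAtFixedRate(() -> {
            if (runs.incrementAndGet() == 3) {
                // Unchecked exception: per the ScheduledExecutorService
                // contract, all subsequent executions are suppressed.
                throw new IllegalStateException("simulated monitor failure");
            }
        }, 0, 50, TimeUnit.MILLISECONDS);

        Thread.sleep(600); // ~12 periods; without the throw we'd expect ~12 runs

        // The exception is only observable here; it was never logged anywhere.
        try {
            future.get();
        } catch (ExecutionException e) {
            System.out.println("hidden cause: " + e.getCause());
        }
        pool.shutdownNow();
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("task ran " + runDemo() + " times");
    }
}
```

The task runs exactly 3 times and then never again, while the worker thread idles in the pool's queue wait, matching the ThreadPoolExecutor.getTask() frames in the attached stacks.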
[jira] [Updated] (HDFS-15069) DecommissionMonitor thread will block forever while it encountered an unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069: Description: wording-only edit ("dns" → "DNs"); otherwise identical to the report above.
[jira] [Updated] (HDFS-15069) DecommissionMonitor thread will block forever while it encountered an unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069: Description: wording-only edit; otherwise identical to the report above.
[jira] [Updated] (HDFS-15069) DecommissionMonitor thread will block forever while it encountered an unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069: Description: wording-only edit; otherwise identical to the report above.
[jira] [Updated] (HDFS-15069) DecommissionMonitor thread will block forever while it encountered an unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069: Description: wording-only edit ("ThreadPoolExecutor.getTask ()" → "ThreadPoolExecutor.getTask()"); otherwise identical to the report above.
[jira] [Updated] (HDFS-15069) DecommissionMonitor-0 thread will block forever while the timer moniter task encountered an unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069: Summary: DecommissionMonitor-0 thread will block forever while the timer moniter task encountered an unchecked exception. (was: DecommissionMonitor thread will block forever while it encountered an unchecked exception.) The issue body is unchanged from the report above.
[jira] [Updated] (HDFS-15069) DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069: Summary: DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exception. (was: DecommissionMonitor-0 thread will block forever while the timer moniter task encountered an unchecked exception.) The issue body is unchanged from the report above.
[jira] [Updated] (HDFS-15069) DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069: Description: appended the following; the rest is unchanged from the report above:

A possible solution:
1. Do not use a thread pool to execute the decommission monitor task; instead, introduce a separate thread to do this.
[jira] [Updated] (HDFS-15069) DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exceptions.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069:
--
Summary: DecommissionMonitor-0 thread will block forever if its scheduled timer task encounters an unchecked exception. (was: DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exception.)

> DecommissionMonitor-0 thread will block forever if its scheduled timer task
> encounters an unchecked exception.
>
> Key: HDFS-15069
> URL: https://issues.apache.org/jira/browse/HDFS-15069
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.1.3
> Reporter: Xudong Cao
> Assignee: Xudong Cao
> Priority: Major
> Attachments: stack_on_16_12.png, stack_on_16_42.png
>
> More than once, we have observed that while decommissioning a large number
> of DNs, the DecommissionMonitor-0 thread stops scheduling and blocks for a
> long time, with no exception logs or notifications at all.
> e.g. Recently we decommissioned 65 DNs at the same time, each DN holding
> about 10 TB, and the DecommissionMonitor-0 thread blocked for about 15 days.
> The stack of DecommissionMonitor-0 looks like this:
> # stack on 2019.12.17 16:12 !stack_on_16_12.png!
> # stack on 2019.12.17 16:42 !stack_on_16_42.png!
> Over that half hour the thread was never scheduled at all; its Waited count
> did not change.
> We think the cause of the problem is:
> # The DecommissionMonitor task submitted by the NameNode hit an unchecked
> exception while running, after which the task is never executed again.
> # The NameNode does not track the ScheduledFuture of this task and never
> calls ScheduledFuture.get(), so the unchecked exception thrown by the task
> sits in the future, unseen.
> After that, the observable symptoms are:
> # The ScheduledExecutorService thread DecommissionMonitor-0 blocks forever
> in ThreadPoolExecutor.getTask().
> # The previously submitted DecommissionMonitor task is never executed again.
> # No logs or notifications tell us what happened.
> A possible solution:
> # Do not use a thread pool to execute the decommission monitor task;
> instead, introduce a dedicated thread for it.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
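The failure mode described above is a documented property of ScheduledExecutorService: if a periodic task throws, all subsequent executions are suppressed, and the exception surfaces only if someone calls get() on the returned ScheduledFuture. A standalone demonstration (not NameNode code) of the silent death:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SilentTaskDeath {
    // Returns how many times the periodic task actually ran.
    static int demo() throws Exception {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();
        ScheduledFuture<?> future = pool.scheduleAtFixedRate(() -> {
            if (runs.incrementAndGet() == 3) {
                // An unchecked exception: the executor silently cancels all
                // further executions of this task.
                throw new RuntimeException("simulated bug");
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(300);  // long enough for ~30 periods
        System.out.println("runs = " + runs.get());        // stuck at 3
        System.out.println("isDone = " + future.isDone()); // true: task is dead
        try {
            future.get();   // the only place the exception ever surfaces
        } catch (ExecutionException e) {
            System.out.println("cause: " + e.getCause().getMessage());
        }
        pool.shutdownNow();
        return runs.get();
    }

    public static void main(String[] args) throws Exception {
        demo();
    }
}
```

Since the NameNode never calls get() on the future, the `cause:` line never happens in production, which matches the "no logs at all" symptom.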
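The first fix proposed in this thread is to replace the scheduled pool with a dedicated monitor thread, in the style of HeartbeatManager and ReplicationMonitor. A minimal sketch; the names (MonitorThread, check(), intervalMs) are illustrative, not the real NameNode API:

```java
// Sketch of a dedicated daemon thread for the decommission check: a failed
// iteration is caught and logged, and the loop simply continues.
public class MonitorThread extends Thread {
    private final long intervalMs;
    private volatile boolean running = true;

    public MonitorThread(long intervalMs) {
        super("DecommissionMonitor");
        setDaemon(true);
        this.intervalMs = intervalMs;
    }

    // Placeholder for the real per-interval decommission-progress scan.
    protected void check() { }

    @Override
    public void run() {
        while (running) {
            try {
                check();
            } catch (Throwable t) {
                // Unlike a ScheduledExecutorService task, an unchecked
                // exception is logged here and the loop survives.
                System.err.println("DecommissionMonitor iteration failed: " + t);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                running = false;
            }
        }
    }

    public void shutdown() {
        running = false;
        interrupt();
    }
}
```

The owning-thread pattern trades the pool's scheduling conveniences for explicit, visible error handling, which is exactly what the bug report says is missing.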
[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998990#comment-16998990 ] Hadoop QA commented on HDFS-15068:
-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 45s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 19m 24s | trunk passed |
| +1 | compile | 0m 59s | trunk passed |
| +1 | checkstyle | 0m 45s | trunk passed |
| +1 | mvnsite | 1m 9s | trunk passed |
| +1 | shadedclient | 14m 41s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 13s | trunk passed |
| +1 | javadoc | 1m 14s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 58s | the patch passed |
| +1 | compile | 0m 54s | the patch passed |
| +1 | javac | 0m 54s | the patch passed |
| -0 | checkstyle | 0m 40s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 226 unchanged - 1 fixed = 227 total (was 227) |
| +1 | mvnsite | 0m 59s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 24s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 20s | the patch passed |
| +1 | javadoc | 1m 7s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 103m 24s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 31s | The patch does not generate ASF License warnings. |
| | | 165m 14s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDeadNodeDetection |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image: yetus/hadoop:e573ea49085 |
| JIRA Issue | HDFS-15068 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989070/HDFS-15068.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux c000dade79f0 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 92c8962 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28538/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28538/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28538/testReport/ |
| Max. process+thread count | 2715 (vs. ulimit of 5500) |
[jira] [Updated] (HDFS-15069) DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exceptions.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao updated HDFS-15069: -- Description: More than once, we have observed that during decommissioning of a large number of DNs, the thread DecommissionMonitor-0 will stop scheduling, blocking for a long time, and there will be no exception logs or notifications at all. e.g. Recently, we are decommissioning 65 DNs at the same time, each DN about 10TB, and the DecommissionMonitor-0 thread blocked for about 15 days. The stack of DecommissionMonitor-0 looks like this: # stack on 2019.12.17 16:12 !stack_on_16_12.png! # stack on 2019.12.17 16:42 !stack_on_16_42.png! It can be seen that during half an hour, this thread has not been scheduled at all, its Waited count has not changed. We think the cause of the problem is: # The DecommissionMonitor task submitted by NameNode encounters an unchecked exception during its running , and then this task will be never executed again. # But NameNode does not care about the ScheduledFuture of this task, and never calls ScheduledFuture.get(), so the unchecked exception thrown by the task above will always be placed there, no one knows. After that, the subsequent phenomenon is: # The ScheduledExecutorService thread DecommissionMonitor-0 will block forever in ThreadPoolExecutor.getTask(). # The previously submitted task DecommissionMonitor will be never executed again. # No logs or notifications can let us know exactly what had happened. Possible solutions: # Do not use thread pool to execute decommission monitor task, alternatively we can introduce a separate thread to do this, just like HeartbeatManager, ReplicationMonitor, LeaseManager, BlockReportThread, and so on. OR 2. Catch all exceptions in decommission monitor task's run() method, so it does not throw any exceptions. I prefer the second option. 
was: More than once, we have observed that during decommissioning of a large number of DNs, the thread DecommissionMonitor-0 will stop scheduling, blocking for a long time, and there will be no exception logs or notifications at all. e.g. Recently, we are decommissioning 65 DNs at the same time, each DN about 10TB, and the DecommissionMonitor-0 thread blocked for about 15 days. The stack of DecommissionMonitor-0 looks like this: # stack on 2019.12.17 16:12 !stack_on_16_12.png! # stack on 2019.12.17 16:42 !stack_on_16_42.png! It can be seen that during half an hour, this thread has not been scheduled at all, its Waited count has not changed. We think the cause of the problem is: # The DecommissionMonitor task submitted by NameNode encounters an unchecked exception during its running , and then this task will be never executed again. # But NameNode does not care about the ScheduledFuture of this task, and never calls ScheduledFuture.get(), so the unchecked exception thrown by the task above will always be placed there, no one knows. After that, the subsequent phenomenon is: # The ScheduledExecutorService thread DecommissionMonitor-0 will block forever in ThreadPoolExecutor.getTask(). # The previously submitted task DecommissionMonitor will be never executed again. # No logs or notifications can let us know exactly what had happened. Possible solutions: # Do not use thread pool to execute decommission monitor task, alternatively we can introduce a separate thread to do this, just like HeartbeatManager, ReplicationMonitor, LeaseManager, BlockReportThread, and so on. OR 2. Catch all exceptions in decommission monitor task's run() method, so it does not throw any exceptions. > DecommissionMonitor-0 thread will block forever while its timer task > scheduled encountered any unchecked exceptions. 
> > > Key: HDFS-15069 > URL: https://issues.apache.org/jira/browse/HDFS-15069 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Major > Attachments: stack_on_16_12.png, stack_on_16_42.png > > > More than once, we have observed that during decommissioning of a large > number of DNs, the thread DecommissionMonitor-0 will stop scheduling, > blocking for a long time, and there will be no exception logs or > notifications at all. > e.g. Recently, we are decommissioning 65 DNs at the same time, each DN about > 10TB, and the DecommissionMonitor-0 thread blocked for about 15 days. > The stack of DecommissionMonitor-0 looks like this: > # stack on 2019.12.17 16:12 !stack_on_16_12.png! > # stack on 2019.12.17 16:42 !stack_on_16_42.png! > It can be seen that during half an hour, this thread has not been scheduled > at all
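The two-step failure mode described above is easy to reproduce outside HDFS. The following sketch uses only the JDK (it is not HDFS code; the class and message names are invented for illustration): a periodic task submitted via scheduleAtFixedRate is silently cancelled after it throws an unchecked exception, while the second proposed fix, catching everything inside run(), keeps it alive.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ScheduledTaskDemo {
    /** Runs a periodic task for ~200 ms and returns how many times it ran. */
    public static int runIterations(boolean catchInsideRun) throws Exception {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();
        Runnable task = () -> {
            runs.incrementAndGet();
            if (catchInsideRun) {
                try {
                    throw new IllegalStateException("simulated monitor bug");
                } catch (RuntimeException e) {
                    // Option 2 from the issue: swallow (and, in real code, log)
                    // so the executor never sees the exception and keeps
                    // scheduling the task.
                }
            } else {
                // Without the catch, the exception escapes run(): the executor
                // cancels the periodic task, and nothing is logged anywhere.
                throw new IllegalStateException("simulated monitor bug");
            }
        };
        pool.scheduleAtFixedRate(task, 0, 10, TimeUnit.MILLISECONDS);
        Thread.sleep(200);
        pool.shutdownNow();
        return runs.get();
    }

    public static void main(String[] args) throws Exception {
        // The uncaught variant stops after exactly one run.
        System.out.println("uncaught: " + runIterations(false));
        System.out.println("caught:   " + runIterations(true));
    }
}
```

Note that ScheduledFuture.get() would surface the suppressed exception as an ExecutionException, but as the description says, nothing in the NameNode ever calls it.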
[jira] [Commented] (HDFS-15062) Add LOG when sendIBRs failed
[ https://issues.apache.org/jira/browse/HDFS-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999281#comment-16999281 ] Fei Hui commented on HDFS-15062: [~hexiaoqiao] Thanks for reminding me. HDFS-14997 is great work; it could resolve the problem of delayed IBR sending. This JIRA just adds key logs for quick troubleshooting. > Add LOG when sendIBRs failed > > > Key: HDFS-15062 > URL: https://issues.apache.org/jira/browse/HDFS-15062 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15062.001.patch, HDFS-15062.002.patch > > > {code} > /** Send IBRs to namenode. */ > void sendIBRs(DatanodeProtocol namenode, DatanodeRegistration registration, > String bpid, String nnRpcLatencySuffix) throws IOException { > // Generate a list of the pending reports for each storage under the lock > final StorageReceivedDeletedBlocks[] reports = generateIBRs(); > if (reports.length == 0) { > // Nothing new to report. > return; > } > // Send incremental block reports to the Namenode outside the lock > if (LOG.isDebugEnabled()) { > LOG.debug("call blockReceivedAndDeleted: " + Arrays.toString(reports)); > } > boolean success = false; > final long startTime = monotonicNow(); > try { > namenode.blockReceivedAndDeleted(registration, bpid, reports); > success = true; > } finally { > if (success) { > dnMetrics.addIncrementalBlockReport(monotonicNow() - startTime, > nnRpcLatencySuffix); > lastIBR = startTime; > } else { > // If we didn't succeed in sending the report, put all of the > // blocks back onto our queue, but only in the case where we > // didn't put something newer in the meantime. > putMissing(reports); > } > } > } > {code} > When the call to namenode.blockReceivedAndDeleted fails, the reports will be put back into > pendingIBRs. Maybe we should add a log for the failed case. 
It is helpful for > troubleshooting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
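As a rough illustration of the change being discussed, a WARN in the failure branch of sendIBRs()'s success-flag try/finally pattern would make the silent re-queue visible. The sketch below is self-contained and hypothetical: the Rpc interface and the lastWarn field stand in for the real DatanodeProtocol call and the DataNode's LOG, and are not Hadoop APIs.

```java
public class IbrLogSketch {
    static String lastWarn;  // stands in for LOG.warn(...)

    /** Minimal stand-in for the namenode.blockReceivedAndDeleted(...) RPC. */
    interface Rpc {
        void blockReceivedAndDeleted() throws Exception;
    }

    static boolean sendIBRs(Rpc namenode, String bpid) {
        boolean success = false;
        try {
            namenode.blockReceivedAndDeleted();
            success = true;
        } catch (Exception e) {
            // swallowed here only to keep the sketch self-contained;
            // the real method propagates IOException to its caller
        } finally {
            if (!success) {
                // the proposed addition: record *why* reports were re-queued
                lastWarn = "Failed to send IBRs to namenode for block pool "
                        + bpid + ", re-queuing reports";
            }
        }
        return success;
    }
}
```

The key point is that the existing else branch already knows the report failed; the patch only has to say so in the log.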
[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999318#comment-16999318 ] Xiaoqiao He commented on HDFS-15068: Hi [~Aiphag0], v002 is better than the previous one. Some minor comments: a. The new unit test passes when run locally without the fix; please help check it. b. It is better to avoid waiting fixed times such as {{Thread.sleep(500)}} in unit tests; there are more graceful ways, such as {{CountDownLatch}}, to wait until something happens. c. We should assert the expected result at the end of the unit test. The remainder looks good to me. Thanks. > DataNode could meet deadlock if invoke refreshVolumes when register > --- > > Key: HDFS-15068 > URL: https://issues.apache.org/jira/browse/HDFS-15068 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Xiaoqiao He >Assignee: Aiphago >Priority: Critical > Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch > > > DataNode can deadlock when `dfsadmin -reconfig datanode ip:host > start` is invoked to trigger #refreshVolumes. > 1. DataNode#refreshVolumes first holds the datanode instance ownable {{synchronizer}} > when entering this method, then tries to hold the BPOfferService {{readlock}} > in `bpos.getNamespaceInfo()` in the following code segment. > {code:java} > for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) { > nsInfos.add(bpos.getNamespaceInfo()); > } > {code} > 2. BPOfferService#registrationSucceeded (which is invoked by #register when the > DataNode starts, or by #reregister in processCommandFromActor) holds the > BPOfferService {{writelock}} first, then tries to hold the datanode instance > ownable {{synchronizer}} in the following method. 
> {code:java} > synchronized void bpRegistrationSucceeded(DatanodeRegistration > bpRegistration, > String blockPoolId) throws IOException { > id = bpRegistration; > if(!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) { > throw new IOException("Inconsistent Datanode IDs. Name-node returned " > + bpRegistration.getDatanodeUuid() > + ". Expecting " + storage.getDatanodeUuid()); > } > > registerBlockPoolWithSecretManager(bpRegistration, blockPoolId); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
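Point (b) of the review comment above, replacing {{Thread.sleep(500)}} with a {{CountDownLatch}}, can be sketched as follows. The class and method names are illustrative, not taken from the actual test:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchInsteadOfSleep {
    /**
     * Instead of sleeping a fixed 500 ms and hoping the background work has
     * finished, the worker counts down a latch and the test awaits it with a
     * generous timeout. The test returns as soon as the work completes, and
     * fails deterministically (returns false) if it never does.
     */
    public static boolean runWorkerAndWait() throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            // ... the action under test would happen here ...
            done.countDown();  // signal completion deterministically
        });
        worker.start();
        return done.await(5, TimeUnit.SECONDS);
    }
}
```

This pattern is both faster (no mandatory sleep) and less flaky (no dependence on the machine being quick enough within 500 ms).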
[jira] [Commented] (HDFS-13101) Yet another fsimage corruption related to snapshot
[ https://issues.apache.org/jira/browse/HDFS-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999379#comment-16999379 ] Hudson commented on HDFS-13101: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17773 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17773/]) HDFS-15012. NN fails to parse Edit logs after applying HDFS-13101. (shashikant: rev fdd96e46d1f89f0ecdb9b1836dc7fca9fbb954fd) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/DirectoryWithSnapshotFeature.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestRenameWithSnapshots.java > Yet another fsimage corruption related to snapshot > -- > > Key: HDFS-13101 > URL: https://issues.apache.org/jira/browse/HDFS-13101 > Project: Hadoop HDFS > Issue Type: Bug > Components: snapshots >Reporter: Yongjun Zhang >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3 > > Attachments: HDFS-13101.001.patch, HDFS-13101.002.patch, > HDFS-13101.003.patch, HDFS-13101.004.patch, HDFS-13101.branch-2.001.patch, > HDFS-13101.branch-2.8.patch, HDFS-13101.corruption_repro.patch, > HDFS-13101.corruption_repro_simplified.patch > > > Lately we saw case similar to HDFS-9406, even though HDFS-9406 fix is > present, so it's likely another case not covered by the fix. We are currently > trying to collect good fsimage + editlogs to replay to reproduce it and > investigate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15012) NN fails to parse Edit logs after applying HDFS-13101
[ https://issues.apache.org/jira/browse/HDFS-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999378#comment-16999378 ] Hudson commented on HDFS-15012: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17773 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17773/]) HDFS-15012. NN fails to parse Edit logs after applying HDFS-13101. (shashikant: rev fdd96e46d1f89f0ecdb9b1836dc7fca9fbb954fd) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/DirectoryWithSnapshotFeature.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestRenameWithSnapshots.java > NN fails to parse Edit logs after applying HDFS-13101 > - > > Key: HDFS-15012 > URL: https://issues.apache.org/jira/browse/HDFS-15012 > Project: Hadoop HDFS > Issue Type: Bug > Components: nn >Reporter: Eric Lin >Assignee: Shashikant Banerjee >Priority: Blocker > Labels: release-blocker > Attachments: HDFS-15012.000.patch, HDFS-15012.001.patch > > > After applying HDFS-13101, and deleting and creating large number of > snapshots, SNN exited with below error: > > {code:sh} > 2019-11-18 08:28:06,528 ERROR > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception > on operation DeleteSnapshotOp [snapshotRoot=/path/to/hdfs/file, > snapshotName=distcp-3479-31-old, > RpcClientId=b16a6cb5-bdbb-45ae-9f9a-f7dc57931f37, Rpc > CallId=1] > java.lang.AssertionError: Element already exists: > element=partition_isactive=true, DELETED=[partition_isactive=true] > at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193) > at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239) > at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462) > at > org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:240) > at > 
org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:250) > at > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:755) > at > org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:753) > at > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:790) > at > org.apache.hadoop.hdfs.server.namenode.INodeReference.cleanSubtree(INodeReference.java:332) > at > org.apache.hadoop.hdfs.server.namenode.INodeReference$WithName.cleanSubtree(INodeReference.java:583) > at > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:760) > at > org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:753) > at > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:790) > at > org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.removeSnapshot(DirectorySnapshottableFeature.java:235) > at > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.removeSnapshot(INodeDirectory.java:259) > at > org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.deleteSnapshot(SnapshotManager.java:301) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:688) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:903) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:756) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:324) > at > 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1144) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:796) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:844) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:823) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:
[jira] [Commented] (HDFS-15069) DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exceptions.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999396#comment-16999396 ] Íñigo Goiri commented on HDFS-15069: Is this related to HDFS-12703? > DecommissionMonitor-0 thread will block forever while its timer task > scheduled encountered any unchecked exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15069) DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exceptions.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999399#comment-16999399 ] Wei-Chiu Chuang commented on HDFS-15069: [~sodonnell] does this look like a possible culprit for the stuck decommissioning you observed before? > DecommissionMonitor-0 thread will block forever while its timer task > scheduled encountered any unchecked exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15062) Add LOG when sendIBRs failed
[ https://issues.apache.org/jira/browse/HDFS-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999402#comment-16999402 ] Íñigo Goiri commented on HDFS-15062: [~weichiu] also brought up whether we need to add more information to the log message. He proposed the duration, but maybe there are other things too. [~ferhui], please give it a thought and see whether we should add more info. > Add LOG when sendIBRs failed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15012) NN fails to parse Edit logs after applying HDFS-13101
[ https://issues.apache.org/jira/browse/HDFS-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated HDFS-15012: --- Fix Version/s: 2.8.0 2.9.0 3.1.0 2.10.0 3.2.0 3.3.0 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~ericlin] for helping discover the issue. Thanks [~arp], [~szetszwo], [~weichiu], [~ayushtkn] [~surendrasingh] for the review and feedback. I have committed this. > NN fails to parse Edit logs after applying HDFS-13101 > -
[jira] [Comment Edited] (HDFS-15012) NN fails to parse Edit logs after applying HDFS-13101
[ https://issues.apache.org/jira/browse/HDFS-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999407#comment-16999407 ] Shashikant Banerjee edited comment on HDFS-15012 at 12/18/19 6:12 PM: -- Thanks [~ericlin] for helping discover the issue. Thanks [~arp], [~szetszwo], [~weichiu], [~ayushtkn] [~surendrasingh] for the review and feedback. I have committed this. The reported findbugs issue is not related. was (Author: shashikant): Thanks [~ericlin] for helping discover the issue. Thanks [~arp], [~szetszwo], [~weichiu], [~ayushtkn] [~surendrasingh] for the review and feedback. I have committed this. > NN fails to parse Edit logs after applying HDFS-13101 > -
[jira] [Commented] (HDFS-15069) DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exceptions.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999431#comment-16999431 ] Stephen O'Donnell commented on HDFS-15069: -- It is one way decommissioning can get stuck, but it should be fixed by HDFS-12703. Most times when we see a given DN get stuck, a restart of the DN gets it moving again, so that case should not be related to this particular problem. A NN restart would be needed to fix the issue mentioned here. Does the cluster that hit this issue have HDFS-12703 included in the build? > DecommissionMonitor-0 thread will block forever while its timer task > scheduled encountered any unchecked exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15003) RBF: Make Router support storage type quota.
[ https://issues.apache.org/jira/browse/HDFS-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999434#comment-16999434 ] Ayush Saxena commented on HDFS-15003: - The two new RouterAdmin commands should be added in the doc too. > RBF: Make Router support storage type quota. > > > Key: HDFS-15003 > URL: https://issues.apache.org/jira/browse/HDFS-15003 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15003.001.patch, HDFS-15003.002.patch, > HDFS-15003.003.patch, HDFS-15003.004.patch, HDFS-15003.005.patch > > > Make Router support storage type quota. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14997) BPServiceActor process command from NameNode asynchronously
[ https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999460#comment-16999460 ] Wei-Chiu Chuang commented on HDFS-14997: Sorry, forgot to mention: I am +1 > BPServiceActor process command from NameNode asynchronously > --- > > Key: HDFS-14997 > URL: https://issues.apache.org/jira/browse/HDFS-14997 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, > HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch > > > There are two core functions in the #BPServiceActor main process flow: reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. If processCommand takes a long time, it blocks the report flow. And processCommand can indeed take a long time (over 1000s in the worst case I have met) when the IO load of the DataNode is very high: since some IO operations run under #datasetLock, processing some commands (such as #DNA_INVALIDATE) has to wait a long time to acquire #datasetLock. In such a case, #heartbeat is not sent to the NameNode in time, which triggers other disasters. > I propose to process #processCommand asynchronously so that it does not block #BPServiceActor from sending heartbeats back to the NameNode under high IO load. > Notes: > 1. Lifeline could be an effective solution; however, some old branches do not support this feature. > 2. IO operations under #datasetLock are a separate issue; I think we should solve it in another JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
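The proposal above amounts to handing NameNode commands to a separate executor so the heartbeat loop never waits on a slow command. A minimal sketch of that decoupling (structure hypothetical, not the actual HDFS-14997 patch):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncCommandSketch {
    // A single worker keeps commands executing in arrival order,
    // but off the heartbeat thread (names here are illustrative).
    private static final ExecutorService commandExecutor =
        Executors.newSingleThreadExecutor();

    /** Enqueues a command and returns how long the caller was blocked, in ms. */
    public static long submitCommand(Runnable command) {
        long t0 = System.nanoTime();
        commandExecutor.submit(command);   // heartbeat loop continues immediately
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        long blocked = submitCommand(() -> {
            // simulated slow command, e.g. a DNA_INVALIDATE waiting on datasetLock
            try { Thread.sleep(300); } catch (InterruptedException ignored) { }
        });
        System.out.println("heartbeat thread blocked for ~" + blocked + " ms");
        commandExecutor.shutdown();
        commandExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The trade-off is that command failures now surface on the worker thread, so the async path needs its own error reporting back to the actor.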
[jira] [Updated] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted
[ https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Becker updated HDFS-15031: Attachment: HDFS-15031.007.patch > Allow BootstrapStandby to download FSImage if the directory is already > formatted > > > Key: HDFS-15031 > URL: https://issues.apache.org/jira/browse/HDFS-15031 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Danny Becker >Assignee: Danny Becker >Priority: Minor > Attachments: HDFS-15031.000.patch, HDFS-15031.001.patch, > HDFS-15031.002.patch, HDFS-15031.003.patch, HDFS-15031.005.patch, > HDFS-15031.006.patch, HDFS-15031.007.patch > > > Currently, BootstrapStandby will only download the latest FSImage if it has > formatted the local image directory. This can be an issue when there are out > of date FSImages on a Standby NameNode, as the non-interactive mode will not > format the image directory, and BootstrapStandby will return an error code. > The changes here simply allow BootstrapStandby to download the latest FSImage > to the image directory, without needing to format first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15070) Crashing bugs in NameNode when using a valid configuration for `dfs.namenode.audit.loggers`
Xudong Sun created HDFS-15070: - Summary: Crashing bugs in NameNode when using a valid configuration for `dfs.namenode.audit.loggers` Key: HDFS-15070 URL: https://issues.apache.org/jira/browse/HDFS-15070 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.10.0 Reporter: Xudong Sun I am using Hadoop-2.10.0. The configuration parameter `dfs.namenode.audit.loggers` allows `default` (which is the default value) and `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`. When we use `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, the namenode will not start successfully because of an `InstantiationException` thrown from `org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers`. The root cause is that during namenode initialization, `initAuditLoggers` is called and tries to invoke the default constructor of `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, which does not have a default constructor, and thus the `InstantiationException` is thrown.
*Symptom*
*$ ./start-dfs.sh*
{code:java}
2019-12-18 14:05:20,670 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.lang.RuntimeException: java.lang.InstantiationException: org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1024)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:858)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:677)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:674)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:736)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:961)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:940)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1714)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1782)
Caused by: java.lang.InstantiationException: org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger
	at java.lang.Class.newInstance(Class.java:427)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1017)
	... 8 more
Caused by: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger.<init>()
	at java.lang.Class.getConstructor0(Class.java:3082)
	at java.lang.Class.newInstance(Class.java:412)
	... 9 more
{code}
*Detailed Root Cause*
There is no default constructor in `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`:
{code:java}
/**
 * An {@link AuditLogger} that sends logged data directly to the metrics
 * systems. It is used when the top service is used directly by the name node.
 */
@InterfaceAudience.Private
public class TopAuditLogger implements AuditLogger {
  public static final Logger LOG = LoggerFactory.getLogger(TopAuditLogger.class);

  private final TopMetrics topMetrics;

  public TopAuditLogger(TopMetrics topMetrics) {
    Preconditions.checkNotNull(topMetrics, "Cannot init with a null " +
        "TopMetrics");
    this.topMetrics = topMetrics;
  }

  @Override
  public void initialize(Configuration conf) {
  }
{code}
As long as the configuration parameter `dfs.namenode.audit.loggers` is set to `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, `initAuditLoggers` will try to call its default constructor to make a new instance:
{code:java}
private List<AuditLogger> initAuditLoggers(Configuration conf) {
  // Initialize the custom access loggers if configured.
  Collection<String> alClasses =
      conf.getTrimmedStringCollection(DFS_NAMENODE_AUDIT_LOGGERS_KEY);
  List<AuditLogger> auditLoggers = Lists.newArrayList();
  if (alClasses != null && !alClasses.isEmpty()) {
    for (String className : alClasses) {
      try {
        AuditLogger logger;
        if (DFS_NAMENODE_DEFAULT_AUDIT_LOGGER_NAME.equals(className)) {
          logger = new DefaultAuditLogger();
        } else {
          logger = (AuditLogger) Class.forName(className).newInstance();
        }
        logger.initialize(conf);
        auditLoggers.add(logger);
      } catch (RuntimeException re) {
        throw re;
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }
{code}
This is very different from the default configuration, `default`, whose logger implements a default constructor, so the default is fine.
*How To Reproduce*
The version of Hadoop: 2.10.0
# Set the value of configuration parameter `dfs.namenode.audit.loggers` to `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger` in "hdfs-site.xml" (the default value is `default`)
# Start the namenode by running "start-dfs.sh"
# The namenode will not be started successfully.
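The reflection failure can be reproduced with any class that lacks a no-arg constructor; `OneArgLogger` below is a hypothetical stand-in for TopAuditLogger, not Hadoop code:

```java
/** Hypothetical logger mirroring TopAuditLogger's shape: only a one-arg constructor. */
class OneArgLogger {
    private final String metrics;
    OneArgLogger(String metrics) { this.metrics = metrics; }
}

public class ReflectionDemo {
    /** Mirrors the Class.forName(className).newInstance() call in initAuditLoggers. */
    public static String instantiate(String className) {
        try {
            Object logger = Class.forName(className).newInstance();
            return logger.getClass().getSimpleName();
        } catch (InstantiationException e) {
            // Thrown when the class has no accessible no-arg constructor,
            // exactly as in the NameNode stack trace above.
            return "InstantiationException";
        } catch (ReflectiveOperationException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate(OneArgLogger.class.getName()));
    }
}
```

A fix on the Hadoop side would be either to add a no-arg constructor to TopAuditLogger or to have initAuditLoggers fall back to a constructor that takes the metrics object.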
[jira] [Assigned] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics
[ https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Becker reassigned HDFS-15071: --- Assignee: Danny Becker > Add DataNode Read and Write throughput percentile metrics > - > > Key: HDFS-15071 > URL: https://issues.apache.org/jira/browse/HDFS-15071 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs, metrics >Reporter: Danny Becker >Assignee: Danny Becker >Priority: Minor > > Add DataNode throughput metrics for read and write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics
Danny Becker created HDFS-15071: --- Summary: Add DataNode Read and Write throughput percentile metrics Key: HDFS-15071 URL: https://issues.apache.org/jira/browse/HDFS-15071 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, hdfs, metrics Reporter: Danny Becker Add DataNode throughput metrics for read and write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics
[ https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Becker updated HDFS-15071: Attachment: HDFS-15071.000.patch > Add DataNode Read and Write throughput percentile metrics > - > > Key: HDFS-15071 > URL: https://issues.apache.org/jira/browse/HDFS-15071 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs, metrics >Reporter: Danny Becker >Assignee: Danny Becker >Priority: Minor > Attachments: HDFS-15071.000.patch > > > Add DataNode throughput metrics for read and write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15070) Crashing bugs in NameNode when using a valid configuration for `dfs.namenode.audit.loggers`
[ https://issues.apache.org/jira/browse/HDFS-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tianyin Xu updated HDFS-15070: -- Description: I am using Hadoop-2.10.0. The configuration parameter `dfs.namenode.audit.loggers` allows `default` (which is the default value) and `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`. When we use `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, namenode will not be started successfully because of `InstantiationException `thrown from `org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers`. The root cause is that during the initialization time of namenode, `initAuditLoggers` will be called and it will try to call the default constructor of `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger` which actually doesn't have one default constructor implemented, and thus the `InstantiationException` exception is thrown. *Symptom* *$ ./start-dfs.sh* {code:java} 2019-12-18 14:05:20,670 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.java.lang.RuntimeException: java.lang.InstantiationException: org.apache.hadoop.hdfs.server.namenode.top.TopAuditLoggerat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1024)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:858)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:677)at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:674)at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:736)at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:961)at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:940)at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1714)at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1782)Caused by: java.lang.InstantiationException: 
org.apache.hadoop.hdfs.server.namenode.top.TopAuditLoggerat java.lang.Class.newInstance(Class.java:427)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1017)... 8 moreCaused by: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger.()at java.lang.Class.getConstructor0(Class.java:3082)at java.lang.Class.newInstance(Class.java:412) ... 9 more {code} *Detailed Root Cause* There is no default constructor in `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`: {code:java} /** * An {@link AuditLogger} that sends logged data directly to the metrics * systems. It is used when the top service is used directly by the name node */ @InterfaceAudience.Private public class TopAuditLogger implements AuditLogger { public static final Logger LOG = LoggerFactory.getLogger(TopAuditLogger.class); private final TopMetrics topMetrics; public TopAuditLogger(TopMetrics topMetrics) { Preconditions.checkNotNull(topMetrics, "Cannot init with a null " + "TopMetrics"); this.topMetrics = topMetrics; } @Override public void initialize(Configuration conf) { }{code} As long as the configuration parameter `dfs.namenode.audit.loggers` is set to `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, `initAuditLoggers` will try to call its default constructor to make a new instance: {code:java} private List initAuditLoggers(Configuration conf) { // Initialize the custom access loggers if configured. 
Collection alClasses = conf.getTrimmedStringCollection(DFS_NAMENODE_AUDIT_LOGGERS_KEY); List auditLoggers = Lists.newArrayList(); if (alClasses != null && !alClasses.isEmpty()) { for (String className : alClasses) { try { AuditLogger logger; if (DFS_NAMENODE_DEFAULT_AUDIT_LOGGER_NAME.equals(className)) { logger = new DefaultAuditLogger(); } else { logger = (AuditLogger) Class.forName(className).newInstance(); } logger.initialize(conf); auditLoggers.add(logger); } catch (RuntimeException re) { throw re; } catch (Exception e) { throw new RuntimeException(e); } } }{code} This is very different from the default configuration, `default`, which implements a default constructor so the default is fine. *How To Reproduce* The version of Hadoop: 2.10.0 # Set the value of configuration parameter `dfs.namenode.audit.loggers` to `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger` in "hdfs-site.xml"(the default value is `default`) # Start the namenode by running "start-dfs.sh" # The namenode will not be started successfully. {code:java} dfs.namenode.audit.loggers org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger List of classes implementing audit loggers that will receive audit events. These should be imp
[jira] [Updated] (HDFS-15070) Crashing bugs in NameNode when using a valid configuration for `dfs.namenode.audit.loggers`
[ https://issues.apache.org/jira/browse/HDFS-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Sun updated HDFS-15070: -- Description: I am using Hadoop-2.10.0. The configuration parameter `dfs.namenode.audit.loggers` allows `default` (which is the default value) and `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`. When I use `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, namenode will not be started successfully because of an `InstantiationException` thrown from `org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers`. The root cause is that while initializing namenode, `initAuditLoggers` will be called and it will try to call the default constructor of `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger` which doesn't have a default constructor. Thus the `InstantiationException` exception is thrown. *Symptom* *$ ./start-dfs.sh* {code:java} 2019-12-18 14:05:20,670 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.java.lang.RuntimeException: java.lang.InstantiationException: org.apache.hadoop.hdfs.server.namenode.top.TopAuditLoggerat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1024)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:858)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:677)at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:674)at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:736)at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:961)at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:940)at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1714)at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1782)Caused by: java.lang.InstantiationException: 
org.apache.hadoop.hdfs.server.namenode.top.TopAuditLoggerat java.lang.Class.newInstance(Class.java:427)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1017)... 8 moreCaused by: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger.()at java.lang.Class.getConstructor0(Class.java:3082)at java.lang.Class.newInstance(Class.java:412) ... 9 more {code} *Detailed Root Cause* There is no default constructor in `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`: {code:java} /** * An {@link AuditLogger} that sends logged data directly to the metrics * systems. It is used when the top service is used directly by the name node */ @InterfaceAudience.Private public class TopAuditLogger implements AuditLogger { public static finalLogger LOG = LoggerFactory.getLogger(TopAuditLogger.class); private final TopMetrics topMetrics; public TopAuditLogger(TopMetrics topMetrics) { Preconditions.checkNotNull(topMetrics, "Cannot init with a null " + "TopMetrics"); this.topMetrics = topMetrics; } @Override public void initialize(Configuration conf) { }{code} As long as the configuration parameter `dfs.namenode.audit.loggers` is set to `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, `initAuditLoggers` will try to call its default constructor to make a new instance: {code:java} private List initAuditLoggers(Configuration conf) { // Initialize the custom access loggers if configured. 
Collection alClasses = conf.getTrimmedStringCollection(DFS_NAMENODE_AUDIT_LOGGERS_KEY); List auditLoggers = Lists.newArrayList(); if (alClasses != null && !alClasses.isEmpty()) { for (String className : alClasses) { try { AuditLogger logger; if (DFS_NAMENODE_DEFAULT_AUDIT_LOGGER_NAME.equals(className)) { logger = new DefaultAuditLogger(); } else { logger = (AuditLogger) Class.forName(className).newInstance(); } logger.initialize(conf); auditLoggers.add(logger); } catch (RuntimeException re) { throw re; } catch (Exception e) { throw new RuntimeException(e); } } }{code} `initAuditLoggers` tries to call the default constructor to make a new instance in: {code:java} logger = (AuditLogger) Class.forName(className).newInstance();{code} This is very different from the default configuration, `default`, which implements a default constructor so the default is fine. *How To Reproduce* The version of Hadoop: 2.10.0 # Set the value of configuration parameter `dfs.namenode.audit.loggers` to `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger` in "hdfs-site.xml"(the default value is `default`) # Start the namenode by running "start-dfs.sh" # The namenode will not be started successfully. {code:java} dfs.namenode.audit.loggers org.apache.hadoop.hdfs.server.namenode.
[jira] [Commented] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted
[ https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999570#comment-16999570 ] Hadoop QA commented on HDFS-15031: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 45s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 17s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 11s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}103m 13s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}166m 59s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestDeadNodeDetection | | | hadoop.hdfs.TestDecommissionWithStriped | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:e573ea49085 | | JIRA Issue | HDFS-15031 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989136/HDFS-15031.007.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux d05deb737f74 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / fddc3d5 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28539/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28539/testReport/ | | Max. process+thread count | 2881 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28539/console | | Powered by | Apache Y
[jira] [Updated] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics
[ https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-15071: --- Status: Patch Available (was: Open) > Add DataNode Read and Write throughput percentile metrics > - > > Key: HDFS-15071 > URL: https://issues.apache.org/jira/browse/HDFS-15071 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs, metrics >Reporter: Danny Becker >Assignee: Danny Becker >Priority: Minor > Attachments: HDFS-15071.000.patch > > > Add DataNode throughput metrics for read and write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics
[ https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999600#comment-16999600 ] Íñigo Goiri commented on HDFS-15071: Thanks [~dannytbecker] for the patch. * Can we add the description of dfs.metrics.small.read and dfs.metrics.small.write to hdfs-default.xml and as a javadoc? * Can we add a javadoc to addWriteThroughput and addReadThroughput? > Add DataNode Read and Write throughput percentile metrics > - > > Key: HDFS-15071 > URL: https://issues.apache.org/jira/browse/HDFS-15071 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs, metrics >Reporter: Danny Becker >Assignee: Danny Becker >Priority: Minor > Attachments: HDFS-15071.000.patch > > > Add DataNode throughput metrics for read and write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted
[ https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999602#comment-16999602 ] Íñigo Goiri commented on HDFS-15031: +1 on [^HDFS-15031.007.patch]. > Allow BootstrapStandby to download FSImage if the directory is already > formatted > > > Key: HDFS-15031 > URL: https://issues.apache.org/jira/browse/HDFS-15031 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, namenode >Reporter: Danny Becker >Assignee: Danny Becker >Priority: Minor > Attachments: HDFS-15031.000.patch, HDFS-15031.001.patch, > HDFS-15031.002.patch, HDFS-15031.003.patch, HDFS-15031.005.patch, > HDFS-15031.006.patch, HDFS-15031.007.patch > > > Currently, BootstrapStandby will only download the latest FSImage if it has > formatted the local image directory. This can be an issue when there are out > of date FSImages on a Standby NameNode, as the non-interactive mode will not > format the image directory, and BootstrapStandby will return an error code. > The changes here simply allow BootstrapStandby to download the latest FSImage > to the image directory, without needing to format first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics
[ https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Becker updated HDFS-15071: Attachment: HDFS-15071.001.patch > Add DataNode Read and Write throughput percentile metrics > - > > Key: HDFS-15071 > URL: https://issues.apache.org/jira/browse/HDFS-15071 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs, metrics >Reporter: Danny Becker >Assignee: Danny Becker >Priority: Minor > Attachments: HDFS-15071.000.patch, HDFS-15071.001.patch > > > Add DataNode throughput metrics for read and write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15072) HDFS MiniCluster fails to start when run in directory path with a %
Geoffrey Jacoby created HDFS-15072: -- Summary: HDFS MiniCluster fails to start when run in directory path with a % Key: HDFS-15072 URL: https://issues.apache.org/jira/browse/HDFS-15072 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.5 Environment: I encountered this on a Mac while running an HBase minicluster that was using Hadoop 2.7.5. However, the code looks the same in trunk so it likely affects most or all current versions. Reporter: Geoffrey Jacoby FsVolumeImpl.initializeCacheExecutor calls Guava's ThreadFactoryBuilder.setNameFormat, passing in the String representation of the parent File. Guava will take the String whole and pass it to String.format, which treats % as a special character. That means that if parent.toString() contains a percent sign followed by a character that is illegal as a format specifier in String.format(), you'll get an exception that stops the MiniCluster from starting up. I did not check to see if this would also happen on a normal DataNode daemon. initializeCacheExecutor should escape the parent file name before passing it in. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
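The interaction between a '%' in the path and String.format can be sketched directly. The path below is hypothetical, and the format-building only approximates what Guava's setNameFormat does internally (it appends a thread-counter placeholder and calls String.format):

```java
import java.util.IllegalFormatException;

public class PercentInThreadName {
    /** Builds a thread name the naive way: the raw path becomes part of the format string. */
    public static String buildName(String dirPath) {
        return String.format(dirPath + "-%d", 0);
    }

    /** Escapes '%' as '%%' first, so the path survives String.format unchanged. */
    public static String buildNameEscaped(String dirPath) {
        return String.format(dirPath.replace("%", "%%") + "-%d", 0);
    }

    public static void main(String[] args) {
        String dir = "/data/disk%1/current";  // hypothetical path containing '%'
        try {
            System.out.println(buildName(dir));
        } catch (IllegalFormatException e) {
            // "%1/" is parsed as a (broken) format specifier
            System.out.println("crashed: " + e.getClass().getSimpleName());
        }
        System.out.println(buildNameEscaped(dir));
    }
}
```

The escape in buildNameEscaped is the kind of fix the report suggests for initializeCacheExecutor.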
[jira] [Updated] (HDFS-15062) Add LOG when sendIBRs failed
[ https://issues.apache.org/jira/browse/HDFS-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15062: --- Attachment: HDFS-15062.003.patch > Add LOG when sendIBRs failed > > > Key: HDFS-15062 > URL: https://issues.apache.org/jira/browse/HDFS-15062 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15062.001.patch, HDFS-15062.002.patch, > HDFS-15062.003.patch > > > {code} > /** Send IBRs to namenode. */ > void sendIBRs(DatanodeProtocol namenode, DatanodeRegistration registration, > String bpid, String nnRpcLatencySuffix) throws IOException { > // Generate a list of the pending reports for each storage under the lock > final StorageReceivedDeletedBlocks[] reports = generateIBRs(); > if (reports.length == 0) { > // Nothing new to report. > return; > } > // Send incremental block reports to the Namenode outside the lock > if (LOG.isDebugEnabled()) { > LOG.debug("call blockReceivedAndDeleted: " + Arrays.toString(reports)); > } > boolean success = false; > final long startTime = monotonicNow(); > try { > namenode.blockReceivedAndDeleted(registration, bpid, reports); > success = true; > } finally { > if (success) { > dnMetrics.addIncrementalBlockReport(monotonicNow() - startTime, > nnRpcLatencySuffix); > lastIBR = startTime; > } else { > // If we didn't succeed in sending the report, put all of the > // blocks back onto our queue, but only in the case where we > // didn't put something newer in the meantime. > putMissing(reports); > } > } > } > {code} > When call namenode.blockReceivedAndDelete failed, will put reports to > pendingIBRs. Maybe we should add log for failed case. 
It is helpful for > troubleshooting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
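As a rough illustration of the proposed change, the sketch below mirrors the success-flag/finally pattern of the quoted sendIBRs with a log recorded in the failure branch. All names here (IbrLogDemo, the lastLog field, the boolean RPC stand-in) are simplified hypothetical stand-ins, not Hadoop APIs, and the actual patch may log differently:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Mirrors the sendIBRs pattern: on RPC failure, re-queue the report and
// (the proposed change) make the failure visible instead of staying silent.
public class IbrLogDemo {
    final Deque<String> pendingIBRs = new ArrayDeque<>();
    String lastLog = "";

    void sendIBRs(boolean rpcSucceeds, String report) {
        boolean success = false;
        try {
            if (!rpcSucceeds) {
                throw new RuntimeException("simulated blockReceivedAndDeleted failure");
            }
            success = true;
        } catch (RuntimeException e) {
            // Proposed addition: record why the send failed instead of
            // silently re-queueing -- this is what aids troubleshooting.
            lastLog = "Exception occurred while sending IBRs: " + e.getMessage();
        } finally {
            if (!success) {
                pendingIBRs.addLast(report);  // put the report back for the next attempt
            }
        }
    }

    public static void main(String[] args) {
        IbrLogDemo demo = new IbrLogDemo();
        demo.sendIBRs(false, "report-1");     // fails: re-queued and logged
        System.out.println(demo.lastLog);
        System.out.println("pending: " + demo.pendingIBRs.size());
        demo.sendIBRs(true, "report-2");      // succeeds: nothing queued
        System.out.println("pending: " + demo.pendingIBRs.size());
    }
}
```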
[jira] [Commented] (HDFS-15062) Add LOG when sendIBRs failed
[ https://issues.apache.org/jira/browse/HDFS-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999647#comment-16999647 ] Fei Hui commented on HDFS-15062: [~weichiu] [~elgoiri] I think adding more info is good. Uploaded v003 patch, which adds nnId and duration info. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-15069) DecommissionMonitor-0 thread will block forever when its scheduled timer task encounters an unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudong Cao resolved HDFS-15069. --- Resolution: Duplicate > DecommissionMonitor-0 thread will block forever when its scheduled timer > task encounters an unchecked exception. > > > Key: HDFS-15069 > URL: https://issues.apache.org/jira/browse/HDFS-15069 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Major > Attachments: stack_on_16_12.png, stack_on_16_42.png > > > More than once, we have observed that during the decommissioning of a large > number of DNs, the DecommissionMonitor-0 thread stops scheduling and blocks > for a long time, with no exception logs or notifications at all. > e.g. Recently, we were decommissioning 65 DNs at the same time, each DN about > 10 TB, and the DecommissionMonitor-0 thread blocked for about 15 days. > The stack of DecommissionMonitor-0 looks like this: > # stack on 2019.12.17 16:12 !stack_on_16_12.png! > # stack on 2019.12.17 16:42 !stack_on_16_42.png! > It can be seen that during this half hour the thread was not scheduled at > all; its Waited count did not change. > We think the cause of the problem is: > # The DecommissionMonitor task submitted by the NameNode encounters an > unchecked exception while running, and then this task is never executed > again. > # But the NameNode does not track the ScheduledFuture of this task and never > calls ScheduledFuture.get(), so the unchecked exception thrown by the task > just sits there, unnoticed. > After that, the subsequent phenomenon is: > # The ScheduledExecutorService thread DecommissionMonitor-0 blocks forever > in ThreadPoolExecutor.getTask(). 
> Possible solutions: > # Do not use a thread pool to execute the decommission monitor task; instead, introduce a dedicated thread for it, just like HeartbeatManager, > ReplicationMonitor, LeaseManager, BlockReportThread, and so on. > OR > # Catch all exceptions in the decommission monitor task's run() method > so that it does not throw any exceptions. > I prefer the second option. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15069) DecommissionMonitor-0 thread will block forever when its scheduled timer task encounters an unchecked exception.
[ https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999656#comment-16999656 ] Xudong Cao commented on HDFS-15069: --- Sorry for not noticing HDFS-12703; they are indeed the same issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
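The suppression behavior behind HDFS-15069 can be demonstrated in isolation: an unchecked exception thrown from a scheduleAtFixedRate task silently cancels all future runs, while a task that catches everything inside run() (option 2 in the report) keeps running. This is a generic JDK demo, not Hadoop code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ScheduledSuppressionDemo {

    // Runs two periodic tasks for ~500 ms and returns their run counts.
    static int[] runDemo() throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
        AtomicInteger bare = new AtomicInteger();
        AtomicInteger wrapped = new AtomicInteger();

        // Bare task: the unchecked exception on its first run silently
        // cancels every subsequent run -- the DecommissionMonitor-0 symptom.
        // Nothing surfaces unless someone calls ScheduledFuture.get().
        pool.scheduleAtFixedRate(() -> {
            bare.incrementAndGet();
            throw new RuntimeException("boom");
        }, 0, 20, TimeUnit.MILLISECONDS);

        // Wrapped task (option 2): run() catches everything itself, so the
        // schedule stays alive; real code would log the Throwable here.
        pool.scheduleAtFixedRate(() -> {
            try {
                wrapped.incrementAndGet();
                throw new RuntimeException("boom");
            } catch (Throwable t) {
                // log and continue in real code
            }
        }, 0, 20, TimeUnit.MILLISECONDS);

        Thread.sleep(500);
        pool.shutdownNow();
        return new int[] { bare.get(), wrapped.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counts = runDemo();
        System.out.println("bare task runs: " + counts[0]);     // stuck at 1
        System.out.println("wrapped task runs: " + counts[1]);  // keeps growing
    }
}
```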
[jira] [Commented] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics
[ https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999692#comment-16999692 ] Hadoop QA commented on HDFS-15071: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 41s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 26s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 47s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 609 unchanged - 0 fixed = 611 total (was 609) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 39s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}119m 11s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 41s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}183m 26s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestDeadNodeDetection | | | hadoop.tools.TestHdfsConfigFields | | | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:e573ea49085 | | JIRA Issue | HDFS-15071 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989142/HDFS-15071.000.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 986a74cb3d6d 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 7b93575 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28540/artifact/out/diff-checkstyle-hadoop
[jira] [Commented] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics
[ https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999703#comment-16999703 ] Hadoop QA commented on HDFS-15071: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 53s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 53s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 18s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 55s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 609 unchanged - 0 fixed = 613 total (was 609) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 24s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 1s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 13s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 23s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}117m 26s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}185m 9s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMXBean | | | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks | | | hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks | | | hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:e573ea49085 | | JIRA Issue | HDFS-15071 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989146/HDFS-15071.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux 95915b0d0c84 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/perso
[jira] [Commented] (HDFS-15062) Add LOG when sendIBRs failed
[ https://issues.apache.org/jira/browse/HDFS-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999720#comment-16999720 ] Hadoop QA commented on HDFS-15062: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 32s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 30s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 16s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 17s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 12s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 37s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}148m 35s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestDeadNodeDetection | | | hadoop.hdfs.TestMultipleNNPortQOP | | | hadoop.hdfs.TestReconstructStripedFile | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:e573ea49085 | | JIRA Issue | HDFS-15062 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989151/HDFS-15062.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 88a4ee862d58 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 7b93575 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28542/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28542/testReport/ | | Max. process+thread count | 3975 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs
[jira] [Commented] (HDFS-15062) Add LOG when sendIBRs failed
[ https://issues.apache.org/jira/browse/HDFS-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999723#comment-16999723 ] Wei-Chiu Chuang commented on HDFS-15062: LGTM -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15068) DataNode could deadlock if refreshVolumes is invoked during register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15068: --- Attachment: HDFS-15068.003.patch > DataNode could deadlock if refreshVolumes is invoked during register > --- > > Key: HDFS-15068 > URL: https://issues.apache.org/jira/browse/HDFS-15068 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Xiaoqiao He >Assignee: Aiphago >Priority: Critical > Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, > HDFS-15068.003.patch > > > DataNode could deadlock when `dfsadmin -reconfig datanode ip:host start` is > invoked to trigger #refreshVolumes. > 1. DataNode#refreshVolumes first holds the datanode instance's ownable > {{synchronizer}} on entering the method, then tries to hold the BPOfferService > {{readlock}} via `bpos.getNamespaceInfo()` in the following code segment. > {code:java} > for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) { > nsInfos.add(bpos.getNamespaceInfo()); > } > {code} > 2. BPOfferService#registrationSucceeded (which is invoked by #register when > the DataNode starts, or by #reregister from processCommandFromActor) holds the > BPOfferService {{writelock}} first, then tries to hold the datanode instance's > ownable {{synchronizer}} in the following method. > {code:java} > synchronized void bpRegistrationSucceeded(DatanodeRegistration > bpRegistration, > String blockPoolId) throws IOException { > id = bpRegistration; > if(!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) { > throw new IOException("Inconsistent Datanode IDs. Name-node returned " > + bpRegistration.getDatanodeUuid() > + ". Expecting " + storage.getDatanodeUuid()); > } > > registerBlockPoolWithSecretManager(bpRegistration, blockPoolId); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15068) DataNode could deadlock if refreshVolumes is invoked during register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999736#comment-16999736 ] Aiphago commented on HDFS-15068: Hi [~hexiaoqiao], thanks for the valuable advice; I've renewed the patch. [^HDFS-15068.003.patch] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
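The HDFS-15068 report describes a classic ABBA deadlock: refreshVolumes takes the DataNode monitor then the BPOfferService lock, while registration takes them in the opposite order. A generic sketch of the standard remedy is to make both paths acquire the two locks in the same global order. The lock names below are illustrative stand-ins; the actual patch may restructure the code differently:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderDemo {
    static final ReentrantLock instanceLock = new ReentrantLock();                  // stand-in for the DataNode monitor
    static final ReentrantReadWriteLock serviceLock = new ReentrantReadWriteLock(); // stand-in for the BPOfferService lock
    static final AtomicInteger completed = new AtomicInteger();

    // refreshVolumes-like path: instance lock first, then service read lock.
    static void refreshVolumesPath() {
        instanceLock.lock();
        try {
            serviceLock.readLock().lock();
            try { completed.incrementAndGet(); }          // e.g. read namespace info
            finally { serviceLock.readLock().unlock(); }
        } finally { instanceLock.unlock(); }
    }

    // registration-like path, reordered to the SAME order; the original
    // service-lock-then-instance-lock order is what allows the deadlock.
    static void registerPath() {
        instanceLock.lock();
        try {
            serviceLock.writeLock().lock();
            try { completed.incrementAndGet(); }          // e.g. record registration
            finally { serviceLock.writeLock().unlock(); }
        } finally { instanceLock.unlock(); }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) refreshVolumesPath(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) registerPath(); });
        t1.start(); t2.start();
        t1.join(); t2.join();   // with opposite lock orders this join could hang forever
        System.out.println("completed ops: " + completed.get());
    }
}
```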
[jira] [Created] (HDFS-15073) Remove the usage of curator-shaded guava in SnapshotDiffReportListing.java
Akira Ajisaka created HDFS-15073: Summary: Remove the usage of curator-shaded guava in SnapshotDiffReportListing.java Key: HDFS-15073 URL: https://issues.apache.org/jira/browse/HDFS-15073 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Reporter: Akira Ajisaka In SnapshotDiffReportListing.java, {code} import org.apache.curator.shaded.com.google.common.base.Preconditions; {code} should be {code} import com.google.common.base.Preconditions; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15000: --- Attachment: HDFS-15000.001.patch > Improve FsDatasetImpl to avoid IO operation in datasetLock > -- > > Key: HDFS-15000 > URL: https://issues.apache.org/jira/browse/HDFS-15000 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Xiaoqiao He >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15000.001.patch > > > As HDFS-14997 mentioned, some methods in #FsDatasetImpl, such as > #finalizeBlock, #finalizeReplica, and #createRbw, perform IO operations inside > the datasetLock. This can block other logic when the IO load is very high. We > should reduce the lock granularity or move the IO operations out of the > datasetLock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999776#comment-16999776 ] Aiphago commented on HDFS-15000: Submitted a demo patch. The main idea is to perform the IO operations (which may have ordering dependencies) asynchronously, outside the lock, while keeping them in order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999776#comment-16999776 ]

Aiphago edited comment on HDFS-15000 at 12/19/19 6:32 AM:
--

Submitted a demo patch. The main idea is to perform the IO operations (which
may have ordering dependencies) asynchronously, outside the lock, while still
keeping those operations in order. Any suggestions or problems?

was (Author: aiphag0):
Submitted a demo patch. The main idea is to perform the IO operations (which
may have ordering dependencies) asynchronously, outside the lock, while still
keeping those operations in order.
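The "asynchronous but still ordered" idea in the comment above can be sketched
as follows. This is a minimal illustration under assumed names
(`OrderedIoExecutor`, `submitIo` are hypothetical), not the actual HDFS-15000
patch: a single-threaded executor serializes IO operations in submission order,
so order-dependent operations stay ordered even though the caller no longer
runs them while holding the dataset lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one single-threaded executor runs IO operations
// strictly in submission order, so operations with ordering dependencies
// stay ordered while the submitter holds no lock during the IO itself.
class OrderedIoExecutor {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<String> log = new ArrayList<>();

    // Submit an IO operation; the single worker guarantees FIFO execution.
    Future<?> submitIo(String name) {
        return worker.submit(() -> {
            synchronized (log) { log.add(name); } // stands in for real disk IO
        });
    }

    // Wait for all submitted operations and return their execution order.
    List<String> drain() throws Exception {
        worker.shutdown();
        worker.awaitTermination(5, TimeUnit.SECONDS);
        synchronized (log) { return new ArrayList<>(log); }
    }

    public static void main(String[] args) throws Exception {
        OrderedIoExecutor io = new OrderedIoExecutor();
        io.submitIo("createRbw");
        io.submitIo("finalizeReplica");
        io.submitIo("finalizeBlock");
        // Execution order matches submission order.
        System.out.println(io.drain()); // prints [createRbw, finalizeReplica, finalizeBlock]
    }
}
```

The real patch additionally has to keep `volumeMap` metadata consistent if an
asynchronous IO task fails, which is what the rollback discussion below is
about.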
[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999783#comment-16999783 ]

Xiaoqiao He commented on HDFS-15068:
------------------------------------

Thanks [~Aiphag0] for your work, v003 LGTM. Let's wait and see what Jenkins
says. Ping [~weichiu], [~elgoiri], [~iwasakims], would you mind taking another
review?

> DataNode could meet deadlock if invoke refreshVolumes when register
> ---
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch,
> HDFS-15068.003.patch
>
> DataNode can meet a deadlock when `dfsadmin -reconfig datanode ip:host
> start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes first holds the DataNode instance's intrinsic
> lock ({{synchronized}}) when entering the method, then tries to acquire the
> BPOfferService {{readlock}} via `bpos.getNamespaceInfo()` in the following
> code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#bpRegistrationSucceeded (invoked by #register when the
> DataNode starts, or by #reregister from processCommandFromActor) first holds
> the BPOfferService {{writelock}}, then tries to acquire the DataNode
> instance's intrinsic lock in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
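The two acquisition paths above take the same pair of locks in opposite orders,
which is the classic lock-ordering deadlock. The sketch below is a hypothetical
reduction of that shape (the class and method names are illustrative, not the
real HDFS code); it shows one common mitigation, taking the second lock with
`tryLock` plus a timeout so the thread can back off instead of blocking
forever:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical reduction of the inverted lock order: path 1 takes the
// object monitor then the read lock; path 2 takes the write lock then the
// monitor. tryLock with a timeout on path 1 breaks the circular wait.
class LockOrderSketch {
    private final Object datanodeMonitor = new Object();
    private final ReentrantReadWriteLock bposLock = new ReentrantReadWriteLock();

    // Path 1 (shape of refreshVolumes): monitor -> read lock.
    boolean refreshVolumes() throws InterruptedException {
        synchronized (datanodeMonitor) {
            if (bposLock.readLock().tryLock(100, TimeUnit.MILLISECONDS)) {
                try { return true; } finally { bposLock.readLock().unlock(); }
            }
            return false; // back off instead of blocking while holding the monitor
        }
    }

    // Path 2 (shape of bpRegistrationSucceeded): write lock -> monitor.
    void register() {
        bposLock.writeLock().lock();
        try {
            synchronized (datanodeMonitor) { /* registration work */ }
        } finally { bposLock.writeLock().unlock(); }
    }

    public static void main(String[] args) throws Exception {
        LockOrderSketch s = new LockOrderSketch();
        Thread t = new Thread(() -> s.register());
        t.start();
        boolean refreshed = s.refreshVolumes(); // may back off, never deadlocks
        t.join(5000);
        // With plain lock() on both paths, t could block forever instead.
        System.out.println("register done=" + !t.isAlive());
    }
}
```

The actual HDFS-15068 patch instead restructures the code so both paths
acquire the locks in a consistent order, which removes the cycle entirely.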
[jira] [Commented] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999789#comment-16999789 ]

Xiaoqiao He commented on HDFS-15000:
------------------------------------

Thanks [~Aiphag0], it's a very good start. Some suggestions:
a. Please rebase against the current codebase, then submit a new patch.
b. We should consider how to roll back the meta information in {{volumeMap}}
if an IoTasker run fails.
c. Since these are core changes, we should add enough unit tests to cover
them.
[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
[ https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999792#comment-16999792 ]

Xiaoqiao He commented on HDFS-15051:
------------------------------------

Hi [~elgoiri], [~ayushtkn], [~weichiu], any more suggestions here?

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: rbf
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch,
> HDFS-15051.003.patch, HDFS-15051.004.patch
>
> The current permission checker of #MountTableStoreImpl is not very strict.
> In some cases, any user can add/update/remove a MountTableEntry without the
> expected permission check.
> The following code segment tries to check permissions when operating on a
> MountTableEntry. However, the mountTable object comes from the
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), so a
> user can pass in an arbitrary mode and bypass the permission checker.
> {code:java}
> public void checkPermission(MountTable mountTable, FsAction access)
>     throws AccessControlException {
>   if (isSuperUser()) {
>     return;
>   }
>   FsPermission mode = mountTable.getMode();
>   if (getUser().equals(mountTable.getOwnerName())
>       && mode.getUserAction().implies(access)) {
>     return;
>   }
>   if (isMemberOfGroup(mountTable.getGroupName())
>       && mode.getGroupAction().implies(access)) {
>     return;
>   }
>   if (!getUser().equals(mountTable.getOwnerName())
>       && !isMemberOfGroup(mountTable.getGroupName())
>       && mode.getOtherAction().implies(access)) {
>     return;
>   }
>   throw new AccessControlException(
>       "Permission denied while accessing mount table "
>           + mountTable.getSourcePath()
>           + ": user " + getUser() + " does not have " + access.toString()
>           + " permissions.");
> }
> {code}
> I just propose to revoke the WRITE MountTableEntry privilege to the super
> user only.
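The bypass described above follows from the permission bits being read from
the client-supplied entry itself. The sketch below is a deliberately
simplified, hypothetical stand-in (the types and the `canWrite` helper are not
the real RBF API) that shows why such a check is not enforceable: the caller
forges a wide-open mode and always passes.

```java
// Hypothetical, simplified stand-in for the checker above: the mode comes
// from the client-supplied entry, so it is attacker-controlled.
class MountEntry {
    final String owner;
    final int mode; // unix-style permission bits, e.g. 0777

    MountEntry(String owner, int mode) { this.owner = owner; this.mode = mode; }
}

class PermissionSketch {
    // Simplified check in the spirit of checkPermission: trusts entry.mode.
    static boolean canWrite(MountEntry entry, String user) {
        if (user.equals("superuser")) return true;
        if (user.equals(entry.owner)) return (entry.mode & 0200) != 0;
        return (entry.mode & 02) != 0; // "others" write bit, forgeable
    }

    public static void main(String[] args) {
        // An unprivileged user submits an entry claiming mode 0777 and passes.
        MountEntry forged = new MountEntry("admin", 0777);
        System.out.println("bypass=" + canWrite(forged, "mallory")); // prints bypass=true
    }
}
```

Restricting WRITE to the superuser, as proposed, sidesteps the problem because
the decision no longer depends on any client-supplied field.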
[jira] [Commented] (HDFS-14997) BPServiceActor process command from NameNode asynchronously
[ https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999809#comment-16999809 ]

Xiaoqiao He commented on HDFS-14997:
------------------------------------

Some more information from our production cluster: we have deployed this
feature for nearly one month, and we have not seen any DataNode heartbeat
exceptions, or DataNodes marked dead for missing heartbeats for more than
630s (the default), caused by waiting a long time for command processing to
finish. (The main reason is that some commands need to wait a long time for
the {{datasetLock}} when IO load is very high.) Furthermore, we have also made
the BlockReport asynchronous in our internal version, and the gain was very
obvious. If there is any interest, I would like to push it forward in a
follow-up JIRA.

> BPServiceActor process command from NameNode asynchronously
> ---
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch,
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
> There are two core functions, report (#sendHeartbeat, #blockReport,
> #cacheReport) and #processCommand, in the #BPServiceActor main process flow.
> If processCommand takes a long time, it blocks the report flow. Meanwhile,
> processCommand can take a long time (over 1000s in the worst case I have
> seen) when the IO load of the DataNode is very high. Since some IO
> operations are under #datasetLock, processing some commands (such as
> #DNA_INVALIDATE) has to wait a long time to acquire #datasetLock. In such
> cases, the #heartbeat is not sent to the NameNode in time, which triggers
> other disasters.
> I propose to run #processCommand asynchronously so that #BPServiceActor is
> not blocked from sending heartbeats back to the NameNode under high IO load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should
> solve that in another JIRA.
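The proposal above can be sketched with a background worker that drains
NameNode commands off the heartbeat path. This is a minimal, hypothetical
illustration (the class and method names are invented, not the HDFS-14997
patch): the heartbeat loop enqueues each command and returns immediately, so a
command stuck waiting for the dataset lock cannot delay the next heartbeat.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the actor's heartbeat loop hands NameNode commands
// to a single background worker instead of executing them inline.
class AsyncCommandSketch {
    private final ExecutorService commandWorker = Executors.newSingleThreadExecutor();
    int heartbeats = 0;

    // Called from the heartbeat loop: enqueue the command and return at once.
    void processCommandAsync(Runnable command) {
        commandWorker.submit(command);
    }

    void heartbeatOnce(Runnable commandFromNameNode) {
        heartbeats++;                             // heartbeat is sent promptly
        processCommandAsync(commandFromNameNode); // slow work is deferred
    }

    public static void main(String[] args) throws Exception {
        AsyncCommandSketch actor = new AsyncCommandSketch();
        CountDownLatch done = new CountDownLatch(3);
        for (int i = 0; i < 3; i++) {
            actor.heartbeatOnce(() -> {
                try { Thread.sleep(50); } catch (InterruptedException ignored) { }
                done.countDown(); // stands in for a command that waits on datasetLock
            });
        }
        // All three heartbeats completed without waiting on the slow commands.
        System.out.println("heartbeats=" + actor.heartbeats);
        done.await(5, TimeUnit.SECONDS);
        actor.commandWorker.shutdown();
    }
}
```

A real implementation also needs bounded queueing and error reporting for
failed commands, which the patches on this issue address.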
[jira] [Updated] (HDFS-14740) Recover data blocks from persistent memory read cache during datanode restarts
[ https://issues.apache.org/jira/browse/HDFS-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feilong He updated HDFS-14740:
--
Attachment: HDFS-14740-branch-3.1-000.patch

> Recover data blocks from persistent memory read cache during datanode restarts
> --
>
> Key: HDFS-14740
> URL: https://issues.apache.org/jira/browse/HDFS-14740
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: caching, datanode
> Reporter: Feilong He
> Assignee: Feilong He
> Priority: Major
> Attachments: HDFS-14740-branch-3.1-000.patch, HDFS-14740.000.patch,
> HDFS-14740.001.patch, HDFS-14740.002.patch, HDFS-14740.003.patch,
> HDFS-14740.004.patch, HDFS-14740.005.patch, HDFS-14740.006.patch,
> HDFS-14740.007.patch, HDFS-14740.008.patch,
> HDFS_Persistent_Read-Cache_Design-v1.pdf,
> HDFS_Persistent_Read-Cache_Test-v1.1.pdf,
> HDFS_Persistent_Read-Cache_Test-v1.pdf, HDFS_Persistent_Read-Cache_Test-v2.pdf
>
> In HDFS-13762, persistent memory (PM) is enabled in HDFS centralized cache
> management. Even though PM can persist cache data, to simplify the initial
> implementation, the previous cache data is cleaned up during DataNode
> restarts. Here, we propose to improve the HDFS PM cache by taking advantage
> of PM's data-persistence characteristic, i.e., recovering the status of
> cached data, if any, when the DataNode restarts; thus cache warm-up time can
> be saved for users.
[jira] [Updated] (HDFS-14740) Recover data blocks from persistent memory read cache during datanode restarts
[ https://issues.apache.org/jira/browse/HDFS-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feilong He updated HDFS-14740:
--
Attachment: HDFS-14740-branch-3.2-000.patch