[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162330#comment-16162330 ]

Weiwei Yang commented on HDFS-12098:
------------------------------------

Hi [~anu], [~vagarychen]

Thanks for revisiting this. I could not reproduce it on the latest code base either; it looks like it was fixed by some other patches. This no longer seems to be a valid issue, so I think we can close it. Thanks for spending time trying to reproduce this.

> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
>                 Key: HDFS-12098
>                 URL: https://issues.apache.org/jira/browse/HDFS-12098
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, ozone, scm
>    Affects Versions: HDFS-7240
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>              Labels: ozoneMerge
>             Fix For: HDFS-7240
>
>         Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch, HDFS-12098-HDFS-7240.002.patch, HDFS-12098-HDFS-7240.testcase-1.patch, HDFS-12098-HDFS-7240.testcase.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, thread_dump.log
>
>
> Reproducing steps
> 1. Start namenode
> {{./bin/hdfs --daemon start namenode}}
> 2. Start datanode
> {{./bin/hdfs datanode}}
> The datanode log will show the following connection retries
> {noformat}
> 17/07/13 21:16:48 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:49 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:50 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:51 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> {noformat}
> This is expected because scm is not started yet.
> 3. Start scm
> {{./bin/hdfs scm}}
> The datanode is expected to register with this scm, and the following log is expected in scm
> {noformat}
> 17/07/13 21:22:30 INFO node.SCMNodeManager: Data node with ID: af22862d-aafa-4941-9073-53224ae43e2c Registered.
> {noformat}
> but I did *NOT* see this log. (_I debugged into the code and found that the datanode state transitioned to SHUTDOWN unexpectedly because of leaked threads: each of those threads set the state to its next value, and they all ended up setting it to SHUTDOWN_)
> 4. Create a container from scm CLI
> {{./bin/hdfs scm -container -create -c 20170714c0}}
> This fails with the following exception
> {noformat}
> Creating container : 20170714c0.
> Error executing command:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.scm.exceptions.SCMException): Unable to create container while in chill mode
>   at org.apache.hadoop.ozone.scm.container.ContainerMapping.allocateContainer(ContainerMapping.java:241)
>   at org.apache.hadoop.ozone.scm.StorageContainerManager.allocateContainer(StorageContainerManager.java:392)
>   at org.apache.hadoop.ozone.protocolPB.StorageContainerLocationProtocolServerSideTranslatorPB.allocateContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:73)
> {noformat}
> The datanode was not registered to scm, so scm is still in chill mode.
> *Note*: if scm is started first, there is no such issue, and I can create a container from the CLI without any problem.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
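The debugging note in the issue description (leaked threads each setting the state to its next value until it reaches SHUTDOWN) can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: the class `StateRaceSketch`, its `State` enum, and both methods are illustrative names and are not taken from the Ozone code base or from any attached patch. The guarded variant is only one possible way to avoid the over-advance; it is not claimed to be the fix that was actually applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not Ozone code: shows how several leftover tasks that
// each "move to the next state" can push a state machine past RUNNING.
class StateRaceSketch {

    enum State {
        INIT, RUNNING, SHUTDOWN;

        // SHUTDOWN is terminal; every other state advances by one step.
        State next() {
            return this == SHUTDOWN ? SHUTDOWN : values()[ordinal() + 1];
        }
    }

    // Every task advances the state unconditionally, so two or more tasks
    // starting from INIT always end in SHUTDOWN.
    static State runLeaky(int tasks) {
        AtomicReference<State> state = new AtomicReference<>(State.INIT);
        runAll(tasks, () -> state.updateAndGet(State::next));
        return state.get();
    }

    // Each task only performs the INIT -> RUNNING transition it expects,
    // so extra tasks are harmless (assumed mitigation, for illustration).
    static State runGuarded(int tasks) {
        AtomicReference<State> state = new AtomicReference<>(State.INIT);
        runAll(tasks, () -> state.compareAndSet(State.INIT, State.RUNNING));
        return state.get();
    }

    // Starts the given number of threads and waits for all of them.
    private static void runAll(int tasks, Runnable body) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            Thread t = new Thread(body);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("leaky:   " + runLeaky(3));
        System.out.println("guarded: " + runGuarded(3));
    }
}
```

With three tasks, the unconditional variant ends in SHUTDOWN while the guarded one stays in RUNNING, which matches the symptom described above in spirit only.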
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161994#comment-16161994 ]

Anu Engineer commented on HDFS-12098:
-------------------------------------

@weiwei yang, I was talking to [~vagarychen] offline and he was thinking this works for him. Would you be able to cross check if this is still broken?
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091052#comment-16091052 ]

Weiwei Yang commented on HDFS-12098:
------------------------------------

Oh [~anu], no problem at all. Thanks for your quick reply.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091035#comment-16091035 ]

Anu Engineer commented on HDFS-12098:
-------------------------------------

[~cheersyang] Sorry, I have not gotten to this yet. I will take a look at this soon. I have been trying to clear up the code review backlogs.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091004#comment-16091004 ]

Weiwei Yang commented on HDFS-12098:
------------------------------------

Hi [~anu]

Have you tried to reproduce this issue or apply the test case patch I uploaded to take a look at the issue? Please let me know, thanks.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089723#comment-16089723 ]

Hadoop QA commented on HDFS-12098:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 3s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} HDFS-7240 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 27s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 2s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 58s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 52s{color} | {color:green} HDFS-7240 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 39s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 154 unchanged - 0 fixed = 155 total (was 154) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 8s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}101m 4s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
| | hadoop.ozone.TestMiniOzoneCluster |
| | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
| | hadoop.ozone.container.replication.TestContainerReplicationManager |
| | hadoop.ozone.container.ozoneimpl.TestOzoneContainer |
| | hadoop.ozone.TestStorageContainerManager |
| | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
| Timed out junit tests | org.apache.hadoop.ozone.container.ozoneimpl.TestRatisManager |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12877553/HDFS-12098-HDFS-7240.testcase-1.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 6ea999e772d2 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 1bec6a1 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/20304/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/20304/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/20304/testReport/ |
| modules
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089336#comment-16089336 ]

Hadoop QA commented on HDFS-12098:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} HDFS-7240 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 39s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 59s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 52s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green} HDFS-7240 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 36s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 16 new + 154 unchanged - 0 fixed = 170 total (was 154) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 0s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 23s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}102m 13s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
| | Inconsistent synchronization of org.apache.hadoop.hdfs.server.datanode.DataNode.datanodeStateMachine; locked 42% of time Unsynchronized access at DataNode.java:42% of time Unsynchronized access at DataNode.java:[line 3228] |
| Failed junit tests | hadoop.ozone.container.replication.TestContainerReplicationManager |
| | hadoop.ozone.TestMiniOzoneCluster |
| | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
| | hadoop.ozone.TestStorageContainerManager |
| Timed out junit tests | org.apache.hadoop.hdfs.TestLeaseRecovery2 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12877520/HDFS-12098-HDFS-7240.testcase.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux a4fe1c2f42ae 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 1bec6a1 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/20299/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| findbugs |
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089283#comment-16089283 ]

Weiwei Yang commented on HDFS-12098:

Attached a test case patch to reproduce this issue. Please take a look at [^HDFS-12098-HDFS-7240.testcase.patch]. The patch simulates the following scenario:
# Start the mini ozone cluster without starting scm
# The datanode is unable to register with scm
# Start scm and wait for the datanode to register
# After waiting a while, the datanode is still unable to register with scm

If you apply this patch, the test will fail. You might have noticed that the patch changes more code than just adding a test; that is for the reason I mentioned earlier. I have also added a method to check whether a datanode is registered with scm, so that we can check the datanode state even when scm is not started. I also have a patch that fixes this; with that patch applied, the test passes. I am ready to share that as well. Thanks

> Ozone: Datanode is unable to register with scm if scm starts later
> --
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, ozone, scm
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Critical
> Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch, HDFS-12098-HDFS-7240.002.patch, HDFS-12098-HDFS-7240.testcase.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, thread_dump.log
>
>
> Reproducing steps
> 1. Start namenode
> {{./bin/hdfs --daemon start namenode}}
> 2. Start datanode
> {{./bin/hdfs datanode}}
> will see following connection issues
> {noformat}
> 17/07/13 21:16:48 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:49 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:50 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:51 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> {noformat}
> this is expected because scm is not started yet
> 3. Start scm
> {{./bin/hdfs scm}}
> expecting datanode can register to this scm, expecting the log in scm
> {noformat}
> 17/07/13 21:22:30 INFO node.SCMNodeManager: Data node with ID: af22862d-aafa-4941-9073-53224ae43e2c Registered.
> {noformat}
> but did *NOT* see this log. (_I debugged into the code and found the datanode state was transited SHUTDOWN unexpectedly because the thread leaks, each of those threads counted to set to next state and they all set to SHUTDOWN state_)
> 4. Create a container from scm CLI
> {{./bin/hdfs scm -container -create -c 20170714c0}}
> this fails with following exception
> {noformat}
> Creating container : 20170714c0.
> Error executing command:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.scm.exceptions.SCMException): Unable to create container while in chill mode
> at org.apache.hadoop.ozone.scm.container.ContainerMapping.allocateContainer(ContainerMapping.java:241)
> at org.apache.hadoop.ozone.scm.StorageContainerManager.allocateContainer(StorageContainerManager.java:392)
> at org.apache.hadoop.ozone.protocolPB.StorageContainerLocationProtocolServerSideTranslatorPB.allocateContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:73)
> {noformat}
> datanode was not registered to scm, thus it's still in chill mode.
> *Note*, if we start scm first, there is no such issue, I can create container from CLI without any problem.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
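The scenario in the test case comment above (start the datanode first, start scm later, then wait for the datanode to register) comes down to polling a condition with a timeout. Below is a minimal sketch of such a polling helper in plain Java. It is not the code from the attached patch: the class name {{RegistrationWait}}, the method {{waitFor}} and the {{registered}} flag are all hypothetical stand-ins for whatever "is the datanode registered with scm" check the patch adds.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class RegistrationWait {

  /**
   * Polls {@code check} every {@code intervalMillis} until it returns true
   * or {@code timeoutMillis} has elapsed.
   *
   * @return true if the condition was met, false on timeout or interruption
   */
  public static boolean waitFor(BooleanSupplier check, long intervalMillis, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      if (check.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        // Preserve the interrupt for the caller and stop waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Hypothetical stand-in for "the datanode is registered with scm".
    AtomicBoolean registered = new AtomicBoolean(false);

    // Simulates scm coming up late and accepting the registration.
    Thread lateScm = new Thread(() -> {
      try {
        Thread.sleep(200);
        registered.set(true);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    lateScm.start();

    System.out.println("registered in time: " + waitFor(registered::get, 50, 2000));
  }
}
```

In a test for this issue, the condition would never become true without the fix, so the helper would return false after the timeout and the assertion on its result would fail.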
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088417#comment-16088417 ]

Hadoop QA commented on HDFS-12098:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} HDFS-7240 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 34s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 52s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 52s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 53s{color} | {color:green} HDFS-7240 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 35s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 5 new + 154 unchanged - 0 fixed = 159 total (was 154) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 58s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 35s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 93m 10s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
| | Inconsistent synchronization of org.apache.hadoop.hdfs.server.datanode.DataNode.datanodeStateMachine; locked 42% of time. Unsynchronized access at DataNode.java:[line 3228] |
| Failed junit tests | hadoop.ozone.TestMiniOzoneCluster |
| | hadoop.hdfs.qjournal.client.TestQuorumJournalManager |
| | hadoop.ozone.container.replication.TestContainerReplicationManager |
| | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010 |
| | hadoop.ozone.TestOzoneConfigurationFields |
| Timed out junit tests | org.apache.hadoop.ozone.container.ozoneimpl.TestRatisManager |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12877433/HDFS-12098-HDFS-7240.testcase.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 719ca50388a4 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 90f1d58 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/20284/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| findbugs |
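The FindBugs row in the report above flags inconsistent synchronization of {{DataNode.datanodeStateMachine}}: the field is accessed under a lock in some places and without one in others. The sketch below shows the general shape of that pattern and one common way to resolve it. It is an illustration only, not the DataNode code; the class {{StateMachineHolder}} and its methods are hypothetical.

```java
public class StateMachineHolder {

  // Hypothetical stand-in for a field such as datanodeStateMachine.
  private Object stateMachine;

  /** Writes the field while holding the object's monitor. */
  public synchronized void start(Object machine) {
    this.stateMachine = machine;
  }

  /**
   * Reads the field WITHOUT the monitor. Mixing this with the synchronized
   * writer above is the kind of access a static checker reports as
   * inconsistent synchronization: another thread may not see the latest write.
   */
  public Object unsafeGet() {
    return stateMachine;
  }

  /** One possible fix: take the same monitor on the read path as well. */
  public synchronized Object get() {
    return stateMachine;
  }

  public static void main(String[] args) {
    StateMachineHolder holder = new StateMachineHolder();
    holder.start("state-machine");
    System.out.println(holder.get());
  }
}
```

Declaring the field {{volatile}} is another option when only visibility of the reference matters and no compound update is involved.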
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088411#comment-16088411 ]

Weiwei Yang commented on HDFS-12098:

Please hold off on looking at the test patch, it still has some problems. Working on a new one :P
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088382#comment-16088382 ]

Weiwei Yang commented on HDFS-12098:

Hi [~anu]

I just uploaded a test case patch to reproduce this problem from a UT. I revised some of the code for how scm is started in MiniOzoneCluster, ensuring that the scm constructor is only called when scm is started. With this change, I could reproduce the same issue I was seeing on a real setup. Please take a look, and if you agree with the problem I described, we can then look at the fix. Thank you.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086834#comment-16086834 ]

Anu Engineer commented on HDFS-12098:

Thank you for the detailed repro steps, I will look at this tomorrow.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086828#comment-16086828 ]

Weiwei Yang commented on HDFS-12098:

Hi [~anu]

bq. How do you start SCM, I always do bin/hdfs start scm or --daemon start scm. Do you do it differently ?

No, same. I realized the reproducing steps in the description were not clear, sorry about that. I just added some more details about the issue itself and how to reproduce it, please take a look. I'll work on reproducing this from a UT as well. Thank you.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086814#comment-16086814 ]

Anu Engineer commented on HDFS-12098:

From the description of the problem:

bq. Start SCM, expecting datanode could connect to the scm and the state machine could transit to RUNNING. However in actual, its state transits to SHUTDOWN, datanode enters chill mode.

How do you start SCM? I always do bin/hdfs start scm or --daemon start scm. Do you do it differently? Anyway, I will try to debug this in a cluster.

> Ozone: Datanode is unable to register with scm if scm starts later
> --
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, ozone, scm
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Critical
> Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch, HDFS-12098-HDFS-7240.002.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and see datanode state, it has connection issues, this is expected
> # Start SCM, expecting datanode could connect to the scm and the state machine could transit to RUNNING. However in actual, its state transits to SHUTDOWN, datanode enters chill mode.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086658#comment-16086658 ]

Weiwei Yang commented on HDFS-12098:

Hi [~anu]

bq. Looks like the main does call, SCM constructor ...

The main method you pasted is only called if scm is started. In bin/hdfs:
{code}
scm)
  HADOOP_CLASSNAME='org.apache.hadoop.ozone.scm.StorageContainerManager'
  ...
{code}
If I don't start scm (which is how I reproduce this issue in the description), main won't be called, and the port will not be bound. That's what I meant. Thanks
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086064#comment-16086064 ]

Anu Engineer commented on HDFS-12098:

[~cheersyang] Sorry to be so dense. I am not sure I understand what this means:

bq. However, in a real cluster environment. Scm constructor will not be called, so the port will not be bound.

It looks like main does call the SCM constructor:
{code}
  /**
   * Main entry point for starting StorageContainerManager.
   *
   * @param argv arguments
   * @throws IOException if startup fails due to I/O error
   */
  public static void main(String[] argv) throws IOException {
    StringUtils.startupShutdownMessage(StorageContainerManager.class,
        argv, LOG);
    try {
      StorageContainerManager scm = new StorageContainerManager(
          new OzoneConfiguration());
      scm.start();
      scm.join();
    } catch (Throwable t) {
      LOG.error("Failed to start the StorageContainerManager.", t);
      terminate(1, t);
    }
  }
{code}
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086056#comment-16086056 ]

Anu Engineer commented on HDFS-12098:

[~cheersyang] Thanks for the analysis. I would love it if MiniOzoneCluster were able to simulate issues seen in the real cluster. If we are able to reproduce real-cluster issues using MiniOzoneCluster, then it is a real win for us. Let me take a look at this; I am hoping the changes you are suggesting for SCM are not too complex to simulate this in MiniOzoneCluster.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085463#comment-16085463 ]

Weiwei Yang commented on HDFS-12098:

Ah, found the difference after hours of debugging ... it's not that easy to reproduce this from the mini cluster. Let me explain: the behavior differs between the mini cluster and a real cluster setup.

*Mini Cluster*

In class {{MiniOzoneCluster}}, we initialize SCM like this:
{code}
StorageContainerManager scm = new StorageContainerManager(conf);
if (!disableSCM) {
  // start SCM if it is not disabled.
  scm.start();
}
{code}
The scm constructor initializes the scm datanode and client RPC servers. During this initialization, {{RPC.Builder(conf)...build()}} binds the RPC server to the specified port. Once the port is bound, subsequent client RPC calls, e.g.
{code}
SCMVersionResponseProto versionResponse =
    rpcEndPoint.getEndPoint().getVersion(null);
{code}
will connect to that port and try to read data. However, the service is not responding, so the call gets a {{SocketTimeout}}.

*Real Cluster*

However, in a real cluster environment, the scm constructor is not called while scm is not started, so the port is not bound. When the RPC client tries to connect to that port, it gets a {{connection refused}} error. This error is caught and triggers the RetryPolicy; that's where I saw the 10 retries, which cause this problem (thread leak).

I am not sure it is worth fixing this in the mini cluster; that would probably require refactoring the SCM constructor to move the RPC init code out. The issue can be easily reproduced in a cluster setup by following the steps in the description. Please kindly advise. Thanks.
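The analysis above distinguishes two failure modes for the client: a port that is bound but never answers (read timeout) and a port that is not bound at all (connection refused). The following is a small sketch of that difference using plain {{java.net}} sockets rather than Hadoop RPC. The class {{BoundVersusUnboundPort}} and its methods are hypothetical, and the outcome labels assume typical loopback behavior where the OS completes the TCP handshake for a listening socket even if the application never accepts.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class BoundVersusUnboundPort {

  /** Connects to the loopback port, tries to read one byte, and reports how the attempt ended. */
  public static String probe(int port, int timeoutMillis) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(), port), timeoutMillis);
      socket.setSoTimeout(timeoutMillis);
      int firstByte = socket.getInputStream().read();
      return firstByte < 0 ? "closed" : "data";
    } catch (SocketTimeoutException e) {
      // Raised by a connect timeout or, as expected here, by a read timeout.
      return "timeout";
    } catch (ConnectException e) {
      // Nothing is listening on the port.
      return "refused";
    } catch (IOException e) {
      return "io-error";
    }
  }

  /** Probes a port once while a silent listener holds it and once after the listener is closed. */
  public static String[] demo() {
    int port;
    String whileBound;
    try (ServerSocket listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
      port = listener.getLocalPort();
      // The listener is bound but never calls accept() or writes anything,
      // similar to a server object that was constructed but not started.
      whileBound = probe(port, 1000);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The listener is closed now, similar to a server that was never created.
    String afterClose = probe(port, 1000);
    return new String[] {whileBound, afterClose};
  }

  public static void main(String[] args) {
    String[] outcomes = demo();
    System.out.println("bound but silent: " + outcomes[0]);
    System.out.println("not bound:        " + outcomes[1]);
  }
}
```

Only the second outcome is a connection-level failure, which matches the observation above that the retry policy was triggered in the real cluster but not in the mini cluster.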
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083334#comment-16083334 ]

Weiwei Yang commented on HDFS-12098:

Hi [~anu]

The difference I noticed is that in the mini cluster the RPC seems to time out directly without retrying; I am not sure why the retry policy was not applied. On my setup I saw the following retries in the getVersion call:
{noformat}
17/07/11 19:27:05 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/07/11 19:27:06 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
17/07/11 19:27:07 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
{noformat}
These retries keep the thread alive even after the task execution is done. I will try to reproduce this in a test case. Thank you for looking at this.
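One way to read the observation above is that the caller stops waiting for the task while the worker thread is still inside the retry loop. The sketch below imitates that situation in plain Java. It is an assumption-laden illustration, not Hadoop's retry implementation or the datanode state machine: {{RetryOutlivesCaller}}, {{callWithRetry}} and {{workerBusyAfterCallerGivesUp}} are hypothetical names, and the sleep time is shortened from the 1 second seen in the log to 100 ms so the example finishes quickly.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RetryOutlivesCaller {

  /** Retries {@code op} on ConnectException, up to {@code maxRetries} times with a fixed sleep. */
  public static <T> T callWithRetry(Callable<T> op, int maxRetries, long sleepMillis)
      throws Exception {
    int retries = 0;
    while (true) {
      try {
        return op.call();
      } catch (ConnectException e) {
        if (retries >= maxRetries) {
          throw e;
        }
        retries++;
        Thread.sleep(sleepMillis);
      }
    }
  }

  /** Returns true if the worker thread is still retrying after the caller's wait has timed out. */
  public static boolean workerBusyAfterCallerGivesUp() {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    try {
      // Stand-in for an RPC against a port nobody listens on.
      Callable<String> alwaysRefused = () -> {
        throw new ConnectException("Connection refused");
      };
      // 10 retries * 100 ms keeps the worker busy for about one second.
      Callable<String> task = () -> callWithRetry(alwaysRefused, 10, 100);
      Future<String> result = pool.submit(task);
      try {
        // The caller only waits 200 ms, far less than the retry loop needs.
        result.get(200, TimeUnit.MILLISECONDS);
        return false;
      } catch (TimeoutException e) {
        // The caller has moved on, but the worker is still in the retry loop.
        return pool.getActiveCount() == 1;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      } catch (ExecutionException e) {
        return false;
      }
    } finally {
      // Interrupts the sleeping worker so the example terminates.
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("worker still busy: " + workerBusyAfterCallerGivesUp());
  }
}
```

If the caller resubmits a new task after every timeout without cancelling the old one, threads of this kind accumulate, which is one plausible shape for the thread leak described in this issue.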
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083237#comment-16083237 ]

Hadoop QA commented on HDFS-12098:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 5s | HDFS-12098 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12876725/disabled-scm-test.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/20239/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083232#comment-16083232 ]

Anu Engineer commented on HDFS-12098:

One big difference is the fact that I have a 1000 millisecond timeout for the socket calls in tests.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083229#comment-16083229 ]

Anu Engineer commented on HDFS-12098:

[~cheersyang], can you please share your repro steps once again, or look at the test patch that I have created? I have added a way to disable SCM; when the tests run, I can see that we do not hit the SCM.
{code}
java.net.SocketTimeoutException: Call From hw11767.home/192.168.29.224 to 0.0.0.0:58880 failed on socket timeout exception: java.net.SocketTimeoutException: 1000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels
{code}
However, I am not able to see many Datanode state machine threads. Please see the attached snapshot from my profiler. I have also attached a test case that I developed to simulate and debug this case.

Thanks
Anu
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082656#comment-16082656 ]

Anu Engineer commented on HDFS-12098:

[~cheersyang] Thanks for reporting this and posting a patch. Before commenting on it, I would like to simulate this in our unit tests and then test with and without your patch. I am going to modify MiniOzoneCluster and build it with flags called *disableSCM* and *disableKSM*, so we can simulate SCM or KSM being down. That will let me explore the behavior in greater detail.

Some thoughts on this patch: if my understanding is correct, isn't the root issue that we time out but forget to tell the running thread that we have already timed out? I was wondering, if we add an AtomicBoolean to each task which indicates whether it has timed out, then perhaps when the thread comes out of the call it can see that the caller has timed out and exit. Do you think that would address this issue? The reason I am asking is that if we pursue the single-thread approach, we have to create many state machines for various tasks, such as many SCMs or running some complex SCM commands. I am fine with that approach too, but it is something I wanted us to consider.
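For readers following the thread, the AtomicBoolean idea above could look roughly like the sketch below. It is not taken from any attached patch: {{FlaggedTask}}, {{runOnce}} and the integer counter standing in for the shared endpoint state are hypothetical names, and {{Thread.sleep}} stands in for the slow getVersion call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class TimedOutFlagDemo {

    /** Hypothetical task that only touches shared state if the caller is still waiting. */
    static final class FlaggedTask implements Callable<Boolean> {
        final AtomicBoolean timedOut = new AtomicBoolean(false);
        private final AtomicInteger sharedState;
        private final long workMillis;

        FlaggedTask(AtomicInteger sharedState, long workMillis) {
            this.sharedState = sharedState;
            this.workMillis = workMillis;
        }

        @Override
        public Boolean call() throws InterruptedException {
            Thread.sleep(workMillis);        // stands in for the slow remote call
            if (timedOut.get()) {
                return false;                // caller gave up: exit without advancing state
            }
            sharedState.incrementAndGet();   // stands in for "move to the next state"
            return true;
        }
    }

    /** Runs one task with a caller-side timeout and returns the final state counter. */
    static int runOnce(long workMillis, long timeoutMillis) {
        AtomicInteger state = new AtomicInteger(0);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        FlaggedTask task = new FlaggedTask(state, workMillis);
        Future<Boolean> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            task.timedOut.set(true);         // tell the still-running thread we timed out
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        executor.shutdown();                 // let the running task finish on its own
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return state.get();
    }

    public static void main(String[] args) {
        System.out.println("slow task, short timeout: " + runOnce(300, 50));
        System.out.println("fast task, long timeout:  " + runOnce(10, 2000));
    }
}
```

The flag check and the state update are still two separate steps here, so a real implementation would probably need to close that gap, for example by folding both into a single atomic update.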
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079862#comment-16079862 ]

Hadoop QA commented on HDFS-12098:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || HDFS-7240 Compile Tests ||
| +1 | mvninstall | 14m 40s | HDFS-7240 passed |
| +1 | compile | 0m 51s | HDFS-7240 passed |
| +1 | checkstyle | 0m 37s | HDFS-7240 passed |
| +1 | mvnsite | 0m 59s | HDFS-7240 passed |
| +1 | findbugs | 1m 51s | HDFS-7240 passed |
| +1 | javadoc | 0m 53s | HDFS-7240 passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 52s | the patch passed |
| +1 | compile | 0m 49s | the patch passed |
| +1 | javac | 0m 49s | the patch passed |
| +1 | checkstyle | 0m 35s | the patch passed |
| +1 | mvnsite | 0m 53s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 1m 57s | the patch passed |
| +1 | javadoc | 0m 50s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 65m 38s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 20s | The patch does not generate ASF License warnings. |
| | | 93m 14s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
| Timed out junit tests | org.apache.hadoop.ozone.container.ozoneimpl.TestRatisManager |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12876355/HDFS-12098-HDFS-7240.002.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux 38d58a4eea69 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 87154fc |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/20206/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/20206/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/20206/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078409#comment-16078409 ]

Hadoop QA commented on HDFS-12098:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 9s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || HDFS-7240 Compile Tests ||
| +1 | mvninstall | 18m 19s | HDFS-7240 passed |
| +1 | compile | 0m 51s | HDFS-7240 passed |
| +1 | checkstyle | 0m 38s | HDFS-7240 passed |
| +1 | mvnsite | 0m 58s | HDFS-7240 passed |
| +1 | findbugs | 1m 57s | HDFS-7240 passed |
| +1 | javadoc | 0m 51s | HDFS-7240 passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 52s | the patch passed |
| +1 | compile | 0m 49s | the patch passed |
| +1 | javac | 0m 49s | the patch passed |
| -0 | checkstyle | 0m 35s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 1 unchanged - 0 fixed = 5 total (was 1) |
| +1 | mvnsite | 0m 54s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 1m 57s | the patch passed |
| +1 | javadoc | 0m 49s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 65m 17s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 22s | The patch does not generate ASF License warnings. |
| | | 96m 33s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSStripedInputStreamWithRandomECPolicy |
| | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
| | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12876099/HDFS-12098-HDFS-7240.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux de496575ec93 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 5fd38a6 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/20188/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/20188/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/20188/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/20188/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later
[ https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078213#comment-16078213 ]

Weiwei Yang commented on HDFS-12098:

This happens because the datanode state machine leaks {{VersionEndpointTask}} threads. When scm is not yet started, more and more {{VersionEndpointTask}} threads keep retrying the connection to scm:
{noformat}
INIT - RUNNING
         \
          GETVERSION
            executor.execute(new VersionEndpointTask()) - retry on getVersion ...
            ... (HB interval)
            executor.execute(new VersionEndpointTask()) - retry on getVersion ...
            ... (HB interval)
            executor.execute(new VersionEndpointTask()) - retry on getVersion ...
            ...
{noformat}
The version endpoint tasks are launched once per HB interval (5s in my env), so a new task is submitted every 5s. The retry policy for each getVersion call is 10 * 1s = 10s, so a task can only finish every 10s. In each 10s window two tasks are submitted but only one finishes, so ONE thread leaks every 10s.

Once scm is up, all pending tasks are able to connect and their getVersion calls return, so each of them advances the state to the next one. Because the state is shared in {{EndpointStateMachine}}, it is advanced more than once, so when I review the state changes they look like this:
{noformat}
REGISTER
HEARTBEAT
SHUTDOWN
SHUTDOWN
SHUTDOWN
...
{noformat}
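The effect described above can be reproduced in isolation with the small sketch below. It is a simplified model, not the Ozone code: the {{State}} enum only mirrors the state names mentioned in this comment, and {{advanceByAll}} / {{advanceGuarded}} are hypothetical helpers. The guarded variant is just one possible way to make late tasks harmless and is not claimed to be what the attached patches do.

```java
import java.util.concurrent.atomic.AtomicReference;

public class SharedStateDemo {

    /** Simplified stand-in for the endpoint states named in the comment. */
    enum State { GETVERSION, REGISTER, HEARTBEAT, SHUTDOWN }

    /** Returns the next state in declaration order; SHUTDOWN is terminal. */
    static State next(State s) {
        return s == State.SHUTDOWN ? s : State.values()[s.ordinal() + 1];
    }

    /** Every pending task advances the shared state once when its call returns. */
    static State advanceByAll(int pendingTasks) {
        AtomicReference<State> shared = new AtomicReference<>(State.GETVERSION);
        for (int i = 0; i < pendingTasks; i++) {
            shared.updateAndGet(SharedStateDemo::next);
        }
        return shared.get();
    }

    /** Only a task that still observes GETVERSION is allowed to advance the state. */
    static State advanceGuarded(int pendingTasks) {
        AtomicReference<State> shared = new AtomicReference<>(State.GETVERSION);
        for (int i = 0; i < pendingTasks; i++) {
            shared.compareAndSet(State.GETVERSION, State.REGISTER);
        }
        return shared.get();
    }

    public static void main(String[] args) {
        System.out.println("1 task, unguarded:  " + advanceByAll(1));
        System.out.println("4 tasks, unguarded: " + advanceByAll(4));
        System.out.println("4 tasks, guarded:   " + advanceGuarded(4));
    }
}
```

With a single pending task the shared state moves from GETVERSION to REGISTER as intended; with several leaked tasks returning together it runs through to SHUTDOWN, which matches the state sequence listed in the comment.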