[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated HDFS-12098:
-------------------------------
    Description: 
Steps to reproduce
1. Start namenode

{{./bin/hdfs --daemon start namenode}}

2. Start datanode

{{./bin/hdfs datanode}}

The datanode log shows the following connection retries:

{noformat}
17/07/13 21:16:48 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
17/07/13 21:16:49 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
17/07/13 21:16:50 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
17/07/13 21:16:51 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
{noformat}

This is expected because SCM has not been started yet.
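
For readers unfamiliar with the retry policy named in the log above, here is a minimal plain-Java sketch of a bounded retry loop with a fixed sleep between attempts. It is not Hadoop's implementation; the class {{FixedSleepRetrySketch}}, the method {{connectWithRetries}}, and the simulated server are hypothetical names used only for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a bounded retry loop with a fixed sleep; not Hadoop code. */
class FixedSleepRetrySketch {

    /**
     * Tries once, then retries up to maxRetries more times, sleeping a fixed
     * interval between attempts. Returns the number of failed attempts before
     * the first success, or -1 if every attempt failed or the wait was interrupted.
     */
    static int connectWithRetries(BooleanSupplier tryConnect, int maxRetries,
                                  long sleepTime, TimeUnit unit) {
        for (int failed = 0; failed <= maxRetries; failed++) {
            if (tryConnect.getAsBoolean()) {
                return failed;
            }
            if (failed < maxRetries) {
                System.out.printf("Retrying connect. Already tried %d time(s)%n", failed);
                try {
                    unit.sleep(sleepTime);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated server that only becomes reachable on the fourth attempt.
        int failed = connectWithRetries(() -> ++calls[0] >= 4, 10, 10, TimeUnit.MILLISECONDS);
        System.out.println("connected after " + failed + " failed attempt(s)");
    }
}
```

The point of the sketch is that a bounded policy eventually gives up: if SCM comes up after the retries are exhausted, something else has to start a fresh round of attempts.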

3. Start scm

{{./bin/hdfs scm}}

The datanode is expected to register with this SCM, and the SCM log is expected to show:

{noformat}
17/07/13 21:22:30 INFO node.SCMNodeManager: Data node with ID: af22862d-aafa-4941-9073-53224ae43e2c Registered.
{noformat}

However, this log line did *NOT* appear. (_Debugging showed that the datanode 
state transitioned to SHUTDOWN unexpectedly because of a thread leak: each of 
the leaked threads moved the state to the next state, so together they drove 
it all the way to SHUTDOWN._)
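
The failure mode described above can be illustrated with a small, self-contained sketch. It is not the datanode code: the three-state lifecycle ({{INIT}}, {{RUNNING}}, {{SHUTDOWN}}), the class {{LeakedThreadStateSketch}}, and the method {{runWorkers}} are assumptions made for illustration. The idea is that when only one worker advances the state, the state moves one step, but when several leaked workers each advance it once, the shared state overshoots to SHUTDOWN.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class LeakedThreadStateSketch {

    /** Simplified, assumed lifecycle; the real datanode states may differ. */
    enum State {
        INIT, RUNNING, SHUTDOWN;

        /** Advances one step and stays at SHUTDOWN once it is reached. */
        State next() {
            return this == SHUTDOWN ? SHUTDOWN : values()[ordinal() + 1];
        }
    }

    /** Starts the given number of workers, each advancing the shared state once. */
    static State runWorkers(int workers) {
        AtomicReference<State> state = new AtomicReference<>(State.INIT);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> state.updateAndGet(State::next));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return state.get();
    }

    public static void main(String[] args) {
        // One worker is the intended case; several workers model the leak.
        System.out.println("1 worker  -> " + runWorkers(1));
        System.out.println("3 workers -> " + runWorkers(3));
    }
}
```

In this sketch the atomic update keeps each transition consistent, so the overshoot comes purely from the number of workers, which matches the observation that the leak, not a data race, pushes the state to SHUTDOWN.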

4. Create a container from scm CLI

{{./bin/hdfs scm -container -create -c 20170714c0}}

This fails with the following exception:

{noformat}
Creating container : 20170714c0.
Error executing command:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.scm.exceptions.SCMException): Unable to create container while in chill mode
        at org.apache.hadoop.ozone.scm.container.ContainerMapping.allocateContainer(ContainerMapping.java:241)
        at org.apache.hadoop.ozone.scm.StorageContainerManager.allocateContainer(StorageContainerManager.java:392)
        at org.apache.hadoop.ozone.protocolPB.StorageContainerLocationProtocolServerSideTranslatorPB.allocateContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:73)
{noformat}

The datanode was never registered with SCM, so SCM is still in chill mode.
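
A rough sketch of why a missing registration blocks container creation, assuming SCM leaves chill mode only after some minimum number of datanodes have registered. The exact exit criteria are not described in this issue, so the threshold, the class {{ChillModeSketch}}, and its methods are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of a chill-mode gate; not the SCM implementation. */
class ChillModeSketch {
    // Assumed exit criterion: at least this many datanodes have registered.
    private final int minRegisteredNodes;
    private final Set<String> registered = new HashSet<>();

    ChillModeSketch(int minRegisteredNodes) {
        this.minRegisteredNodes = minRegisteredNodes;
    }

    void register(String datanodeId) {
        registered.add(datanodeId);
    }

    boolean inChillMode() {
        return registered.size() < minRegisteredNodes;
    }

    /** Refuses allocation while in chill mode, otherwise returns the container name. */
    String allocateContainer(String name) {
        if (inChillMode()) {
            throw new IllegalStateException("Unable to create container while in chill mode");
        }
        return name;
    }

    public static void main(String[] args) {
        ChillModeSketch scm = new ChillModeSketch(1);
        try {
            scm.allocateContainer("20170714c0");
        } catch (IllegalStateException e) {
            System.out.println("before registration: " + e.getMessage());
        }
        scm.register("af22862d-aafa-4941-9073-53224ae43e2c");
        System.out.println("after registration: created " + scm.allocateContainer("20170714c0"));
    }
}
```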

*Note*: if SCM is started first, the issue does not occur and I can create a 
container from the CLI without any problem.




> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
>                 Key: HDFS-12098
>                 URL: https://issues.apache.org/jira/browse/HDFS-12098
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, ozone, scm
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>         Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch, 
> HDFS-12098-HDFS-7240.002.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, 
> thread_dump.log
>


