[ 
https://issues.apache.org/jira/browse/AMBARI-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-2713:
---------------------------------------

    Description: 
I deployed a 3-node cluster and tried to bootstrap 4 nodes. Two of the nodes 
were 'bad', for different reasons:
{code}
host1.internal
host2.internal
host3.internal - Python executables were left without the execute bit
host4.internal - non-existent node, does not respond to ping
{code}
I also configured verbose logging with timestamps. During bootstrap, all 4 
nodes were stuck in the Installing state for ~5 minutes. After that, 2 nodes 
failed and the other 2 switched to the Registering state.

The root problem is that an scp operation to a non-existent node takes about 
1 minute before the connection times out. In addition:
- all parallel scp operations are performed in up to 20 threads at once. If 
there are more hosts in the list, the list is split into chunks, and the next 
chunk launches only when the previous one finishes (see the sketch after this 
list). The same applies to ssh.
- subsequent operations are performed only after all of the previous parallel 
ssh/scp operations complete.
- done files for hosts are created at the last step of bootstrap, for all 
hosts at once.
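
A minimal sketch of the chunked pattern described above (the helper names are 
hypothetical illustrations, not the actual bootstrap.py code):
{code}
import subprocess
from threading import Thread

CHUNK_SIZE = 20  # up to 20 parallel scp threads at once

def scp_to_host(host, src, dest):
    # A dead host stalls here for ~1 minute before scp gives up;
    # OpenSSH's -o ConnectTimeout=N would cap that wait.
    subprocess.call(["scp", src, "%s:%s" % (host, dest)])

def run_in_chunks(hosts, src, dest):
    for i in range(0, len(hosts), CHUNK_SIZE):
        chunk = hosts[i:i + CHUNK_SIZE]
        threads = [Thread(target=scp_to_host, args=(h, src, dest))
                   for h in chunk]
        for t in threads:
            t.start()
        # The next chunk starts only after EVERY thread in this one
        # finishes, so a few dead hosts delay the whole chunk.
        for t in threads:
            t.join()
{code}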

As a result, if we have 174 hosts overall and 26 of them are down, 
inaccessible, or not configured for pubkey auth:
- the 174 hosts are split into 9 chunks of up to 20 hosts for the initial scp 
operation. Every chunk contains ~3 dead hosts, so every chunk waits ~1 minute 
for its dead hosts to time out: ~9 minutes overall (modelled in the snippet 
after this list).
- the 148 hosts that completed scp continue bootstrap and finish within a few 
minutes.
- only when all 148 hosts have finished bootstrap are done files created, for 
all 174 hosts at once.
- the server then reads the exit status for all 174 hosts and considers 
bootstrap completed; only at that point is this reflected in the API.
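
To make the arithmetic explicit, a back-of-the-envelope model (assuming the 
~1-minute timeout dominates each chunk, since the dead hosts within a chunk 
time out in parallel):
{code}
import math

hosts, dead, chunk_size = 174, 26, 20
timeout_minutes = 1  # ~1 minute per scp connection timeout

chunks = math.ceil(hosts / chunk_size)   # 9 chunks
dead_per_chunk = dead / chunks           # ~2.9 dead hosts per chunk
stall = chunks * timeout_minutes         # each chunk waits ~1 minute
print(chunks, round(dead_per_chunk, 1), stall)  # 9 2.9 9
{code}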

The described behaviour is not a bug but rather the way bootstrap.py currently 
works. 
Possible solutions:
- completely redesign bootstrap.py (one possible shape is sketched below)
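
The attached patch is not reproduced in this notification, so the following is 
only one possible shape such a redesign could take, not the actual fix: a 
sliding window of workers instead of lockstep chunks, a capped connection 
timeout, and per-host done files written as soon as each host finishes 
(write_done_file, its path, and the setupAgent.py copy step are illustrative 
assumptions):
{code}
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 20

def bootstrap_host(host):
    # One host's scp/ssh sequence; ConnectTimeout caps the stall on
    # dead hosts to seconds instead of ~1 minute.
    subprocess.check_call(["scp", "-o", "ConnectTimeout=10",
                           "setupAgent.py", "%s:/tmp/" % host])

def write_done_file(host, status):
    # Hypothetical per-host marker, written immediately so the server
    # sees progress incrementally rather than all at once at the end.
    os.makedirs("/tmp/bootstrap", exist_ok=True)
    with open("/tmp/bootstrap/%s.done" % host, "w") as f:
        f.write(str(status))

def bootstrap_all(hosts):
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(bootstrap_host, h): h for h in hosts}
        # Hosts are reported as they finish: a dead host occupies one
        # worker slot while timing out but no longer blocks a chunk.
        for fut in as_completed(futures):
            host = futures[fut]
            results[host] = 0 if fut.exception() is None else 1
            write_done_file(host, results[host])
    return results
{code}
With this shape, the 148 good hosts would register within a few minutes 
regardless of the 26 dead ones, and the API could reflect per-host status as 
it happens.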

> Perf: Installer host registration stuck in 'Installing' for 10mins before 
> succeeding
> ------------------------------------------------------------------------------------
>
>                 Key: AMBARI-2713
>                 URL: https://issues.apache.org/jira/browse/AMBARI-2713
>             Project: Ambari
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 1.2.5
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.4.0
>
>         Attachments: AMBARI-2713.patch
>
