[ https://issues.apache.org/jira/browse/VCL-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315233#comment-14315233 ]
ASF subversion and git services commented on VCL-815:
-----------------------------------------------------
Commit 1658846 from [~arkurth] in branch 'vcl/trunk'
[ https://svn.apache.org/r1658846 ]
VCL-815
Reworked OS.pm::update_cluster. Each reservation now only creates a
cluster_info file for itself. The firewall processing was also improved to be
much more efficient.
Fixed a problem with the Windows code which creates the cluster_info file. The
line endings were not Windows-style because the sed command in
OS.pm::set_text_file_line_endings was failing. Sed under Cygwin can't handle a
Windows-style file path if the "-i" switch is used. Added a
get_cygwin_unix_file_path subroutine. This gets called and the result gets
passed to set_text_file_line_endings.
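As a rough illustration (not the exact committed code -- the execute() call
pattern and variable names are assumptions), the conversion amounts to running
cygpath.exe against the Windows-style path so that sed -i operates on a
Unix-style path:
{noformat}
# Illustration only -- not the exact committed code; the execute() call pattern
# and variable names are assumptions.
my $windows_path = 'C:/Users/Administrator/cluster_info';   # hypothetical path
my ($exit_status, $output) = $self->execute("cygpath.exe -u \"$windows_path\"");
my $unix_path = (defined $output && @$output) ? $output->[0] : undef;
# $unix_path would be /cygdrive/c/Users/Administrator/cluster_info, which the
# Cygwin sed -i command can handle
{noformat}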
Other
Cleaned up and renamed Windows.pm::get_cygwin_path -->
get_cygwin_installation_directory_path. It was messy: it contained notify
messages obviously copied and pasted from another subroutine which made no
sense. Its name had also become vague with the addition of
get_cygwin_unix_file_path; both subroutines call cygpath.exe.
> Problems with update_cluster in OS.pm
> -------------------------------------
>
> Key: VCL-815
> URL: https://issues.apache.org/jira/browse/VCL-815
> Project: VCL
> Issue Type: Bug
> Components: vcld (backend)
> Reporter: Andy Kurth
> Assignee: Andy Kurth
> Fix For: 2.4
>
>
> *Problem 1:*
> Due to OS.pm::update_cluster, the processing of the reserved state for each
> computer is taking an extremely long time. A fairly large cluster of 80
> computers is taking over 20 minutes to complete, not including the time it
> spends waiting for the user to click Connect and to actually connect to the
> computer. This is longer than it took to actually load the bare-metal
> computers via xCAT.
> *Problem 2:*
> Every reservation that belongs to a cluster request first creates its own
> cluster_info file in /tmp. The file that each creates is identical. Every
> reservation then proceeds to copy the file it created to every other computer
> in the cluster request via SCP. So, for an 80 node cluster, cluster_info
> files are needlessly copied back and forth 6,400 times.
> *Problem 3:*
> Each node does not copy the cluster_info file it created in /tmp to the
> proper location on itself. Instead, it sends it to itself via SCP. There's
> really no need for this because the file will end up in the proper location
> when the other 79 nodes SCP it over.
> *Problem 4:*
> The subroutine is using the following to determine where to send the file to
> via SCP:
> {noformat}
> my $node_name = $request_data->{reservation}{$resid}{computer}{SHORTNAME};
> {noformat}
> If you're not familiar with the backend code, the remote computer's short
> hostname is being used. For example, if the FQDN is vm9.example.com, the
> code is trying to send the file to "vm9". This will only work if "vm9"
> resolves to an address the management node can get to. This will not work
> for distributed management nodes residing in different locations.
> Furthermore, with the move away from relying on /etc/hosts, whenever a
> hostname is passed to run_scp_command, the computer's private IP address will
> be used. Examine the determine_remote_connection_target subroutine for more
> information. If it is absolutely necessary to SCP files back and forth, the
> public IP address must be used.
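> A quick standalone sketch (not VCL code) of why the short hostname is
> fragile -- it only works if the management node happens to be able to
> resolve it:
> {noformat}
> #!/usr/bin/perl
> # Standalone sketch, not VCL code: a bare short hostname such as "vm9" is only
> # usable if the management node can resolve it to a reachable address.
> use strict;
> use warnings;
> use Socket;
> 
> my $short_name = 'vm9';    # the kind of value SHORTNAME contains
> my $packed_ip  = gethostbyname($short_name);
> if ($packed_ip) {
>    print "$short_name resolves to " . inet_ntoa($packed_ip) . "\n";
> }
> else {
>    print "$short_name does not resolve -- an SCP by short hostname would fail\n";
> }
> {noformat}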
> *Problem 5:*
> The update_cluster subroutine is attempting to open the firewall on each
> computer for incoming traffic from the other computers in the cluster. This
> is good, but the way it is implemented is very slow and inefficient. It is
> also not reliable due to the way the enable_firewall_port subroutines are
> handling things. Errors are being generated because the command being
> generated is too large:
> {noformat}
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| ----
> WARNING ----
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828|
> 2015-02-04
> 16:28:26|10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828|failed
> to enable firewall port on blade1f5-5, protocol: tcp, port: any, scope:
> <omitted>, exit status: 1, command:
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| iptables
> -D INPUT 9 && iptables -D INPUT 8 && iptables -D INPUT 7 && iptables -D INPUT
> 6 && iptables -D INPUT 5 && iptables -D INPUT 4 && iptables -D INPUT 3 &&
> iptables -D INPUT 2 && iptables -D INPUT 10 && iptables -D INPUT 1 &&
> /sbin/iptables -v -I INPUT 1 -p tcp -j ACCEPT -s 10.25.0.241 -m state --state
> NEW,RELATED,ESTABLISHED &&
> ...much more omitted...
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| output:
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| iptables:
> Index of deletion too big.
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| ( 0)
> Linux.pm, enable_firewall_port (line: 3828)
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| (-1)
> OS.pm, update_cluster (line: 3909)
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| (-2)
> reserved.pm, process (line: 157)
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| (-3)
> vcld, make_new_child (line: 587)
> |10435|2626116|2726142|reserved|Linux.pm:enable_firewall_port|3828| (-4)
> vcld, main (line: 348)
> {noformat}
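> Just to illustrate one way the oversized command could be avoided (this is a
> sketch, not the committed fix, and the run_ssh_command arguments are shown
> loosely -- check utils.pm for the exact signature): delete the matching rules
> one command at a time, highest index first, so each command stays small and
> earlier deletions do not shift the indexes used by later ones:
> {noformat}
> # Sketch only -- not the committed fix.  run_ssh_command() arguments are shown
> # loosely; check utils.pm for the exact signature.
> for my $index (sort { $b <=> $a } @matching_rule_indexes) {
>    run_ssh_command($computer_node_name, $management_node_keys, "iptables -D INPUT $index");
> }
> run_ssh_command($computer_node_name, $management_node_keys,
>    "/sbin/iptables -v -I INPUT 1 -p tcp -j ACCEPT -s $scope_ip -m state --state NEW,RELATED,ESTABLISHED");
> {noformat}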
> *Problem 6:*
> The subroutine is calling $self->data->get_request_data() and then accessing
> the hash data directly. Example:
> {noformat}
> $request_data->{reservation}{$rid}{computer}{IPaddress}
> {noformat}
> The past few years have been spent trying to get rid of code which directly
> accesses the hash constructed by get_request_info. There are functions built
> into DataStructure.pm which may be accessed via $self->data to retrieve this
> information, including information about other reservations in the cluster.
> The get_request_data function should only be used when absolutely necessary
> -- such as from DataStructure.pm or if the entire hash is needed to create
> another module object.
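> For example (the accessor name below is an assumption for illustration --
> check DataStructure.pm for the actual getters):
> {noformat}
> # Discouraged -- reaching directly into the get_request_info hash:
> #   my $ip_address = $request_data->{reservation}{$rid}{computer}{IPaddress};
> # Preferred -- go through the DataStructure object ($self->data); the accessor
> # name below is assumed for illustration:
> my $ip_address = $self->data->get_computer_ip_address();
> {noformat}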
> *Problem 7:*
> The reserved.pm::process subroutine calculates the exact time it should
> time out a reservation due to no initial connection by taking the time the
> user clicked Connect and adding the initialconnecttimeout variable value to
> it. The update_cluster subroutine gets run in the meantime. As it is
> currently written, it can take longer than the entire initialconnecttimeout
> value. When control is passed back to reserved.pm::process, the code
> realizes the initial connection expire time is already in the past, so
> connection checking isn't done at all.
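> Roughly what happens (the variable and subroutine names below are
> illustrative, not the actual reserved.pm code):
> {noformat}
> # Illustration only -- these are not the actual reserved.pm variables.
> my $connect_time          = time();              # user clicked Connect
> my $initialconnecttimeout = 900;                 # example value, in seconds
> my $connect_expire_time   = $connect_time + $initialconnecttimeout;
> 
> $self->os->update_cluster();                     # can run for > 20 minutes
> 
> # By the time control returns, time() is already past $connect_expire_time,
> # so the initial connection checking loop never runs.
> {noformat}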
> *Problem 8:*
> The code includes the following:
> {noformat}
> if ($image_OS_type =~ /linux/i) {
> ...
> }
> elsif ($image_OS_type =~ /windows/i) {
> ...
> }
> {noformat}
> The backend is architected so that these if/else statements should be avoided
> whenever possible. Sure, an if/else such as this is quick and easy, but the
> more elegant way to handle it would be to add a commonly named subroutine to
> the Linux and Windows modules that handles the specifics of this statement
> (see the sketch below). Again, work has been done over the past few years to
> rework code such as this. I see no reason to keep adding it back.
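> For example (the subroutine name below is hypothetical), Linux.pm and
> Windows.pm would each implement a commonly named subroutine containing the
> OS-specific steps, and update_cluster would simply call it and let the
> object's class select the right implementation:
> {noformat}
> # Hypothetical subroutine name, shown only to illustrate the pattern.
> # Linux.pm and Windows.pm each define create_cluster_info_file() containing
> # the OS-specific details; OS.pm::update_cluster then just calls:
> if (!$self->create_cluster_info_file()) {
>    notify($ERRORS{'WARNING'}, 0, "failed to create cluster_info file");
>    return;
> }
> {noformat}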
> *Problem 9:*
> "$image_OS_type" - There are over 9,000 variables defined in the backend
> code. Variables intended to act as constants for an entire module are all
> uppercase. Nearly all other variables are entirely lowercase, except for a
> few in some very old code. I see no reason to mix case.