On Wednesday, September 20, 2017 at 3:55:43 AM UTC-7, Lukas Zapletal wrote:
>
> A MAC address can only exist once: if you already have a 
> (managed/unmanaged) host and you try to discover a host with the same 
> MAC, you will get an error. Depending on the Foreman discovery version 
> it is either a 422 or "Host already exists": 
>
> https://github.com/theforeman/foreman_discovery/commit/210f143bc85c58caeb67e8bf9a5cc2edbe764683
>  
>

Hmm, one generic question on this - according to the above logic, if my 
managed host crashed, say because it lost its HW RAID controller, it can't 
boot off the disk anymore and falls back to PXE boot (given that the BIOS 
boot order is set that way), correct?

Now, by default, Foreman's pxeconfig file makes a system boot off its disk, 
which in this particular situation results in an endless loop until some 
monitoring external to Foreman detects the system failure; only then does a 
human get on a console and real troubleshooting begin.

That does not scale beyond a hundred systems or so. For this reason, in our 
current setup, where we *don't* use Foreman for OS provisioning but only for 
system discovery, I've updated the default pxeconfig to always load a 
discovery OS. This covers both the new-system and the crashed-system 
scenarios I described above.
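
Roughly, the pxeconfig change amounts to the following (a sketch: the kernel 
command line follows the stock discovery PXELinux template, while the image 
paths and the foreman.example.com URL are placeholders for our environment):

# /var/lib/tftpboot/pxelinux.cfg/default - always boot the discovery image
DEFAULT discovery
TIMEOUT 50

LABEL discovery
  KERNEL boot/fdi-image/vmlinuz0
  APPEND initrd=boot/fdi-image/initrd0.img rootflags=loop root=live:/fdi.iso rootfstype=auto ro rd.live.image acpi=force proxy.url=https://foreman.example.com proxy.type=foreman
  IPAPPEND 2

With no local-boot label left in the file, even a host that crashed after 
provisioning comes back up in the discovery OS instead of looping.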

Each discovered host is reported to a higher orchestration layer on an 
after_commit event, and that orchestration handles OS provisioning on its 
own, so the discovered system never ends up among the managed hosts in 
Foreman. Once OS provisioning is done, the higher layer comes back and 
deletes the host it just provisioned from the discovered hosts. If the 
orchestration detects that a hook call from Foreman reports a system that was 
previously provisioned, that system is automatically marked "maintenance" and 
HW diagnostics are auto-started. Based on the result, the orchestration 
starts either a HW replacement flow or troubleshooting of a new problem. As 
you can see, humans only get involved very late in the process, and only if 
auto-remediation is not possible (a HW component failed, an unknown signature 
was detected); at large scale it is simply impossible to attend to each 
failed system individually. As you can imagine, this automation flow saves us 
hundreds of man-hours.
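
To illustrate the mechanics, the reporting piece is just an executable 
dropped into the foreman_hooks directory tree. A minimal sketch, assuming 
foreman_hooks' usual convention of passing the event and object name as 
arguments and the object as JSON on stdin (the orchestrator URL is a 
placeholder):

#!/bin/bash
# /usr/share/foreman/config/hooks/host/discovered/after_commit/10_notify.sh
event="$1"    # event name, e.g. after_commit
name="$2"     # discovered host name, e.g. mac008cfaf1abe4
# forward the JSON representation of the discovered host to the
# orchestration layer, which owns OS provisioning from here on
curl -s -X POST -H "Content-Type: application/json" \
  --data-binary @- "https://orchestrator.example.com/v1/discovered" >/dev/null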

Now, with that in mind, I was thinking of moving the actual OS provisioning 
tasks to Foreman as well. However, if a crashed system is never allowed to 
re-register (get discovered) because it is already managed by Foreman, the 
above flow is simply not going to work anymore and I'd have to re-think all 
the flows. Are there specific reasons why this restriction is in place? I 
understand that this is how it is implemented now, but is there a bigger idea 
behind it? If so, what is it? Also, taking my example of stitching flows 
together for complete system lifecycle management, what would you suggest we 
do differently so that Foreman can be the system we use for both discovery 
and OS provisioning?

Another thing (not as generic as the above, but very applicable to my 
current issue) - if a client system is not allowed to register and is given a 
422 error, for example, it keeps retrying, which generates a huge amount of 
work. This is also a gap, IMHO - the discovery plug-in needs to handle this 
differently somehow, so that rejected systems do not eat up Foreman resources 
(see below for the actual numbers of such attempts in one of my clusters).
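
For what it's worth, the reject is easy to reproduce by hand with a 
trimmed-down fact upload (a sketch: the payload is minimal and the hostname 
is a placeholder - the real discovery image sends the full fact set):

# re-upload facts for a MAC that already belongs to an existing host;
# Foreman answers 422 Unprocessable Entity
curl -k -s -o /dev/null -w "%{http_code}\n" \
  -X POST https://foreman.example.com/api/v2/discovered_hosts/facts \
  -H "Content-Type: application/json" \
  -d '{"facts":{"interfaces":"eth0","macaddress_eth0":"00:8c:fa:f1:ab:e4","ipaddress_eth0":"10.10.10.10","discovery_bootif":"00:8c:fa:f1:ab:e4"}}'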
 

> Anyway, you wrote you have deadlocks, but in the log snippet I see 
> host discovery at a rate of 1-2 imports per minute. This cannot block 
> anything; it is quite a slow rate. I don't understand - can you 
> pastebin a log snippet from the peak time when you have these 
> deadlocks? 
>

After more digging since this issue was reported to me, it does not look 
load-related to me. Even with a low number of registrations, I see a high 
rate of deadlocks. I took another Foreman cluster (also 3 active nodes) and 
see the following discovery-related activity (since 3:30am this morning):

[root@spc01 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
282

[root@spc02 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
2278

[root@spc03 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
143

These are the numbers of rejected attempts (all of them 422s):

[root@spc01 ~]# grep Entity /var/log/foreman/production.log | wc -l
110

[root@spc02 ~]# grep Entity /var/log/foreman/production.log | wc -l
2182

[root@spc03 ~]# grep Entity /var/log/foreman/production.log | wc -l
57

The numbers of deadlocks:

[root@spc01 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
59

[root@spc02 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
31

[root@spc03 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
30

The actual deadlock messages are here - 
https://gist.github.com/anonymous/a20f4097396037cd30903d232a3e6d0f
As you can see, most of them are locked on attempts to update facts.
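
In case it helps cross-check, the most recent deadlock can also be pulled 
straight out of InnoDB on the DB nodes (standard MySQL, nothing 
Foreman-specific):

# show the two transactions involved in the last detected deadlock
mysql -e "SHOW ENGINE INNODB STATUS\G" | \
  sed -n '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS/p'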

The large number of registration attempts and rejects on the spc02 node is 
mostly attributable to a single host:

[root@spc02 ~]# grep "Multiple discovered hosts found with MAC address" 
/var/log/foreman/production.log | wc -l
1263

[root@spc02 ~]# grep "Multiple discovered hosts found with MAC address" 
/var/log/foreman/production.log | head -1
2017-09-20 04:39:15 de3ee3bf [app] [W] Multiple discovered hosts found with 
MAC address 00:8c:fa:f1:ab:e4, choosing one

After I removed both incomplete "mac008cfaf1abe4" 
<https://spc.vip.phx.ebay.com/discovered_hosts/mac008cfaf1abe4> records, 
that system was finally able to register properly.
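
For anyone hitting the same thing, the stale records can also be removed via 
the API instead of the UI (a sketch; credentials and hostname are 
placeholders):

# delete a duplicate/incomplete discovered-host record so the node
# can re-register
curl -k -s -u admin:changeme -X DELETE \
  https://foreman.example.com/api/v2/discovered_hosts/mac008cfaf1abe4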

Here's also a full debug trace I took yesterday - it is a single host trying 
to register. Unfortunately, this one does not contain any deadlocks - 
https://gist.github.com/anonymous/47fe4baa60fc5285b70faf37e6f797af
Do you want me to try to capture one of those deadlocks?

All this makes me think that the root cause of this behavior may lie outside 
of Foreman - two obvious possibilities spring to mind:
(a) The load-balanced active/active configuration of my Foreman nodes - even 
though I do have source_address binding enabled for connections to 443 on the 
Foreman vServer on the LB, maybe there's more to it. This is rather easy to 
verify - I'm going to shut off the other 2 instances and see if I get any 
deadlocks again.
(b) The second possibility is the Galera-based MySQL. This one is harder to 
check, but if the first option doesn't pan out, I'll have to convert the DB 
back to a single node and see. If this turns out to be the issue, it is very 
bad, as it would mean no proper HA for the Foreman DB, so I'm hoping this is 
not the case.
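
One quick check first, though: Galera surfaces its certification conflicts to 
clients as ordinary deadlock errors, so if this counter grows in step with 
the deadlocks in production.log, the multi-writer setup is the likely culprit 
(a plain Galera status query, nothing Foreman-specific):

# certification failures are write conflicts between cluster nodes
# that Galera reports back to the application as deadlocks
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';"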

While I'm working on this, please let me know if I can provide any more 
info, or if you have any other suggestions, etc.
Thanks!
