On Wednesday, September 20, 2017 at 3:55:43 AM UTC-7, Lukas Zapletal wrote:

> A MAC address can only exist once, if you already have a
> (managed/unmanaged) host and you try to discover a host with same MAC,
> you will get error. Depending on Foreman discovery it is either 422 or
> "Host already exists":
>
> https://github.com/theforeman/foreman_discovery/commit/210f143bc85c58caeb67e8bf9a5cc2edbe764683
Hmm, one generic question on this. According to the above logic, suppose my managed host has crashed, say because it lost its HW RAID controller, so it can't boot off the disk anymore, which results in a PXE boot (given that the BIOS boot order is set that way). Now, by default, Foreman's default pxeconfig file makes a system boot off its local disk, which in this particular situation results in an endless loop until some monitoring external to Foreman detects the system failure; only then does a human get on a console and only then does real troubleshooting start. That does not scale beyond a hundred systems or so.

For this reason, in our current setup where we *don't* use Foreman for OS provisioning but only for system discovery, I've updated the default pxeconfig to always load the discovery OS. This covers both the new-system and the crashed-system scenarios I described above. Each discovered host is reported to a higher layer of orchestration on an after_commit event, and that orchestration handles OS provisioning on its own, so the discovered system never ends up among the managed hosts in Foreman. Once OS provisioning is done, the higher layer comes and deletes the host it just provisioned from the discovered hosts. If the orchestration detects that a hook call from Foreman reports a system that was previously provisioned, that system is automatically marked "maintenance" and HW diagnostics are auto-started. Based on the result of that, orchestration starts either a HW replacement flow or troubleshooting of a new problem.

As you can see, humans are only involved very late in the process, and only if auto-remediation is not possible (a HW component failed, an unknown signature was detected). Otherwise, in large-scale environments it is simply impossible to attend to each failed system individually. As you can imagine, this automation flow saves us hundreds of man-hours. Now, with that in mind, I was thinking of moving the actual OS provisioning tasks to Foreman as well.
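For context, the pxeconfig change amounts to flipping the global PXE default from local boot to the Foreman Discovery Image. A rough sketch of what such a pxelinux.cfg/default can look like (the kernel/initrd paths and the proxy URL are placeholders, not our exact values):

```
DEFAULT discovery
TIMEOUT 50

LABEL discovery
  # Boot the Foreman Discovery Image instead of LOCALBOOT, so both brand-new
  # and crashed machines end up registering with discovery.
  KERNEL boot/fdi-image/vmlinuz0
  APPEND initrd=boot/fdi-image/initrd0.img rootflags=loop root=live:/fdi.iso rootfstype=auto ro rd.live.image acpi=force proxy.url=https://foreman.example.com proxy.type=foreman
  IPAPPEND 2
```

With this in place a host that loses its boot disk simply falls through to PXE and re-appears as a discovered host instead of looping forever.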
However, if a crashed system is never allowed to re-register (get discovered) because it is already managed by Foreman, the above flow simply won't work anymore and I'd have to rethink all the flows. Are there specific reasons why this restriction is in place? I understand that this is how it is implemented now, but is there a bigger idea behind it? If so, what is it? Also, taking my example of stitching flows together for complete system lifecycle management, what would you suggest we do differently so that Foreman can be the system we use for both discovery and OS provisioning?

Another thing (not as generic as the above, but very applicable to my current issue): if a client system is not allowed to register and gets a 422 error, for example, it keeps retrying, resulting in a huge amount of wasted work. This is also a gap, IMHO. The discovery plug-in needs to handle this differently somehow, so that rejected systems do not consume Foreman resources (see below for actual numbers of such attempts in one of my clusters).

> Anyway you wrote you have deadlocks, but in the log snippet I do see
> that you have host discovery at rate 1-2 imports per minute. This
> cannot block anything, this is quite slow rate. I don't understand,
> can you pastebin log snippet from the peak time when you have these
> deadlocks?

After more digging since this issue was reported to me, it does not look load-related. Even with a low number of registrations, I see a high rate of deadlocks.
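To make the hook flow I described above concrete: with the foreman_hooks plugin, a discovered-host event runs an executable that gets the event and object name as arguments and the object rendered as JSON on stdin. A minimal sketch of the payload handling our orchestration layer does; the orchestrator URL and the exact JSON shape (`{"host": {...}}`) are assumptions for illustration:

```python
import json

def extract_identity(payload: str) -> dict:
    """Pull the fields the orchestration layer keys on from the hook's JSON
    payload (the {"host": {...}} shape is an assumption for this sketch)."""
    host = json.loads(payload).get("host", {})
    return {"name": host.get("name"), "mac": host.get("mac"), "ip": host.get("ip")}

# In a real hook, sys.argv carries the event ("after_commit") and object name
# and sys.stdin carries the JSON; here we demo on a synthetic payload:
sample = '{"host": {"name": "mac008cfaf1abe4", "mac": "00:8c:fa:f1:ab:e4", "ip": "10.1.2.3"}}'
identity = extract_identity(sample)
# A real hook would now POST `identity` to the orchestrator, e.g.
# requests.post("https://orchestrator.example.com/discovered", json=identity)
print(identity)
```

The orchestrator side then decides whether this is a new machine (provision it) or a previously provisioned one (mark "maintenance", start diagnostics).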
I took another Foreman cluster (3 active nodes as well) and see the following activity as it pertains to system discovery (since 3:30am this morning):

[root@spc01 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
282
[root@spc02 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
2278
[root@spc03 ~]# grep "/api/v2/discovered_hosts/facts" /var/log/foreman/production.log | wc -l
143

These are the numbers of rejected attempts (all of them are 422s):

[root@spc01 ~]# grep Entity /var/log/foreman/production.log | wc -l
110
[root@spc02 ~]# grep Entity /var/log/foreman/production.log | wc -l
2182
[root@spc03 ~]# grep Entity /var/log/foreman/production.log | wc -l
57

And the number of deadlocks:

[root@spc01 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
59
[root@spc02 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
31
[root@spc03 ~]# grep -i deadlock /var/log/foreman/production.log | wc -l
30

The actual deadlock messages are here: https://gist.github.com/anonymous/a20f4097396037cd30903d232a3e6d0f

As you can see, most of them are locked on attempts to update facts. The large number of registration attempts and rejects on the spc02 node is mostly contributed by a single host:

[root@spc02 ~]# grep "Multiple discovered hosts found with MAC address" /var/log/foreman/production.log | wc -l
1263
[root@spc02 ~]# grep "Multiple discovered hosts found with MAC address" /var/log/foreman/production.log | head -1
2017-09-20 04:39:15 de3ee3bf [app] [W] Multiple discovered hosts found with MAC address 00:8c:fa:f1:ab:e4, choosing one

After I removed both incomplete "mac008cfaf1abe4" <https://spc.vip.phx.ebay.com/discovered_hosts/mac008cfaf1abe4> records, that system was finally able to register properly. Here's also a full debug I took yesterday; it is a single host trying to register.
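Finding which host is hammering a node is easy to script; here is the quick sketch I use to rank MACs by rejected-registration warnings. The sample log below is a synthetic copy of the line above, so the pipeline can be demonstrated anywhere; on a real node you would point `log` at /var/log/foreman/production.log instead:

```shell
# Rank MAC addresses by number of "Multiple discovered hosts" warnings.
# Synthetic sample log for demonstration; use the real production.log on a node.
log=$(mktemp)
cat > "$log" <<'EOF'
2017-09-20 04:39:15 de3ee3bf [app] [W] Multiple discovered hosts found with MAC address 00:8c:fa:f1:ab:e4, choosing one
2017-09-20 04:40:02 de3ee3bf [app] [W] Multiple discovered hosts found with MAC address 00:8c:fa:f1:ab:e4, choosing one
2017-09-20 04:41:10 de3ee3bf [app] [W] Multiple discovered hosts found with MAC address 00:11:22:33:44:55, choosing one
EOF

# Extract the MAC from each warning, then count and sort by frequency:
grep -o 'MAC address [0-9a-f:]*' "$log" | sort | uniq -c | sort -rn
```

The top line of the output is the worst offender; in the case above it pointed straight at 00:8c:fa:f1:ab:e4 and its two stale discovered-host records.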
Unfortunately, this one does not have any deadlocks: https://gist.github.com/anonymous/47fe4baa60fc5285b70faf37e6f797af

Do you want me to try to capture one of those deadlocks?

All this makes me think that the root cause of this behavior may be outside of Foreman. Two obvious possibilities spring to mind:

(a) The load-balanced active/active configuration of my Foreman nodes. Even though I do have source_address binding enabled for connections to 443 on the Foreman vServer on the LB, maybe there's more to it. This is rather easy to verify: I'm going to shut off the other 2 instances and see if I get any deadlocks again.

(b) The Galera-based MySQL. This one is harder to check, but if the first option does not help, I'll have to convert the DB back to a single node and see. If this turns out to be the issue, it is very bad, as that would mean no proper HA for the Foreman DB, so I'm hoping this is not the case.

While I'm working on this, please let me know if I can provide any more info or if you have any other suggestions, etc. Thanks!
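For (b), the quickest signal is InnoDB's own deadlock report: `SHOW ENGINE INNODB STATUS` keeps the most recent deadlock, including which statements collided. A sketch of the filter I'd run on each Galera node; the here-doc is a synthetic stand-in for the real status output, and the UPDATE shown is a made-up example in the spirit of the fact-update deadlocks from the gist:

```shell
# Extract only the "LATEST DETECTED DEADLOCK" section from InnoDB status.
# On a real node you would pipe the real output into the same sed filter:
#   mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS$/p'
status_output=$(cat <<'EOF'
------------------------
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
UPDATE fact_values SET value = '...' WHERE id = 42
*** WE ROLL BACK TRANSACTION (1)
------------
TRANSACTIONS
------------
EOF
)
echo "$status_output" | sed -n '/LATEST DETECTED DEADLOCK/,/^TRANSACTIONS$/p'
```

Comparing the deadlock report on a single node against the Galera cluster should show whether certification conflicts from multi-master writes are what's biting the fact updates.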