MoritzMuehlenhoff added a comment.
In T294961#7479399 <https://phabricator.wikimedia.org/T294961#7479399>, @EBernhardson wrote:

> Without having better information, I would guess we are triggering a deadlock somewhere in the kernel related to disk writes? But no particular locking is mentioned in the traces, so perhaps not. As it affected all 6 instances it seems fairly reproducible, although perhaps we should try to take the application out of the picture and reproduce with some simple tools that generate disk writes? This might be fairly tedious: the error took more than a day to trigger, and the only idea I have for reproducing externally is to generate random write loads of similar size.
>
> @MoritzMuehlenhoff as someone who deals with the kernel often, any suggestions for where to investigate?

It looks to be deadlocking somewhere deep in the I/O layer. There are some sysctls and kernel settings that we could fine-tune, but given that this only happens after a full day's run, that will be a slow process.

I see two next steps that we should try:

1. According to Netbox these were purchased in March, but that doesn't necessarily mean that the system firmware is up to date. Servers are often delivered with the firmware version from when the specific server model originally shipped. We could ask DC ops to upgrade one of the servers to the latest versions and re-test.
2. These are Buster systems, but we can try the 5.10 kernel available from Debian backports (installable with "apt-get install linux-image-5.10.0-0.bpo.9"). We have a handful of services which already use that kernel on Buster (e.g. the Hadoop and stat* hosts with the AMD GPUs).
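For the external reproduction idea raised in the quoted comment (random write loads of similar size, with the application out of the picture), a minimal sketch could look like the following. It is only an illustration: the function name `random_write_load`, its parameters, and the sizes used are hypothetical, not taken from the task, and the real write pattern that triggers the hang is not known.

```python
import os
import tempfile


def random_write_load(path, file_size, block_size, num_writes):
    """Write random blocks at random block-aligned offsets within one file.

    All names and sizes here are illustrative assumptions; they are not
    derived from the actual workload on the affected hosts.
    Returns the total number of bytes written.
    """
    blocks = file_size // block_size
    written = 0
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Pre-size the file so every offset below falls inside it.
        os.ftruncate(fd, file_size)
        for _ in range(num_writes):
            # Pick a random block index from random bytes, then write there.
            index = int.from_bytes(os.urandom(8), "big") % blocks
            data = os.urandom(block_size)
            written += os.pwrite(fd, data, index * block_size)
        # Force the data out of the page cache and down to the disk.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    # Small demo run in a temporary directory; a real attempt would point
    # at the affected filesystem and run for much longer.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "load.bin")
        total = random_write_load(target, 64 * 1024 * 1024, 4096, 1000)
        print(f"wrote {total} bytes to {target}")
```

Since the original error took more than a day to appear, a loop like this would presumably need to run for a comparable time on the affected disks, ideally with several writers in parallel, before a negative result means anything.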
TASK DETAIL https://phabricator.wikimedia.org/T294961