MoritzMuehlenhoff added a comment.

  In T294961#7479399 <https://phabricator.wikimedia.org/T294961#7479399>, 
@EBernhardson wrote:
  
  > Without having better information, I would guess we are triggering a 
deadlock somewhere in the kernel related to disk writes? But no particular 
locking is mentioned in the traces so perhaps not. As it affected all 6 
instances it seems fairly reproducible, although perhaps we should try and take 
the application out of the picture and try to reproduce with some simple tools 
to generate disk writes? This might be fairly tedious, the error took more than 
a day to trigger and the only idea I have for reproducing externally is to 
generate random write loads of similar size.
  >
  > @MoritzMuehlenhoff as someone who deals with the kernel often, any 
suggestions for where to investigate?
  
  It looks to be deadlocking somewhere deep in the I/O layer. There are some 
sysctls and kernel settings that we could fine-tune, but given that this only 
happens after a full day's run, that will be a slow process.
  
  I see two next steps that we should try:
  
  1. According to Netbox these were purchased in March, but that doesn't 
necessarily mean that the system firmware is up to date. Servers are often 
delivered with the firmware version that was current when the specific model 
first shipped. We could ask DC ops to upgrade one of the servers to the 
latest versions and re-test.
  2. These are Buster systems, but we can try the 5.10 kernel available from 
Debian backports (installable with "apt-get install 
linux-image-5.10.0-0.bpo.9"). We have a handful of services which already use 
that kernel on Buster (e.g. the Hadoop and stat* hosts with the AMD GPUs).
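
  For the idea quoted above of taking the application out of the picture, 
a minimal sketch of a random write load generator could look like the 
following. Everything in it is an assumption for illustration: the function 
name random_write_load, the file size, the block size and the fsync interval 
are hypothetical and not derived from the actual workload, and there is no 
guarantee that such a load triggers the same hang.

```python
import os
import random
import tempfile


def random_write_load(path, file_size, block_size, num_writes,
                      seed=None, fsync_every=64):
    """Write num_writes blocks at random block-aligned offsets in path.

    All parameters are illustrative assumptions, not values taken from
    the affected hosts. Returns the total number of bytes written.
    """
    rng = random.Random(seed)
    num_blocks = file_size // block_size
    if num_blocks < 1:
        raise ValueError("file_size must be at least block_size")

    written = 0
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Set the file length up front; the file stays sparse until written.
        os.ftruncate(fd, file_size)
        for i in range(1, num_writes + 1):
            offset = rng.randrange(num_blocks) * block_size
            written += os.pwrite(fd, os.urandom(block_size), offset)
            if fsync_every and i % fsync_every == 0:
                # Force dirty pages out so the writes reach the I/O layer.
                os.fsync(fd)
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    # Small demo run against a temporary file; a real test would point at
    # the affected disk and run for much longer.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "load.bin")
        total = random_write_load(target, 64 * 1024 * 1024, 4096, 1000, seed=1)
        print(f"wrote {total} bytes to {target}")
```

  To approximate the real workload, the sizes and the write count would 
need to be adjusted to match what the application actually does, and the 
script would have to run on the affected hosts for a day or more.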

TASK DETAIL
  https://phabricator.wikimedia.org/T294961
