quick update: since kicking httpd on the jenkins master "fixes" the GUI hanging, i set up a cron job to restart httpd 4 times per day.
this is not the final solution, but will definitely help over the weekend as
i'm heading out of town.

shane

On Fri, Mar 22, 2019 at 9:50 AM shane knapp <skn...@berkeley.edu> wrote:

> i was right to not hold my breath... while my apache changes seem to have
> helped a bit, things are still slowing down after 10-12 hours.
>
> i have a few other things i can look at, and will get as much done as
> possible before the weekend. serious troubleshooting will begin anew
> monday.
>
> apologies again,
>
> shane
>
> On Thu, Mar 21, 2019 at 12:54 PM shane knapp <skn...@berkeley.edu> wrote:
>
>> i tweaked some apache settings (MaxClients increased to fix an error i
>> found buried in the logs, and added 'retry' and 'acquire' to the reverse
>> proxy settings to hopefully combat the dreaded 502 response), restarted
>> httpd, and things actually seem quite snappy right now!
>>
>> i'm not holding my breath, however... only time will tell.
>>
>> On Tue, Mar 19, 2019 at 7:18 AM Imran Rashid <iras...@apache.org> wrote:
>>
>>> seems wedged again?
>>>
>>> sorry for the bad news Shane, thanks for all the work on fixing it
>>>
>>> On Mon, Mar 18, 2019 at 4:02 PM shane knapp <skn...@berkeley.edu> wrote:
>>>
>>>> ok, i dug through the logs and noticed that rsyslogd was dropping
>>>> messages due to imuxsock being spammed by postfix... which i then
>>>> tracked down to our installation of fail2ban being incorrectly
>>>> configured and attempting to send IP ban/unban status emails to
>>>> 'em...@example.com'.
>>>>
>>>> since we're a university, and especially one w/a reputation like ours,
>>>> we are constantly under attack. the logs of the attempted dictionary
>>>> attacks would astound you in their size and scope. since we have so
>>>> many ban/unban actions happening for all of these unique IP addresses,
>>>> each of which generates an email that was directed to an invalid
>>>> address, we ended up w/well over 100M of plain-text messages waiting
>>>> in the mail queue.
>>>> postfix was continually trying to send these messages, which was
>>>> causing the system to behave strangely, including breaking rsyslogd.
>>>>
>>>> so, i disabled email reports in fail2ban, restarted the impacted
>>>> services, picked my sysadmin's brain and then purged the mail queue
>>>> (when was the last time anyone actually used postfix?). jenkins now
>>>> seems to be behaving (maybe?).
>>>>
>>>> i'm not entirely sure that this will fix the strange GUI hangs, but
>>>> all reports i found on stackoverflow and other sites detail strange
>>>> system behavior across the board when rsyslogd starts dropping
>>>> messages. at the very least we won't be (potentially) losing
>>>> system-level log messages anymore, which might actually help me track
>>>> down what's happening if jenkins gets wedged again.
>>>>
>>>> and finally, the obligatory IT Crowd clip:
>>>> https://www.youtube.com/watch?v=5UT8RkSmN4k
>>>>
>>>> shane (who expects jenkins to crash within 5 minutes of this email
>>>> going out)
>>>>
>>>> On Fri, Mar 15, 2019 at 8:22 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> It's not responding again. Is there any way to kick it harder? I know
>>>>> it's well understood but this means not much can be merged in Spark
>>>>>
>>>>> On Fri, Mar 15, 2019 at 12:08 PM shane knapp <skn...@berkeley.edu>
>>>>> wrote:
>>>>> >
>>>>> > well, that box rebooted in record time! we're back up and building.
>>>>> >
>>>>> > and as always, i'll keep a close eye on things today... jenkins
>>>>> usually works great, until it doesn't. :\
>>>>> >
>>>>> > On Fri, Mar 15, 2019 at 9:52 AM shane knapp <skn...@berkeley.edu>
>>>>> wrote:
>>>>> >>
>>>>> >> as some of you may have noticed, jenkins got itself in a bad state
>>>>> multiple times over the past couple of weeks. usually restarting the
>>>>> service is sufficient, but it appears that i need to hit it w/the
>>>>> reboot hammer.
>>>>> >>
>>>>> >> jenkins will be down for the next 20-30 minutes as the node
>>>>> reboots and jenkins spins back up. i'll reply here w/any updates.
>>>>> >>
>>>>> >> shane
>>>>> >> --
>>>>> >> Shane Knapp
>>>>> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> >> https://rise.cs.berkeley.edu

--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
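[editor's note: the stopgap cron job described at the top of the thread
("restart httpd 4 times per day") might look roughly like this; the actual
crontab entry isn't shown, so the schedule and service command here are
assumptions about a systemd-based host:]

```
# illustrative root crontab entry (hypothetical -- not the real one):
# restart httpd every 6 hours, i.e. four times per day
0 */6 * * * /usr/bin/systemctl restart httpd
```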
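[editor's note: the Mar 21 apache tweaks (raising MaxClients, adding 'retry'
and 'acquire' to the reverse proxy) would look roughly like the httpd.conf
fragment below. the specific values and the Jenkins backend URL are
illustrative assumptions, not taken from the real config:]

```apache
# prefork MPM: raise the concurrent-client ceiling
# (MaxClients is the pre-2.4 directive name; 2.4 calls it MaxRequestWorkers)
MaxClients 512

# reverse proxy to jenkins: retry a failed backend worker after 5s instead
# of the 60s default, and wait at most 3000ms for a free connection from
# the pool -- both aimed at reducing 502 responses to clients
ProxyPass        /jenkins http://localhost:8080/jenkins retry=5 acquire=3000
ProxyPassReverse /jenkins http://localhost:8080/jenkins
```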
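[editor's note: the Mar 18 fail2ban/postfix fix boils down to switching the
jails from a mail-sending action to the ban-only action, then draining the
stuck queue. a sketch, assuming a stock fail2ban + postfix install; the
exact jail config on the jenkins master isn't shown in the thread:]

```
# /etc/fail2ban/jail.local -- use the ban-only action instead of
# action_mw / action_mwl, which email ban/unban reports on every action
[DEFAULT]
action = %(action_)s

# then restart the impacted services and purge the backed-up mail:
#   systemctl restart fail2ban rsyslog
#   postqueue -p          # inspect the queue first
#   postsuper -d ALL      # delete everything in the queue
```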