i tweaked some apache settings (increased MaxClients to fix an error i found buried in the logs, and added 'retry' and 'acquire' to the reverse proxy settings to hopefully combat the dreaded 502 responses), restarted httpd, and things actually seem quite snappy right now!
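for the curious, the relevant bits of the httpd config now look roughly like this (the numbers and the jenkins backend URL are illustrative/from memory, not copied verbatim off the prod box):

    # prefork MPM: allow more simultaneous workers so requests don't
    # queue up behind slow jenkins responses
    MaxClients           256

    # reverse proxy to jenkins:
    #   retry=0    -- don't blacklist the backend worker for 60s after one
    #                 failed request (a common source of cascading 502s)
    #   acquire=3000 -- wait at most 3000ms for a free connection from the pool
    ProxyPass        /jenkins http://localhost:8080/jenkins retry=0 acquire=3000 timeout=600
    ProxyPassReverse /jenkins http://localhost:8080/jenkins

the retry=0 bit is the one i'm most hopeful about... by default apache sits on a failed backend for a full minute, which is usually what turns a single jenkins hiccup into a wall of 502s.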
i'm not holding my breath, however... only time will tell.

On Tue, Mar 19, 2019 at 7:18 AM Imran Rashid <iras...@apache.org> wrote:

> seems wedged again?
>
> sorry for the bad news Shane, thanks for all the work on fixing it
>
> On Mon, Mar 18, 2019 at 4:02 PM shane knapp <skn...@berkeley.edu> wrote:
>
>> ok, i dug through the logs and noticed that rsyslogd was dropping messages due to imuxsock being spammed by postfix... which i then tracked down to our installation of fail2ban being incorrectly configured and attempting to send IP ban/unban status emails to 'em...@example.com'.
>>
>> since we're a university, and especially one w/a reputation like ours, we are constantly under attack. the logs of the attempted dictionary attacks would astound you in their size and scope. since we have so many ban/unban actions happening for all of these unique IP addresses, each of which generated an email directed to an invalid address, we ended up w/well over 100M of plain-text messages waiting in the mail queue. postfix was continually trying to send these messages, which was causing the system to behave strangely, including breaking rsyslogd.
>>
>> so, i disabled email reports in fail2ban, restarted the impacted services, picked my sysadmin's brain and then purged the mail queue (when was the last time anyone actually used postfix?). jenkins now seems to be behaving (maybe?).
>>
>> i'm not entirely sure that this will fix the strange GUI hangs, but all reports i found on stackoverflow and other sites detail strange system behavior across the board when rsyslogd starts dropping messages. at the very least we won't be (potentially) losing system-level log messages anymore, which might actually help me track down what's happening if jenkins gets wedged again.
>>
>> and finally, the obligatory IT Crowd clip: https://www.youtube.com/watch?v=5UT8RkSmN4k
>>
>> shane (who expects jenkins to crash within 5 minutes of this email going out)
>>
>> On Fri, Mar 15, 2019 at 8:22 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> It's not responding again. Is there any way to kick it harder? I know it's well understood but this means not much can be merged in Spark
>>>
>>> On Fri, Mar 15, 2019 at 12:08 PM shane knapp <skn...@berkeley.edu> wrote:
>>>>
>>>> well, that box rebooted in record time! we're back up and building.
>>>>
>>>> and as always, i'll keep a close eye on things today... jenkins usually works great, until it doesn't. :\
>>>>
>>>> On Fri, Mar 15, 2019 at 9:52 AM shane knapp <skn...@berkeley.edu> wrote:
>>>>>
>>>>> as some of you may have noticed, jenkins got itself into a bad state multiple times over the past couple of weeks. usually restarting the service is sufficient, but it appears that i need to hit it w/the reboot hammer.
>>>>>
>>>>> jenkins will be down for the next 20-30 minutes as the node reboots and jenkins spins back up. i'll reply here w/any updates.
>>>>>
>>>>> shane
>>>>> --
>>>>> Shane Knapp
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu

--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
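p.s. for anyone who wants to replicate the fail2ban/postfix cleanup described in the quoted thread, it was roughly the following (config paths, jail settings and exact commands are from memory, so treat this as a sketch rather than a copy/paste recipe):

    # /etc/fail2ban/jail.local -- switch the default action from
    # "ban + email report" to "ban only" so each ban/unban stops
    # generating a status email
    [DEFAULT]
    # action_mwl = ban + mail w/whois + log excerpt; action_ = ban only
    action = %(action_)s

    # restart the affected services, then check and flush the stuck mail
    sudo systemctl restart fail2ban rsyslog postfix
    sudo postqueue -p | tail -n 1    # last line shows the queue size ("... Requests.")
    sudo postsuper -d ALL            # purge everything in the postfix queue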