quick update:

since kicking httpd on the jenkins master "fixes" the GUI hanging, i set up
a cron job to restart httpd 4 times per day.
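
for anyone curious what an entry like that looks like, here is a minimal sketch that builds a crontab line for evenly spaced restarts. the helper name `restart_cron_line`, the exact hours, and the restart command are all assumptions for illustration, not the entry actually installed on the master.

```python
def restart_cron_line(times_per_day: int, command: str) -> str:
    """Build a user-crontab entry that runs `command` at evenly spaced hours."""
    if times_per_day < 1 or 24 % times_per_day != 0:
        raise ValueError("times_per_day must divide 24 evenly")
    step = 24 // times_per_day
    hours = ",".join(str(h) for h in range(0, 24, step))
    # crontab fields: minute hour day-of-month month day-of-week command
    # (a system crontab in /etc/cron.d would also need a user field)
    return f"0 {hours} * * * {command}"


if __name__ == "__main__":
    # the restart command is an assumption; on a sysvinit box it might be
    # `service httpd restart` instead
    print(restart_cron_line(4, "/usr/bin/systemctl restart httpd"))
```

the printed line could then be pasted into root's crontab via `crontab -e`.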

this is not the final solution, but will definitely help over the weekend
as i'm heading out of town.

shane

On Fri, Mar 22, 2019 at 9:50 AM shane knapp <skn...@berkeley.edu> wrote:

> i was right to not hold my breath...  while my apache changes seem to have
> helped a bit, things are still slowing down after 10-12 hours.
>
> i have a few other things i can look at, and will get as much done as
> possible before the weekend.  serious troubleshooting will begin anew
> monday.
>
> apologies again,
>
> shane
>
> On Thu, Mar 21, 2019 at 12:54 PM shane knapp <skn...@berkeley.edu> wrote:
>
>> i tweaked some apache settings (MaxClients increased to fix an error i
>> found buried in the logs, and added 'retry' and 'acquire' to the reverse
>> proxy settings to hopefully combat the dreaded 502 response), restarted
>> httpd and things actually seem quite snappy right now!
>>
>> i'm not holding my breath, however...  only time will tell.
>>
>> On Tue, Mar 19, 2019 at 7:18 AM Imran Rashid <iras...@apache.org> wrote:
>>
>>> seems wedged again?
>>>
>>> sorry for the bad news Shane, thanks for all the work on fixing it
>>>
>>> On Mon, Mar 18, 2019 at 4:02 PM shane knapp <skn...@berkeley.edu> wrote:
>>>
>>>> ok, i dug through the logs and noticed that rsyslogd was dropping
>>>> messages due to imuxsock being spammed by postfix...  which i then tracked
>>>> down to our installation of fail2ban being incorrectly configured and
>>>> attempting to send IP ban/unban status emails to 'em...@example.com'.
>>>>
>>>> since we're a university, and especially one w/a reputation like ours,
>>>> we are constantly under attack.  the logs of the attempted dictionary
>>>> attacks would astound you in their size and scope.  since we have so many
>>>> ban/unban actions happening for all of these unique IP addresses, each of
>>>> which generates an email that was directed to an invalid address, we ended
>>>> up w/well over 100M of plain-text messages waiting in the mail queue.
>>>> postfix was continually trying to send these messages, which was causing
>>>> the system to behave strangely, including breaking rsyslogd.
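
a backlog like this can be sized up per recipient before purging. the sketch below is only an illustration: it assumes a postfix new enough (3.1 or later) to support `postqueue -j`, which prints one JSON object per queued message, and the script name and the helper `summarize_queue` are hypothetical.

```python
import json
import sys
from collections import Counter


def summarize_queue(json_lines):
    """Tally queued messages and bytes per recipient address."""
    counts = Counter()
    sizes = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        for rcpt in msg.get("recipients", []):
            addr = rcpt.get("address", "?")
            counts[addr] += 1
            # message size is counted once per recipient, so a message
            # with several recipients adds its bytes to each of them
            sizes[addr] += msg.get("message_size", 0)
    return counts, sizes


if __name__ == "__main__":
    counts, sizes = summarize_queue(sys.stdin)
    for addr, n in counts.most_common(10):
        print(f"{n:8d} msgs {sizes[addr]:12d} bytes  {addr}")
```

usage would be something like `postqueue -j | python3 queue_summary.py`; a single address holding nearly all of the queue points at a misconfigured notifier.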
>>>>
>>>> so, i disabled email reports in fail2ban, restarted the impacted
>>>> services, picked my sysadmin's brain and then purged the mail queue (when
>>>> was the last time anyone actually used postfix?).  jenkins now seems to be
>>>> behaving (maybe?).
>>>>
>>>> i'm not entirely sure that this will fix the strange GUI hangs, but all
>>>> reports i found on stackoverflow and other sites detail strange system
>>>> behavior across the board when rsyslogd starts dropping messages.  at the
>>>> very least we won't be (potentially) losing system-level log messages
>>>> anymore, which might actually help me track down what's happening if
>>>> jenkins gets wedged again.
>>>>
>>>> and finally, the obligatory IT Crowd clip:
>>>> https://www.youtube.com/watch?v=5UT8RkSmN4k
>>>>
>>>> shane (who expects jenkins to crash within 5 minutes of this email
>>>> going out)
>>>>
>>>> On Fri, Mar 15, 2019 at 8:22 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> It's not responding again. Is there any way to kick it harder? I know
>>>>> it's well understood but this means not much can be merged in Spark
>>>>>
>>>>> On Fri, Mar 15, 2019 at 12:08 PM shane knapp <skn...@berkeley.edu>
>>>>> wrote:
>>>>> >
>>>>> > well, that box rebooted in record time!  we're back up and building.
>>>>> >
>>>>> > and as always, i'll keep a close eye on things today...  jenkins
>>>>> > usually works great, until it doesn't.  :\
>>>>> >
>>>>> > On Fri, Mar 15, 2019 at 9:52 AM shane knapp <skn...@berkeley.edu>
>>>>> > wrote:
>>>>> >>
>>>>> >> as some of you may have noticed, jenkins got itself in a bad state
>>>>> >> multiple times over the past couple of weeks.  usually restarting the
>>>>> >> service is sufficient, but it appears that i need to hit it w/the reboot
>>>>> >> hammer.
>>>>> >>
>>>>> >> jenkins will be down for the next 20-30 minutes as the node reboots
>>>>> >> and jenkins spins back up.  i'll reply here w/any updates.
>>>>> >>
>>>>> >> shane
>>>>> >> --
>>>>> >> Shane Knapp
>>>>> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> >> https://rise.cs.berkeley.edu
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
