Re: [OpenStack-Infra] Ask.o.o down

2017-03-07 Thread Paul Belanger
On Tue, Mar 07, 2017 at 07:39:13PM +1100, Ian Wienand wrote:
> On 03/07/2017 07:20 PM, Gene Kuo wrote:
> > These errors do line up to the time where it's down.
> > However, I have no idea what causes Apache to segfault.
> 
> Something disappearing underneath it would be my suspicion
> 
> Anyway, I added "CoreDumpDirectory /var/cache/apache2" to
> /etc/apache2/apache2.conf manually (don't think it's puppet managed?)
> 
> Let's see if we can pick up a core dump, we can
> at least then trace it back
> 
You could do something like 359278[1].

[1] https://review.openstack.org/#/c/359278

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Ask.o.o down

2017-03-07 Thread Ian Wienand

On 03/07/2017 07:20 PM, Gene Kuo wrote:

These errors do line up to the time where it's down.
However, I have no idea what causes Apache to segfault.


Something disappearing underneath it would be my suspicion

Anyway, I added "CoreDumpDirectory /var/cache/apache2" to
/etc/apache2/apache2.conf manually (don't think it's puppet managed?)

Let's see if we can pick up a core dump, we can
at least then trace it back
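
Setting CoreDumpDirectory only helps if the Apache processes are allowed to write core files at all; on Linux, a soft core-size limit of 0 generally suppresses the dump. Below is a minimal sketch for checking that, assuming a Linux host where /proc/<pid>/limits is readable. The helper name core_limit is made up for illustration and is not part of Apache or of any tooling mentioned in this thread.

```python
def core_limit(limits_text):
    """Return (soft, hard) core-size limits parsed from the text of
    /proc/<pid>/limits, or None if the row is missing."""
    label = "Max core file size"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


if __name__ == "__main__":
    # Illustrative sample in the layout the kernel uses for this file.
    sample = (
        "Limit                     Soft Limit           Hard Limit           Units\n"
        "Max core file size        0                    unlimited            bytes\n"
    )
    print(core_limit(sample))
```

On the server itself, the text would come from something like open(f"/proc/{pid}/limits").read(), where pid is the process ID of one of the Apache children (a value you would have to look up; none is assumed here).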

-i



Re: [OpenStack-Infra] Ask.o.o down

2017-03-07 Thread Gene Kuo

These errors do line up to the time where it's down.
However, I have no idea what causes Apache to segfault.
 
-Original Message-
From: "Ian Wienand" 
Sent: Tuesday, March 7, 2017 3:10am
To: "Gene Kuo" , "Tom Fifield" 
Cc: openstack-infra@lists.openstack.org
Subject: Re: [OpenStack-Infra] Ask.o.o down

On 03/07/2017 06:46 PM, Gene Kuo wrote:
> I found that ask.o.o is down again. 

I restarted apache

---
root@ask:/var/log/apache2# date
Tue Mar 7 07:54:26 UTC 2017

[Tue Mar 07 06:01:38.469993 2017] [core:notice] [pid 19511:tid 140460060575616] 
AH00052: child pid 19517 exit signal Segmentation fault (11)
[Tue Mar 07 06:31:21.397621 2017] [mpm_event:notice] [pid 19511:tid 
140460060575616] AH00493: SIGUSR1 received. Doing graceful restart
[Tue Mar 07 06:31:21.529687 2017] [core:notice] [pid 19511] AH00060: seg fault 
or similar nasty error detected in the parent process
---

Do these errors line up with the start of the failure?

Here are the recent failures I can find; do these line up with the failure
times? They do not seem to occur at a consistent time, as the cron job
theory would suggest.

---
error.log.1 :[Tue Mar 07 06:01:38.469993 2017] [core:notice] [pid 19511:tid 
140460060575616] AH00052: child pid 19517 exit signal Segmentation fault (11)
error.log.2.gz:[Sun Mar 05 17:42:53.457689 2017] [core:notice] [pid 18065:tid 
140399259293568] AH00052: child pid 16820 exit signal Segmentation fault (11)
error.log.4.gz:[Fri Mar 03 18:21:50.242282 2017] [core:notice] [pid 6066:tid 
140255921559424] AH00052: child pid 6072 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 08:40:37.026106 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 10324 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 13:38:41.474969 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 11891 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 13:57:44.712564 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 9635 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 22:03:17.717681 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 19925 exit signal Segmentation fault (11)
error.log.6.gz:[Thu Mar 02 05:51:14.236546 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 15214 exit signal Segmentation fault (11)
---

I'm really not sure what happened with the logs; the 8th rotation
seems to have disappeared and then they get really old.

---
-rw-r----- 1 root adm 481 Mar 1 06:54 error.log.7.gz
-rw-r----- 1 root adm 4831 Oct 3 06:44 error.log.9.gz
---

-i

Re: [OpenStack-Infra] Ask.o.o down

2017-03-07 Thread Ian Wienand
On 03/07/2017 06:46 PM, Gene Kuo wrote:
> I found that ask.o.o is down again. 

I restarted apache

---
root@ask:/var/log/apache2# date
Tue Mar  7 07:54:26 UTC 2017

[Tue Mar 07 06:01:38.469993 2017] [core:notice] [pid 19511:tid 140460060575616] 
AH00052: child pid 19517 exit signal Segmentation fault (11)
[Tue Mar 07 06:31:21.397621 2017] [mpm_event:notice] [pid 19511:tid 
140460060575616] AH00493: SIGUSR1 received.  Doing graceful restart
[Tue Mar 07 06:31:21.529687 2017] [core:notice] [pid 19511] AH00060: seg fault 
or similar nasty error detected in the parent process
---

Do these errors line up with the start of the failure?

Here are the recent failures I can find; do these line up with the failure
times? They do not seem to occur at a consistent time, as the cron job
theory would suggest.

---
error.log.1   :[Tue Mar 07 06:01:38.469993 2017] [core:notice] [pid 19511:tid 
140460060575616] AH00052: child pid 19517 exit signal Segmentation fault (11)
error.log.2.gz:[Sun Mar 05 17:42:53.457689 2017] [core:notice] [pid 18065:tid 
140399259293568] AH00052: child pid 16820 exit signal Segmentation fault (11)
error.log.4.gz:[Fri Mar 03 18:21:50.242282 2017] [core:notice] [pid 6066:tid 
140255921559424] AH00052: child pid 6072 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 08:40:37.026106 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 10324 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 13:38:41.474969 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 11891 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 13:57:44.712564 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 9635 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 22:03:17.717681 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 19925 exit signal Segmentation fault (11)
error.log.6.gz:[Thu Mar 02 05:51:14.236546 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 15214 exit signal Segmentation fault (11)
---
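
To compare the segfault times against the suspected cron window without eyeballing the logs, something like the following could be used. It is only a sketch: the regular expression assumes the error-log layout shown above, the 06:30 to 08:00 UTC window is the range discussed earlier in this thread, and the function names are invented for illustration.

```python
import re
from datetime import datetime, time

# Matches AH00052 child-segfault entries in the error-log layout shown above.
SEGFAULT_RE = re.compile(
    r"\[\w{3} (\w{3} \d{2} \d{2}:\d{2}:\d{2})\.\d+ (\d{4})\] "
    r"\[core:notice\] \[pid \d+:tid \d+\] "
    r"AH00052: child pid (\d+) exit signal Segmentation fault"
)


def segfault_times(log_text):
    """Return a list of (timestamp, child_pid) for each segfault entry."""
    # Entries pasted into mail are wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        (datetime.strptime(f"{stamp} {year}", "%b %d %H:%M:%S %Y"), int(pid))
        for stamp, year, pid in SEGFAULT_RE.findall(flat)
    ]


def in_window(when, start=time(6, 30), end=time(8, 0)):
    """True if the timestamp's time of day falls inside the window."""
    return start <= when.time() <= end


if __name__ == "__main__":
    import sys

    for when, pid in segfault_times(sys.stdin.read()):
        flag = "in window" if in_window(when) else "outside window"
        print(when.isoformat(), pid, flag)
```

Piping the output of zcat -f over the error.log* files into the script would list each segfault with a note on whether it falls inside the window.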

I'm really not sure what happened with the logs; the 8th rotation
seems to have disappeared and then they get really old.

---
-rw-r----- 1 root adm  481 Mar  1 06:54 error.log.7.gz
-rw-r----- 1 root adm 4831 Oct  3 06:44 error.log.9.gz
---

-i



Re: [OpenStack-Infra] Ask.o.o down

2017-03-06 Thread Gene Kuo
Hi All,

I found that ask.o.o is down again. 
I’m able to ping the server but connection through port 80 is refused.
Can someone on the infra team check the server? The same problem happened 
yesterday at about 7 am UTC.
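
For anyone wanting to repeat this check from another host, here is a rough sketch of a probe that tells "connection refused" (host up, nothing listening on the port) apart from "network unreachable" and timeouts, separately for IPv4 and IPv6. The function name probe and the result strings are my own choices for illustration; only standard-library socket calls are used.

```python
import errno
import socket


def probe(host, port, family=socket.AF_INET, timeout=5.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return "no address"
    sockaddr = infos[0][4]
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
        except ConnectionRefusedError:
            return "refused"
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            if exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
                return "unreachable"
            return f"error: {exc.strerror}"
    return "open"


if __name__ == "__main__":
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        print(label, probe("ask.openstack.org", 80, family))
```

Running it for both address families keeps a local IPv6 routing problem from being mistaken for a server-side outage.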

Regards,

Gene Kuo

-Original Message-
From: "Tom Fifield" 
Sent: Tuesday, February 21, 2017 2:15am
To: openstack-infra@lists.openstack.org
Subject: Re: [OpenStack-Infra] Ask.o.o down



On 21/02/17 15:11, Tom Fifield wrote:
>
>
> On 14/02/17 16:19, Joshua Hesketh wrote:
>>
>>
>> On Tue, Feb 14, 2017 at 7:15 PM, Tom Fifield <t...@openstack.org> wrote:
>>
>> On 14/02/17 16:11, Joshua Hesketh wrote:
>>
>> Hey Tom,
>>
>> Where is that script being fired from (a quick grep doesn't find
>> it), or
>> is it a tool people are using?
>>
>> If it's a tool we'd need to make sure whoever is using it gets
>> a new
>> version to rule it out.
>>
>>
>> Indeed.
>>
>>
>> It's fired from a PHP service on www.openstack.org itself, which
>> writes to the Member database:
>>
>> https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/code/services/ActiveModeratorService.php
>>
>>
>>
>>
>> Right. I wonder if somebody could check the logs to see if the process
>> times out. Sadly looking at that code it looks like any output messages
>> from the script will be discarded.
>>
>
> ... and my patch was deployed, but the site is down today. So, looks
> like it wasn't that.

Though, is it staying down for less time? It came back up just now - 
normally it'd be down for another 45mins.

Interesting traffic spikes at:
http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=2549&rra_id=all

seem to correlate with the outage. Perhaps we can set up some tcpdumps?

>>
>>
>> The next step is to update the copy of the script it references:
>>
>>
>> https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/lib/uc-recognition/tools/get_active_moderator.py
>>
>>
>> I am not sure if this is in place using git submodules or manually,
>> but will figure it out and get that updated.
>>
>>
>>
>>
>>  - Josh
>>
>> On Tue, Feb 14, 2017 at 7:07 PM, Tom Fifield <t...@openstack.org> wrote:
>>
>> On 14/02/17 16:06, Joshua Hesketh wrote:
>>
>> Hey,
>>
>> I've brought the service back up, but have no new clues
>> as to why.
>>
>>
>> Cheers.
>>
>> Going to try: https://review.openstack.org/#/c/433478/
>> to see if this script is the culprit.
>>
>>
>> - Josh
>>
>> On Tue, Feb 14, 2017 at 6:50 PM, Tom Fifield <t...@openstack.org> wrote:
>>
>> On 10/02/17 22:39, Jeremy Stanley wrote:
>>
>> On 2017-02-10 16:08:51 +0800 (+0800), Tom Fifield wrote:
>> [...]
>>
>> Down again, this time with "Network is unreachable".
>>
>> [...]
>>
>> I'm not finding any obvious errors on the server nor relevant
>> maintenance notices/trouble tickets from the service provider to
>> explain this. I do see conspicuous gaps in network traffic volume

Re: [OpenStack-Infra] Ask.o.o down

2017-02-20 Thread Tom Fifield



On 21/02/17 15:11, Tom Fifield wrote:

On 14/02/17 16:19, Joshua Hesketh wrote:

On Tue, Feb 14, 2017 at 7:15 PM, Tom Fifield <t...@openstack.org> wrote:

On 14/02/17 16:11, Joshua Hesketh wrote:

Hey Tom,

Where is that script being fired from (a quick grep doesn't find
it), or
is it a tool people are using?

If it's a tool we'd need to make sure whoever is using it gets
a new
version to rule it out.


Indeed.


It's fired from a PHP service on www.openstack.org itself, which writes
to the Member database:


https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/code/services/ActiveModeratorService.php







Right. I wonder if somebody could check the logs to see if the process
times out. Sadly looking at that code it looks like any output messages
from the script will be discarded.



... and my patch was deployed, but the site is down today. So, looks
like it wasn't that.


Though, is it staying down for less time? It came back up just now - 
normally it'd be down for another 45mins.


Interesting traffic spikes at:
http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=2549&rra_id=all

seem to correlate with the outage. Perhaps we can set up some tcpdumps?




The next step is to update the copy of the script it references:


https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/lib/uc-recognition/tools/get_active_moderator.py





I am not sure if this is in place using git submodules or manually,
but will figure it out and get that updated.




 - Josh

On Tue, Feb 14, 2017 at 7:07 PM, Tom Fifield <t...@openstack.org> wrote:

On 14/02/17 16:06, Joshua Hesketh wrote:

Hey,

I've brought the service back up, but have no new clues
as to why.


Cheers.

Going to try: https://review.openstack.org/#/c/433478/
to see if this script is the culprit.


- Josh

On Tue, Feb 14, 2017 at 6:50 PM, Tom Fifield <t...@openstack.org> wrote:

Skipping back through previous days I find some similar gaps
starting anywhere from 06:30 to 07:00 and ending between 07:00 and
08:00 but they don't seem to occur every day and I'm not having much
luck finding a pattern. It _is_ conspicuously close to when
/etc/cron.daily scripts get fired from the crontab so might coincide
with log rotation/service restarts? The graphs don't show these gaps
correlating with any spikes in CPU, memory or disk activity so it
doesn't seem to be resource starvation (at least not for any common
resources we're tracking).


   

Re: [OpenStack-Infra] Ask.o.o down

2017-02-20 Thread Tom Fifield



On 14/02/17 16:19, Joshua Hesketh wrote:

On Tue, Feb 14, 2017 at 7:15 PM, Tom Fifield <t...@openstack.org> wrote:

On 14/02/17 16:11, Joshua Hesketh wrote:

Hey Tom,

Where is that script being fired from (a quick grep doesn't find
it), or
is it a tool people are using?

If it's a tool we'd need to make sure whoever is using it gets a new
version to rule it out.


Indeed.


It's fired from a PHP service on www.openstack.org itself, which writes
to the Member database:


https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/code/services/ActiveModeratorService.php





Right. I wonder if somebody could check the logs to see if the process
times out. Sadly looking at that code it looks like any output messages
from the script will be discarded.



... and my patch was deployed, but the site is down today. So, looks 
like it wasn't that.





The next step is to update the copy of the script it references:


https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/lib/uc-recognition/tools/get_active_moderator.py



I am not sure if this is in place using git submodules or manually,
but will figure it out and get that updated.




 - Josh

On Tue, Feb 14, 2017 at 7:07 PM, Tom Fifield <t...@openstack.org> wrote:

On 14/02/17 16:06, Joshua Hesketh wrote:

Hey,

I've brought the service back up, but have no new clues
as to why.


Cheers.

Going to try: https://review.openstack.org/#/c/433478/
to see if this script is the culprit.


- Josh

On Tue, Feb 14, 2017 at 6:50 PM, Tom Fifield <t...@openstack.org> wrote:

Skipping back through previous days I find some similar gaps
starting anywhere from 06:30 to 07:00 and ending between 07:00 and
08:00 but they don't seem to occur every day and I'm not having much
luck finding a pattern. It _is_ conspicuously close to when
/etc/cron.daily scripts get fired from the crontab so might coincide
with log rotation/service restarts? The graphs don't show these gaps
correlating with any spikes in CPU, memory or disk activity so it
doesn't seem to be resource starvation (at least not for any common
resources we're tracking).


Indeed. It's down again today during the same timeslot.

Another idea for the cron-based theory:




https://github.com/openstack/uc-recognition/blob/master/tools/get_active_moderator.py






Re: [OpenStack-Infra] Ask.o.o down

2017-02-14 Thread Joshua Hesketh
On Tue, Feb 14, 2017 at 7:15 PM, Tom Fifield  wrote:

> On 14/02/17 16:11, Joshua Hesketh wrote:
>
>> Hey Tom,
>>
>> Where is that script being fired from (a quick grep doesn't find it), or
>> is it a tool people are using?
>>
>> If it's a tool we'd need to make sure whoever is using it gets a new
>> version to rule it out.
>>
>
> Indeed.
>
>
> It's fired from a PHP service on www.openstack.org itself, which writes
> to the Member database:
>
> https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/code/services/ActiveModeratorService.php



Right. I wonder if somebody could check the logs to see if the process
times out. Sadly looking at that code it looks like any output messages
from the script will be discarded.

 - Josh




>
>
> The next step is to update the copy of the script it references:
>
> https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/lib/uc-recognition/tools/get_active_moderator.py
>
> I am not sure if this is in place using git submodules or manually, but
> will figure it out and get that updated.
>
>
>
>
>  - Josh
>>
>> On Tue, Feb 14, 2017 at 7:07 PM, Tom Fifield wrote:
>>
>> On 14/02/17 16:06, Joshua Hesketh wrote:
>>
>> Hey,
>>
>> I've brought the service back up, but have no new clues as to why.
>>
>>
>> Cheers.
>>
>> Going to try: https://review.openstack.org/#/c/433478/
>> 
>> to see if this script is culprit.
>>
>>
>> - Josh
>>
>> On Tue, Feb 14, 2017 at 6:50 PM, Tom Fifield wrote:
>>
>> On 10/02/17 22:39, Jeremy Stanley wrote:
>>
>> On 2017-02-10 16:08:51 +0800 (+0800), Tom Fifield wrote:
>> [...]
>>
>> Down again, this time with "Network is unreachable".
>>
>> [...]
>>
>> I'm not finding any obvious errors on the server nor relevant
>> maintenance notices/trouble tickets from the service provider to
>> explain this. I do see conspicuous gaps in network traffic volume
>> and system load from ~06:45 to ~08:10 UTC according to cacti:
>>
>> http://cacti.openstack.org/?tree_id=1&leaf_id=156
>>
>> Skipping back through previous days I find some similar gaps
>> starting anywhere from 06:30 to 07:00 and ending between 07:00 and
>> 08:00 but they don't seem to occur every day and I'm not having much
>> luck finding a pattern. It _is_ conspicuously close to when
>> /etc/cron.daily scripts get fired from the crontab so might coincide
>> with log rotation/service restarts? The graphs don't show these gaps
>> correlating with any spikes in CPU, memory or disk activity so it
>> doesn't seem to be resource starvation (at least not for any common
>> resources we're tracking).
>>
>>
>> Indeed. It's down again today during the same timeslot.
>>
>> Another idea for the cron-based theory:
>>
>>
>> https://github.com/openstack/uc-recognition/blob/master/tools/get_active_moderator.py
>>
>> loops through the list of Ask OpenStack users via the API on a cron
>> running on www.openstack.org. Not sure when that cron runs, but if
>> it's similar, this could potentially be a high-load generator.
>>
>>
>>
>>
>> Regards,
>>
>>
>> Tom
>>
>>

Re: [OpenStack-Infra] Ask.o.o down

2017-02-14 Thread Tom Fifield

On 14/02/17 16:11, Joshua Hesketh wrote:

Hey Tom,

Where is that script being fired from (a quick grep doesn't find it), or
is it a tool people are using?

If it's a tool we'd need to make sure whoever is using it gets a new
version to rule it out.


Indeed.


It's fired from a PHP service on www.openstack.org itself, which writes 
to the Member database:


https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/code/services/ActiveModeratorService.php

The next step is to update the copy of the script it references:

https://github.com/OpenStackweb/openstack-org/blob/master/auc-metrics/lib/uc-recognition/tools/get_active_moderator.py

I am not sure if this is in place using git submodules or manually, but 
will figure it out and get that updated.






 - Josh

On Tue, Feb 14, 2017 at 7:07 PM, Tom Fifield <t...@openstack.org> wrote:

On 14/02/17 16:06, Joshua Hesketh wrote:

Hey,

I've brought the service back up, but have no new clues as to why.


Cheers.

Going to try: https://review.openstack.org/#/c/433478/

to see if this script is culprit.


- Josh

On Tue, Feb 14, 2017 at 6:50 PM, Tom Fifield <t...@openstack.org> wrote:

On 10/02/17 22:39, Jeremy Stanley wrote:

On 2017-02-10 16:08:51 +0800 (+0800), Tom Fifield wrote:
[...]

Down again, this time with "Network is unreachable".

[...]

I'm not finding any obvious errors on the server nor relevant
maintenance notices/trouble tickets from the service provider to
explain this. I do see conspicuous gaps in network traffic volume
and system load from ~06:45 to ~08:10 UTC according to cacti:

http://cacti.openstack.org/?tree_id=1&leaf_id=156

Skipping back through previous days I find some similar gaps
starting anywhere from 06:30 to 07:00 and ending between 07:00 and
08:00 but they don't seem to occur every day and I'm not having much
luck finding a pattern. It _is_ conspicuously close to when
/etc/cron.daily scripts get fired from the crontab so might coincide
with log rotation/service restarts? The graphs don't show these gaps
correlating with any spikes in CPU, memory or disk activity so it
doesn't seem to be resource starvation (at least not for any common
resources we're tracking).


Indeed. It's down again today during the same timeslot.

Another idea for the cron-based theory:



https://github.com/openstack/uc-recognition/blob/master/tools/get_active_moderator.py

loops through the list of Ask OpenStack users via the API on a cron
running on www.openstack.org. Not sure when that cron runs, but if it's
similar, this could potentially be a high-load generator.




Regards,


Tom




Re: [OpenStack-Infra] Ask.o.o down

2017-02-14 Thread Joshua Hesketh
Hey Tom,

Where is that script being fired from (a quick grep doesn't find it), or is
it a tool people are using?

If it's a tool we'd need to make sure whoever is using it gets a new
version to rule it out.

 - Josh

On Tue, Feb 14, 2017 at 7:07 PM, Tom Fifield  wrote:

> On 14/02/17 16:06, Joshua Hesketh wrote:
>
>> Hey,
>>
>> I've brought the service back up, but have no new clues as to why.
>>
>
> Cheers.
>
> Going to try: https://review.openstack.org/#/c/433478/
> to see if this script is culprit.
>
>
> - Josh
>>
>> On Tue, Feb 14, 2017 at 6:50 PM, Tom Fifield wrote:
>>
>> On 10/02/17 22:39, Jeremy Stanley wrote:
>>
>> On 2017-02-10 16:08:51 +0800 (+0800), Tom Fifield wrote:
>> [...]
>>
>> Down again, this time with "Network is unreachable".
>>
>> [...]
>>
>> I'm not finding any obvious errors on the server nor relevant
>> maintenance notices/trouble tickets from the service provider to
>> explain this. I do see conspicuous gaps in network traffic volume
>> and system load from ~06:45 to ~08:10 UTC according to cacti:
>>
>> http://cacti.openstack.org/?tree_id=1&leaf_id=156
>> 
>>
>> Skipping back through previous days I find some similar gaps
>> starting anywhere from 06:30 to 07:00 and ending between 07:00 and
>> 08:00 but they don't seem to occur every day and I'm not having much
>> luck finding a pattern. It _is_ conspicuously close to when
>> /etc/cron.daily scripts get fired from the crontab so might coincide
>> with log rotation/service restarts? The graphs don't show these gaps
>> correlating with any spikes in CPU, memory or disk activity so it
>> doesn't seem to be resource starvation (at least not for any common
>> resources we're tracking).
>>
>>
>> Indeed. It's down again today during the same timeslot.
>>
>> Another idea for the cron-based theory:
>>
>> https://github.com/openstack/uc-recognition/blob/master/tools/get_active_moderator.py
>>
>> loops through the list of Ask OpenStack users via the API on a cron
>> running on www.openstack.org. Not sure when that cron runs, but if
>> it's similar, this could potentially be a high-load generator.
>>
>>
>>
>>
>> Regards,
>>
>>
>> Tom
>>
>>

Re: [OpenStack-Infra] Ask.o.o down

2017-02-14 Thread Tom Fifield

On 14/02/17 16:06, Joshua Hesketh wrote:

Hey,

I've brought the service back up, but have no new clues as to why.


Cheers.

Going to try: https://review.openstack.org/#/c/433478/
to see if this script is culprit.



- Josh

On Tue, Feb 14, 2017 at 6:50 PM, Tom Fifield <t...@openstack.org> wrote:

On 10/02/17 22:39, Jeremy Stanley wrote:

On 2017-02-10 16:08:51 +0800 (+0800), Tom Fifield wrote:
[...]

Down again, this time with "Network is unreachable".

[...]

I'm not finding any obvious errors on the server nor relevant
maintenance notices/trouble tickets from the service provider to
explain this. I do see conspicuous gaps in network traffic volume
and system load from ~06:45 to ~08:10 UTC according to cacti:

http://cacti.openstack.org/?tree_id=1&leaf_id=156


Skipping back through previous days I find some similar gaps
starting anywhere from 06:30 to 07:00 and ending between 07:00 and
08:00 but they don't seem to occur every day and I'm not having much
luck finding a pattern. It _is_ conspicuously close to when
/etc/cron.daily scripts get fired from the crontab so might coincide
with log rotation/service restarts? The graphs don't show these gaps
correlating with any spikes in CPU, memory or disk activity so it
doesn't seem to be resource starvation (at least not for any common
resources we're tracking).


Indeed. It's down again today during the same timeslot.

Another idea for the cron-based theory:


https://github.com/openstack/uc-recognition/blob/master/tools/get_active_moderator.py



loops through the list of Ask OpenStack users via the API on a cron
running on www.openstack.org. Not sure when that cron runs, but if it's
similar, this could potentially be a high-load generator.




Regards,


Tom




Re: [OpenStack-Infra] Ask.o.o down

2017-02-14 Thread Joshua Hesketh
Hey,

I've brought the service back up, but have no new clues as to why.

- Josh

On Tue, Feb 14, 2017 at 6:50 PM, Tom Fifield  wrote:

> On 10/02/17 22:39, Jeremy Stanley wrote:
>
>> On 2017-02-10 16:08:51 +0800 (+0800), Tom Fifield wrote:
>> [...]
>>
>>> Down again, this time with "Network is unreachable".
>>>
>> [...]
>>
>> I'm not finding any obvious errors on the server nor relevant
>> maintenance notices/trouble tickets from the service provider to
>> explain this. I do see conspicuous gaps in network traffic volume
>> and system load from ~06:45 to ~08:10 UTC according to cacti:
>>
>> http://cacti.openstack.org/?tree_id=1&leaf_id=156
>>
>> Skipping back through previous days I find some similar gaps
>> starting anywhere from 06:30 to 07:00 and ending between 07:00 and
>> 08:00 but they don't seem to occur every day and I'm not having much
>> luck finding a pattern. It _is_ conspicuously close to when
>> /etc/cron.daily scripts get fired from the crontab so might coincide
>> with log rotation/service restarts? The graphs don't show these gaps
>> correlating with any spikes in CPU, memory or disk activity so it
>> doesn't seem to be resource starvation (at least not for any common
>> resources we're tracking).
>>
>>
> Indeed. It's down again today during the same timeslot.
>
> Another idea for the cron-based theory:
>
> https://github.com/openstack/uc-recognition/blob/master/tools/get_active_moderator.py
>
> loops through the list of Ask OpenStack users via the API on a cron
> running on www.openstack.org. Not sure when that cron runs, but if it's
> similar, this could potentially be a high-load generator.
>
>
>
>
> Regards,
>
>
> Tom
>
>

Re: [OpenStack-Infra] Ask.o.o down

2017-02-13 Thread Tom Fifield

On 10/02/17 22:39, Jeremy Stanley wrote:

On 2017-02-10 16:08:51 +0800 (+0800), Tom Fifield wrote:
[...]

Down again, this time with "Network is unreachable".

[...]

I'm not finding any obvious errors on the server nor relevant
maintenance notices/trouble tickets from the service provider to
explain this. I do see conspicuous gaps in network traffic volume
and system load from ~06:45 to ~08:10 UTC according to cacti:

http://cacti.openstack.org/?tree_id=1&leaf_id=156

Skipping back through previous days I find some similar gaps
starting anywhere from 06:30 to 07:00 and ending between 07:00 and
08:00 but they don't seem to occur every day and I'm not having much
luck finding a pattern. It _is_ conspicuously close to when
/etc/cron.daily scripts get fired from the crontab so might coincide
with log rotation/service restarts? The graphs don't show these gaps
correlating with any spikes in CPU, memory or disk activity so it
doesn't seem to be resource starvation (at least not for any common
resources we're tracking).



Indeed. It's down again today during the same timeslot.

Another idea for the cron-based theory:

https://github.com/openstack/uc-recognition/blob/master/tools/get_active_moderator.py

loops through the list of Ask OpenStack users via the API on a cron 
running on www.openstack.org. Not sure when that cron runs, but if it's 
similar, this could potentially be a high-load generator.
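
If the cron job does turn out to be the load generator, one option would be to pace the loop. The sketch below is not what get_active_moderator.py does today and is not the change proposed in review 433478; it only illustrates the idea. fetch_page is a hypothetical callable standing in for whatever function requests one page of users from the Ask API, and delay_seconds is an arbitrary value.

```python
import time
from typing import Callable, Iterator, List


def iter_users(
    fetch_page: Callable[[int], List[dict]],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield users page by page, pausing between requests.

    fetch_page(n) is assumed to return the users on page n, or an
    empty list once there are no more pages.
    """
    page = 1
    while True:
        users = fetch_page(page)
        if not users:
            return
        yield from users
        page += 1
        # Pause before the next request so the server is not hammered.
        sleep(delay_seconds)


if __name__ == "__main__":
    # Stand-in data instead of real API calls.
    fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    for user in iter_users(lambda n: fake_pages.get(n, []), delay_seconds=0.1):
        print(user["id"])
```

Passing sleep in as a parameter keeps the pacing easy to test without waiting in real time.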





Regards,


Tom



Re: [OpenStack-Infra] Ask.o.o down

2017-02-10 Thread Jeremy Stanley
On 2017-02-10 16:08:51 +0800 (+0800), Tom Fifield wrote:
[...]
> Down again, this time with "Network is unreachable".
[...]

I'm not finding any obvious errors on the server nor relevant
maintenance notices/trouble tickets from the service provider to
explain this. I do see conspicuous gaps in network traffic volume
and system load from ~06:45 to ~08:10 UTC according to cacti:

http://cacti.openstack.org/?tree_id=1&leaf_id=156

Skipping back through previous days I find some similar gaps
starting anywhere from 06:30 to 07:00 and ending between 07:00 and
08:00 but they don't seem to occur every day and I'm not having much
luck finding a pattern. It _is_ conspicuously close to when
/etc/cron.daily scripts get fired from the crontab so might coincide
with log rotation/service restarts? The graphs don't show these gaps
correlating with any spikes in CPU, memory or disk activity so it
doesn't seem to be resource starvation (at least not for any common
resources we're tracking).
-- 
Jeremy Stanley



Re: [OpenStack-Infra] Ask.o.o down

2017-02-10 Thread Tom Fifield

On 10/02/17 16:41, Marton Kiss wrote:

Hi Tom

Still not working from your end? Works perfectly from here.


It is back now.



Marton

On Fri, Feb 10, 2017 at 9:14 AM Tom Fifield <t...@openstack.org> wrote:

On 10/02/17 16:08, Tom Fifield wrote:
> On 13/01/17 14:41, Tom Fifield wrote:
>> As at 2017-01-13 0641Z, Ask.o.o is refusing connections (at least on
>> IPv4) - tried from a couple of different hosts
>>
>> fifieldt@docwork2:~$ wget ask.openstack.org
>> --2017-01-13 06:40:49--  http://ask.openstack.org/
>> Resolving ask.openstack.org (ask.openstack.org)... 23.253.72.95,
>> 2001:4800:7815:103:be76:4eff:fe05:89f3
>> Connecting to ask.openstack.org (ask.openstack.org)|23.253.72.95|:80...
>> failed: Connection refused.
>
> Down again, this time with "Network is unreachable".

Sorry, it is still connection refused as before. The network unreachable
bit is my end's IPv6 connectivity :)


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Ask.o.o down

2017-02-10 Thread Marton Kiss
Hi Tom

Still not working from your end? Works perfectly from here.

Marton

On Fri, Feb 10, 2017 at 9:14 AM Tom Fifield  wrote:

> On 10/02/17 16:08, Tom Fifield wrote:
> > On 13/01/17 14:41, Tom Fifield wrote:
> >> As at 2017-01-13 0641Z, Ask.o.o is refusing connections (at least on
> >> IPv4) - tried from a couple of different hosts
> >>
> >> fifieldt@docwork2:~$ wget ask.openstack.org
> >> --2017-01-13 06:40:49--  http://ask.openstack.org/
> >> Resolving ask.openstack.org (ask.openstack.org)... 23.253.72.95,
> >> 2001:4800:7815:103:be76:4eff:fe05:89f3
> >> Connecting to ask.openstack.org (ask.openstack.org)|23.253.72.95|:80...
> >> failed: Connection refused.
> >
> > Down again, this time with "Network is unreachable".
>
> Sorry, it is still connection refused as before. The network unreachable
> bit is my end's IPv6 connectivity :)
>
>
> ___
> OpenStack-Infra mailing list
> OpenStack-Infra@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
>
___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Ask.o.o down

2017-02-10 Thread Tom Fifield

On 13/01/17 14:41, Tom Fifield wrote:

As at 2017-01-13 0641Z, Ask.o.o is refusing connections (at least on
IPv4) - tried from a couple of different hosts

fifieldt@docwork2:~$ wget ask.openstack.org
--2017-01-13 06:40:49--  http://ask.openstack.org/
Resolving ask.openstack.org (ask.openstack.org)... 23.253.72.95,
2001:4800:7815:103:be76:4eff:fe05:89f3
Connecting to ask.openstack.org (ask.openstack.org)|23.253.72.95|:80...
failed: Connection refused.


Down again, this time with "Network is unreachable".


fifieldt@docwork2:~$ wget ask.openstack.org
--2017-02-10 08:07:10--  http://ask.openstack.org/
Resolving ask.openstack.org (ask.openstack.org)... 23.253.72.95, 2001:4800:7815:103:be76:4eff:fe05:89f3
Connecting to ask.openstack.org (ask.openstack.org)|23.253.72.95|:80... failed: Connection refused.
Connecting to ask.openstack.org (ask.openstack.org)|2001:4800:7815:103:be76:4eff:fe05:89f3|:80... failed: Network is unreachable.




fifieldt@docwork2:~$ traceroute ask.openstack.org
traceroute to ask.openstack.org (23.253.72.95), 30 hops max, 60 byte packets
[snip]
11  xe-0-2-4.bdr1.a.sjc.aarnet.net.au (202.158.194.162)  160.649 ms  160.670 ms  160.560 ms
12  xe-0-8-0-21.r01.snjsca04.us.bb.gin.ntt.net (128.241.219.153)  161.276 ms  161.056 ms  161.216 ms
13  ae-10.r23.snjsca04.us.bb.gin.ntt.net (129.250.3.174)  161.199 ms  161.050 ms  161.157 ms
14  ae-7.r23.dllstx09.us.bb.gin.ntt.net (129.250.4.155)  200.216 ms  200.287 ms  200.261 ms
15  ae-26.r01.dllstx04.us.bb.gin.ntt.net (129.250.6.129)  199.711 ms  199.693 ms  200.515 ms
16  ae-0.rackspace.dllstx04.us.bb.gin.ntt.net (129.250.207.118)  199.738 ms  199.764 ms  199.303 ms
17  * * *
18  74.205.108.121 (74.205.108.121)  200.814 ms be41.coreb.dfw1.rackspace.net (74.205.108.117)  201.289 ms  202.648 ms
19  po1.corea-core10.core10.dfw3.rackspace.net (74.205.108.49)  201.402 ms  202.792 ms po2.coreb-core10.core10.dfw3.rackspace.net (74.205.108.51)  201.389 ms
20  core9.aggr160a-4.dfw2.rackspace.net (98.129.84.201)  205.857 ms core9.aggr160b-4.dfw2.rackspace.net (98.129.84.203)  203.025 ms  204.921 ms
21  ask.openstack.org (23.253.72.95)  200.853 ms !X  202.494 ms !X  200.562 ms !X



fifieldt@usagi:~$ traceroute ask.openstack.org
traceroute to ask.openstack.org (23.253.72.95), 30 hops max, 60 byte packets
[snip]
12  las-b21-link.telia.net (62.115.125.142)  137.764 ms  136.495 ms  133.458 ms
13  dls-b21-link.telia.net (62.115.123.138)  203.140 ms dls-b21-link.telia.net (62.115.123.136)  166.356 ms dls-b21-link.telia.net (62.115.122.20)  168.077 ms
14  rackspace-ic-302090-dls-bb1.c.telia.net (62.115.33.78)  189.567 ms  187.802 ms  178.015 ms
15  10.25.1.71 (10.25.1.71)  188.676 ms  184.027 ms 10.25.1.66 (10.25.1.66)  172.849 ms
16  74.205.108.121 (74.205.108.121)  167.863 ms be41.corea.dfw1.rackspace.net (74.205.108.113)  186.449 ms 74.205.108.121 (74.205.108.121)  167.069 ms
17  po1.corea-core9.core9.dfw3.rackspace.net (74.205.108.45)  204.053 ms  204.352 ms  182.891 ms
18  core9.aggr160b-4.dfw2.rackspace.net (98.129.84.203)  211.036 ms core10.aggr160b-4.dfw2.rackspace.net (98.129.84.207)  206.673 ms core10.aggr160a-4.dfw2.rackspace.net (98.129.84.205)  178.904 ms
19  ask.openstack.org (23.253.72.95)  200.364 ms !X  201.845 ms !X  186.189 ms !X
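
Since the hostname resolves to both an IPv4 and an IPv6 address, it can help to probe each address separately, so that a "Connection refused" from the server is not confused with a "Network is unreachable" caused by the client's own routing. Below is a rough sketch using only the Python standard library; the function names probe and classify and the outcome labels are made up for illustration.

```python
import errno
import socket


def classify(exc):
    """Map a connect() failure to a short label."""
    if isinstance(exc, ConnectionRefusedError):
        # The remote end (or a filter in front of it) actively rejected the connection.
        return "refused"
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, OSError) and exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
        # Often a routing problem, possibly on the client side (e.g. no IPv6 route).
        return "unreachable"
    return "other"


def probe(host, port=80, timeout=5.0):
    """Try a TCP connection to every address the host resolves to and
    return a list of (address, outcome) tuples."""
    results = []
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            outcome = "connected"
        except OSError as exc:
            outcome = classify(exc)
        results.append((sockaddr[0], outcome))
    return results


# Example (performs real network connections):
#   for address, outcome in probe("ask.openstack.org"):
#       print(address, outcome)
```

With output like the wget run above, this would presumably report "refused" for the IPv4 address and "unreachable" for the IPv6 one, which separates the server-side problem from the local IPv6 issue.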



___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Ask.o.o down

2017-02-10 Thread Tom Fifield

On 10/02/17 16:08, Tom Fifield wrote:

On 13/01/17 14:41, Tom Fifield wrote:

As at 2017-01-13 0641Z, Ask.o.o is refusing connections (at least on
IPv4) - tried from a couple of different hosts

fifieldt@docwork2:~$ wget ask.openstack.org
--2017-01-13 06:40:49--  http://ask.openstack.org/
Resolving ask.openstack.org (ask.openstack.org)... 23.253.72.95,
2001:4800:7815:103:be76:4eff:fe05:89f3
Connecting to ask.openstack.org (ask.openstack.org)|23.253.72.95|:80...
failed: Connection refused.


Down again, this time with "Network is unreachable".


Sorry, it is still connection refused as before. The network unreachable 
bit is my end's IPv6 connectivity :)



___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Ask.o.o down

2017-01-13 Thread Jeremy Stanley
On 2017-01-13 10:33:24 +0000 (+0000), Marton Kiss wrote:
> You can find more details about the host here:
> http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=156
> It had a network outage somewhere, if you check the eth0, the
> traffic was zero.

Unfortunately I find no corresponding outage details listed at
https://status.rackspace.com/ nor any support tickets for the tenant
providing the instance for that service. The timeframe is
suspiciously right around when daily cron jobs would be running
(they start at 06:25 UTC) but I don't see anything in the system
logs that would indicate we ran anything that would paralyze the
system like that.
-- 
Jeremy Stanley

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Ask.o.o down

2017-01-13 Thread Marton Kiss
Tom,

You can find more details about the host here:
http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=156
It had a network outage somewhere; if you check eth0, the traffic was zero.

Marton

On Fri, Jan 13, 2017 at 8:05 AM Tom Fifield  wrote:

> ... and at 0700Z it is back ...
>
> On 13/01/17 14:41, Tom Fifield wrote:
> > As at 2017-01-13 0641Z, Ask.o.o is refusing connections (at least on
> > IPv4) - tried from a couple of different hosts
> >
> > fifieldt@docwork2:~$ wget ask.openstack.org
> > --2017-01-13 06:40:49--  http://ask.openstack.org/
> > Resolving ask.openstack.org (ask.openstack.org)... 23.253.72.95,
> > 2001:4800:7815:103:be76:4eff:fe05:89f3
> > Connecting to ask.openstack.org (ask.openstack.org)|23.253.72.95|:80...
> > failed: Connection refused.
> >
> > ___
> > OpenStack-Infra mailing list
> > OpenStack-Infra@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
>
>
> ___
> OpenStack-Infra mailing list
> OpenStack-Infra@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
>
___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Ask.o.o down

2017-01-12 Thread Tom Fifield

... and at 0700Z it is back ...

On 13/01/17 14:41, Tom Fifield wrote:

As at 2017-01-13 0641Z, Ask.o.o is refusing connections (at least on
IPv4) - tried from a couple of different hosts

fifieldt@docwork2:~$ wget ask.openstack.org
--2017-01-13 06:40:49--  http://ask.openstack.org/
Resolving ask.openstack.org (ask.openstack.org)... 23.253.72.95,
2001:4800:7815:103:be76:4eff:fe05:89f3
Connecting to ask.openstack.org (ask.openstack.org)|23.253.72.95|:80...
failed: Connection refused.

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] Ask.o.o down

2017-01-12 Thread Tom Fifield
As at 2017-01-13 0641Z, Ask.o.o is refusing connections (at least on 
IPv4) - tried from a couple of different hosts


fifieldt@docwork2:~$ wget ask.openstack.org
--2017-01-13 06:40:49--  http://ask.openstack.org/
Resolving ask.openstack.org (ask.openstack.org)... 23.253.72.95, 
2001:4800:7815:103:be76:4eff:fe05:89f3
Connecting to ask.openstack.org (ask.openstack.org)|23.253.72.95|:80... 
failed: Connection refused.


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra