retrace / faf issues

2017-06-26 Thread Kevin Fenzi
Greetings.

I've seen some various retrace/faf issues of late, so I thought I would
collect them into an email and see if you all could take a look and
solve them. :)

- retrace02.qa.fedoraproject.org has a 100% full disk.

- retrace01.qa.fedoraproject.org is almost constantly alerting on swap
being full. Not sure what to do about this, but perhaps we could add
more swap or somehow limit it to use only memory for normal jobs?

- faf01.stg has a daily cron that outputs:

/etc/cron.daily/logrotate:

error: skipping "/var/log/faf/create-problems.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/faf-celery-beat.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/faf-celery-worker.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/reposync.log" because parent directory has
insecure permissions (It's world writable or writable by group which is
not "root") Set "su" directive in config file to tell logrotate which
user/group should be used for rotation.
error: skipping "/var/log/faf/save-reports.log" because parent directory
has insecure permissions (It's world writable or writable by group which
is not "root") Set "su" directive in config file to tell logrotate which
user/group should be used for rotation.

- retrace01.qa.fedoraproject.org has a daily cron that outputs:

/etc/cron.daily/logrotate:

error: skipping "/var/log/faf/create-problems-core.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/create-problems.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/create-problems-oops.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/create-problems-python.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/create-problems-ruby.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/db_backup.log" because parent directory
has insecure permissions (It's world writable or writable by group which
is not "root") Set "su" directive in config file to tell logrotate which
user/group should be used for rotation.
error: skipping "/var/log/faf/export.log" because parent directory has
insecure permissions (It's world writable or writable by group which is
not "root") Set "su" directive in config file to tell logrotate which
user/group should be used for rotation.
error: skipping "/var/log/faf/faf-celery-beat.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/faf-celery-worker.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config file to tell
logrotate which user/group should be used for rotation.
error: skipping "/var/log/faf/find-components-centos.log" because parent
directory has insecure permissions (It's world writable or writable by
group which is not "root") Set "su" directive in config fi
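
For reference, a minimal sketch of the condition the logrotate error text describes: the parent directory is world-writable, or writable by a group other than root. It only mirrors the wording of the message and is not logrotate's actual implementation; the function name `insecure_parent` is hypothetical. The message itself points at the `su` directive as the fix; which user and group own /var/log/faf is not stated in this thread.

```python
import os
import stat


def insecure_parent(log_path):
    """Return True if the parent directory of log_path is world-writable,
    or group-writable by a group other than root (gid 0)."""
    parent = os.path.dirname(os.path.abspath(log_path))
    st = os.stat(parent)
    world_writable = bool(st.st_mode & stat.S_IWOTH)
    group_writable = bool(st.st_mode & stat.S_IWGRP)
    # Group write access only counts as insecure when the group is not root.
    return world_writable or (group_writable and st.st_gid != 0)
```

Calling `insecure_parent("/var/log/faf/create-problems.log")` on an affected host would be expected to return True as long as the directory permissions are unchanged.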

Re: retrace / faf issues

2017-06-27 Thread Miroslav Suchý
On 26.6.2017 at 18:50, Kevin Fenzi wrote:
> Greetings.
> 
> I've seen some various retrace/faf issues of late, so I thought I would
> collect them into an email and see if you all could take a look and
> solve them. :)

Thank you for bringing it up.


> - retrace02.qa.fedoraproject.org has a 100% full disk.

retrace02 is used just for staging/development, so it is not a big issue. But I
am working on it right now. Should be resolved by EOB.  Resolved now. :)

> - retrace01.qa.fedoraproject.org is almost constantly alerting on swap
> being full. Not sure what to do about this, but perhaps we could add
> more swap or somehow limit it to use only memory for normal jobs?

A few months ago I set postgresql to use more aggressive caching, so that is the
main culprit for consuming so much memory. I can easily lower it by a few
percent. But... I see right now that there is 16 GB of swap and 8 GB of it is
free. And the total available memory is 16 GB, because 8 GB is free swap and
8 GB is kernel buffers/cache. So when do you see those errors, and what are the
exact numbers in those alerts?


I will investigate remaining issues tomorrow.

-- 
Miroslav Suchy, RHCA
Red Hat, Senior Software Engineer, #brno, #devexp, #fedora-buildsys
___
infrastructure mailing list -- infrastructure@lists.fedoraproject.org
To unsubscribe send an email to infrastructure-le...@lists.fedoraproject.org


Re: retrace / faf issues

2017-06-27 Thread Stephen John Smoogen
On 27 June 2017 at 09:47, Miroslav Suchý wrote:
> On 26.6.2017 at 18:50, Kevin Fenzi wrote:
>> Greetings.
>>
>> I've seen some various retrace/faf issues of late, so I thought I would
>> collect them into an email and see if you all could take a look and
>> solve them. :)
>
> Thank you for bringing it up.
>
>
>> - retrace02.qa.fedoraproject.org has a 100% full disk.
>
> retrace02 is used just for staging/development. So not big issue. But I am 
> working on it right now. Should be resolved
> by EOB.  Resolved now. :)
>

Should we rename the system to be retrace01.stg.qa.fedoraproject.org?
That way we can treat problems on it as a lower priority from our point of view.

Second, who should we put on monitoring it and the other servers? I am
updating nagios so it can have more people aware of different
classes of users.

>> - retrace01.qa.fedoraproject.org is almost constantly alerting on swap
>> being full. Not sure what to do about this, but perhaps we could add
>> more swap or somehow limit it to use only memory for normal jobs?
>
> Few months ago I set postgresql to use more aggressive caching. So that is
> main culprit for consuming so much memory.
> I can easily lower it by few percent. But... I see right now that there is 
> 16GB swap and 8 GB is free. And total
> available memory is 16 GB. Because 8GB free swap and 8GB are kernel 
> buffers/cache. So when you see those errors and what
> are the exact numbers in those alerts?
>
>
> I will investigate remaining issues tomorrow.
>
> --
> Miroslav Suchy, RHCA
> Red Hat, Senior Software Engineer, #brno, #devexp, #fedora-buildsys



-- 
Stephen J Smoogen.


Re: retrace / faf issues

2017-06-27 Thread Kevin Fenzi
On 06/27/2017 07:47 AM, Miroslav Suchý wrote:
> On 26.6.2017 at 18:50, Kevin Fenzi wrote:
>> Greetings.
>>
>> I've seen some various retrace/faf issues of late, so I thought I would
>> collect them into an email and see if you all could take a look and
>> solve them. :)
> 
> Thank you for bringing it up.
> 
> 
>> - retrace02.qa.fedoraproject.org has a 100% full disk.
> 
> retrace02 is used just for staging/development. So not big issue. But I am 
> working on it right now. Should be resolved
> by EOB.  Resolved now. :)

Thanks!

That does bring up one more issue: You are using firewalld there and
aren't allowing our nagios/nrpe. I added a rule to allow port 5666/tcp.
You might also add this upstream/ansible.
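
A minimal sketch of how one could confirm from the monitoring side that 5666/tcp is reachable after such a rule is added. It only tests that a TCP connection can be opened; it does not speak the NRPE protocol or check that the nagios host is allowed by the NRPE daemon. The function name `port_is_open` and the default timeout are assumptions for illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("retrace02.qa.fedoraproject.org", 5666)` would be expected to return False while the firewall drops the traffic, and True once the port is allowed and the NRPE daemon is listening.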

> 
>> - retrace01.qa.fedoraproject.org is almost constantly alerting on swap
>> being full. Not sure what to do about this, but perhaps we could add
>> more swap or somehow limit it to use only memory for normal jobs?
> 
> Few months ago I set postgresql to use more aggressive caching. So that is
> main culprit for consuming so much memory.
> I can easily lower it by few percent. But... I see right now that there is 
> 16GB swap and 8 GB is free. And total
> available memory is 16 GB. Because 8GB free swap and 8GB are kernel 
> buffers/cache. So when you see those errors and what
> are the exact numbers in those alerts?

retrace01.qa.fedoraproject.org

Looks like it alerted just a few min ago:

Swap

Notifications for this service have been disabled
CRITICAL  06-27-2017 14:15:24  0d 0h 11m 8s  3/3
SWAP CRITICAL - 7% free (1011 MB out of 16383 MB)

Swap-Is-Low

Notifications for this service have been disabled
CRITICAL  06-27-2017 14:15:03  0d 0h 11m 29s  4/4
SWAP CRITICAL - 7% free (1002 MB out of 16383 MB)
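
As a rough illustration of the numbers in these alerts, here is a minimal sketch that computes the free-swap percentage and classifies it. The warning and critical thresholds (20% and 10%) are assumptions picked for the example, not the values configured in our nagios, and the helper names are hypothetical. `parse_meminfo` expects text in the format of /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_free_percent(free, total):
    """Return free swap as a percentage of total swap (same units for both)."""
    if total <= 0:
        raise ValueError("no swap configured")
    return 100.0 * free / total


def swap_status(free, total, warn_pct=20.0, crit_pct=10.0):
    """Classify swap usage; the thresholds are assumed, not taken from nagios."""
    pct = swap_free_percent(free, total)
    if pct < crit_pct:
        return "CRITICAL"
    if pct < warn_pct:
        return "WARNING"
    return "OK"
```

With the figures from the alert, `swap_status(1011, 16383)` is critical under these assumed thresholds, while 8 GB free out of 16 GB would come out as OK.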


> 
> I will investigate remaining issues tomorrow.
> 

Great, thanks.

kevin





Re: retrace / faf issues

2017-07-03 Thread Miroslav Suchý
On 27.6.2017 at 16:17, Kevin Fenzi wrote:
>>> - retrace01.qa.fedoraproject.org is almost constantly alerting on swap
>>> being full. Not sure what to do about this, but perhaps we could add
>>> more swap or somehow limit it to use only memory for normal jobs?
>> Few months ago I set postgresql to use more aggressive caching. So that is
>> main culprit for consuming so much memory.
>> I can easily lower it by few percent. But... I see right now that there is 
>> 16GB swap and 8 GB is free. And total
>> available memory is 16 GB. Because 8GB free swap and 8GB are kernel 
>> buffers/cache. So when you see those errors and what
>> are the exact numbers in those alerts?
> retrace01.qa.fedoraproject.org
> 
> Looks like it alerted just a few min ago:
>   
> Swap
>   
> Notifications for this service have been disabled
> CRITICAL  06-27-2017 14:15:24  0d 0h 11m 8s  3/3
> SWAP CRITICAL - 7% free (1011 MB out of 16383 MB)
>   
> Swap-Is-Low
>   
> Notifications for this service have been disabled
> CRITICAL  06-27-2017 14:15:03  0d 0h 11m 29s  4/4
> SWAP CRITICAL - 7% free (1002 MB out of 16383 MB)
> 
> 

OK. I lowered the DB cache settings. Please let me know if this happens again.

-- 
Miroslav Suchy, RHCA
Red Hat, Senior Software Engineer, #brno, #devexp, #fedora-buildsys


Re: retrace / faf issues

2017-07-04 Thread Kevin Fenzi
On 07/03/2017 05:44 AM, Miroslav Suchý wrote:
> On 27.6.2017 at 16:17, Kevin Fenzi wrote:
>>>> - retrace01.qa.fedoraproject.org is almost constantly alerting on swap
>>>> being full. Not sure what to do about this, but perhaps we could add
>>>> more swap or somehow limit it to use only memory for normal jobs?
>>> Few months ago I set postgresql to use more aggressive caching. So that is
>>> main culprit for consuming so much memory.
>>> I can easily lower it by few percent. But... I see right now that there is 
>>> 16GB swap and 8 GB is free. And total
>>> available memory is 16 GB. Because 8GB free swap and 8GB are kernel 
>>> buffers/cache. So when you see those errors and what
>>> are the exact numbers in those alerts?
>> retrace01.qa.fedoraproject.org
>>
>> Looks like it alerted just a few min ago:
>>  
>> Swap
>>  
>> Notifications for this service have been disabled
>> CRITICAL  06-27-2017 14:15:24  0d 0h 11m 8s  3/3
>> SWAP CRITICAL - 7% free (1011 MB out of 16383 MB)
>>  
>> Swap-Is-Low
>>  
>> Notifications for this service have been disabled
>> CRITICAL  06-27-2017 14:15:03  0d 0h 11m 29s  4/4
>> SWAP CRITICAL - 7% free (1002 MB out of 16383 MB)
>>
>>
> 
> OK. I lowered the DB cache settings. Please let me know if this happens again.

ok. It looks like it's good so far. ;)

The only thing left on our nagios report is retrace02 being 100% full on
disk. ;)

kevin







Re: retrace / faf issues

2017-07-05 Thread Brandon Gray
Below is a patch to add firewalld to the base_pkgs_erase var (used by the base
role).  Like the Fedora var, this will remove firewalld from RHEL systems
and should fix the issue below.

From dc7c5dc38efab1873c43b6a5d85978d44843bc72 Mon Sep 17 00:00:00 2001
From: Brandon Gray 
Date: Wed, 5 Jul 2017 08:12:54 -0500
Subject: [PATCH] added firewalld to base package removal for rhel

---
 vars/RedHat.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vars/RedHat.yml b/vars/RedHat.yml
index bd4c73c..3aff512 100644
--- a/vars/RedHat.yml
+++ b/vars/RedHat.yml
@@ -1,7 +1,7 @@
 ---
 dist_tag: el{{ ansible_distribution_version[0] }}
 base_pkgs_inst: []
-base_pkgs_erase: ['firstboot-tui','bluez-utils', 'sendmail']
+base_pkgs_erase: ['firstboot-tui','bluez-utils', 'sendmail', 'firewalld']
 service_disabled: []
 service_enabled: []
 is_rhel: True
-- 
2.9.4



> That does bring up one more issue: You are using firewalld there and
> aren't allowing our nagios/nrpe. I added a rule to allow port 5666/tcp.
> You might also add this upstream/ansible.
>
>


Re: retrace / faf issues

2017-07-05 Thread Stephen John Smoogen
Looks good. I would +1 this

On 5 July 2017 at 09:22, Brandon Gray wrote:
> Below is a patch to add firewalld to the base_pkgs_erase var (used by base
> role).  Like the Fedora var, this will remove firewalld from RHEL systems
> and should fix the issue below.
>
> From dc7c5dc38efab1873c43b6a5d85978d44843bc72 Mon Sep 17 00:00:00 2001
> From: Brandon Gray 
> Date: Wed, 5 Jul 2017 08:12:54 -0500
> Subject: [PATCH] added firewalld to base package removal for rhel
>
> ---
>  vars/RedHat.yml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/vars/RedHat.yml b/vars/RedHat.yml
> index bd4c73c..3aff512 100644
> --- a/vars/RedHat.yml
> +++ b/vars/RedHat.yml
> @@ -1,7 +1,7 @@
>  ---
>  dist_tag: el{{ ansible_distribution_version[0] }}
>  base_pkgs_inst: []
> -base_pkgs_erase: ['firstboot-tui','bluez-utils', 'sendmail']
> +base_pkgs_erase: ['firstboot-tui','bluez-utils', 'sendmail', 'firewalld']
>  service_disabled: []
>  service_enabled: []
>  is_rhel: True
> --
> 2.9.4
>
>
>>
>> That does bring up one more issue: You are using firewalld there and
>> aren't allowing our nagios/nrpe. I added a rule to allow port 5666/tcp.
>> You might also add this upstream/ansible.
>>
>
>



-- 
Stephen J Smoogen.


Re: retrace / faf issues

2017-07-19 Thread Brandon Gray
Now that we're out of freeze, is this something that should be committed?

On Wed, Jul 5, 2017 at 10:12 AM, Stephen John Smoogen wrote:

> Looks good. I would +1 this
>
> On 5 July 2017 at 09:22, Brandon Gray  wrote:
> > Below is a patch to add firewalld to the base_pkgs_erase var (used by base
> > role).  Like the Fedora var, this will remove firewalld from RHEL systems
> > and should fix the issue below.
> >
> > From dc7c5dc38efab1873c43b6a5d85978d44843bc72 Mon Sep 17 00:00:00 2001
> > From: Brandon Gray 
> > Date: Wed, 5 Jul 2017 08:12:54 -0500
> > Subject: [PATCH] added firewalld to base package removal for rhel
> >
> > ---
> >  vars/RedHat.yml | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/vars/RedHat.yml b/vars/RedHat.yml
> > index bd4c73c..3aff512 100644
> > --- a/vars/RedHat.yml
> > +++ b/vars/RedHat.yml
> > @@ -1,7 +1,7 @@
> >  ---
> >  dist_tag: el{{ ansible_distribution_version[0] }}
> >  base_pkgs_inst: []
> > -base_pkgs_erase: ['firstboot-tui','bluez-utils', 'sendmail']
> > +base_pkgs_erase: ['firstboot-tui','bluez-utils', 'sendmail', 'firewalld']
> >  service_disabled: []
> >  service_enabled: []
> >  is_rhel: True
> > --
> > 2.9.4
> >
> >
> >>
> >> That does bring up one more issue: You are using firewalld there and
> >> aren't allowing our nagios/nrpe. I added a rule to allow port 5666/tcp.
> >> You might also add this upstream/ansible.
> >>
> >
> >
>
>
>
> --
> Stephen J Smoogen.


Re: retrace / faf issues

2017-07-19 Thread Stephen John Smoogen
Thanks for the reminder. Done [master 1307eb0] put in
graybran...@gmail.com patch to remove firewalld from base install

On 19 July 2017 at 09:02, Brandon Gray wrote:
> Now that we're out of freeze, is this something that should be committed?
>
> On Wed, Jul 5, 2017 at 10:12 AM, Stephen John Smoogen 
> wrote:
>>
>> Looks good. I would +1 this
>>
>> On 5 July 2017 at 09:22, Brandon Gray  wrote:
>> > Below is a patch to add firewalld to the base_pkgs_erase var (used by base
>> > role).  Like the Fedora var, this will remove firewalld from RHEL systems
>> > and should fix the issue below.
>> >
>> > From dc7c5dc38efab1873c43b6a5d85978d44843bc72 Mon Sep 17 00:00:00 2001
>> > From: Brandon Gray 
>> > Date: Wed, 5 Jul 2017 08:12:54 -0500
>> > Subject: [PATCH] added firewalld to base package removal for rhel
>> >
>> > ---
>> >  vars/RedHat.yml | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/vars/RedHat.yml b/vars/RedHat.yml
>> > index bd4c73c..3aff512 100644
>> > --- a/vars/RedHat.yml
>> > +++ b/vars/RedHat.yml
>> > @@ -1,7 +1,7 @@
>> >  ---
>> >  dist_tag: el{{ ansible_distribution_version[0] }}
>> >  base_pkgs_inst: []
>> > -base_pkgs_erase: ['firstboot-tui','bluez-utils', 'sendmail']
>> > +base_pkgs_erase: ['firstboot-tui','bluez-utils', 'sendmail', 'firewalld']
>> >  service_disabled: []
>> >  service_enabled: []
>> >  is_rhel: True
>> > --
>> > 2.9.4
>> >
>> >
>> >>
>> >> That does bring up one more issue: You are using firewalld there and
>> >> aren't allowing our nagios/nrpe. I added a rule to allow port 5666/tcp.
>> >> You might also add this upstream/ansible.
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Stephen J Smoogen.
>
>
>
>



-- 
Stephen J Smoogen.