Re: problems with cbl.abuseat.org

2022-08-03 Thread Claudio Kuenzler
On Thu, Jun 30, 2022 at 9:55 AM Jeremy Ardley  wrote:

> I'm using postfix as my MTA and lately I've been missing a significant
> fraction from my usual mail
>
> e.g. email from linkedin and spamassassin list.
>
> Tracking it down I see they are all getting rejected by abuseat. e.g.
>
> Jun 30 14:20:09 egde postfix/25pass/smtpd[21040]: NOQUEUE: reject: RCPT
> from mail.openbsd.org[199.185.178.25]: 554 5.7.1 Service unavailable;
> Client host [199.185.178.25] blocked using cbl.abuseat.org; Error: open
> resolver; https://www.spamhaus.org/returnc/pub/172.68.1.20;
> from= to=<
> jer...@ardley.org> proto=ESMTP helo=
>

I just ran into this problem as well ->
https://twitter.com/ClaudioKuenzler/status/1554559303507492865
Since yesterday afternoon (August 2nd, 2022), all incoming mails have been
rejected with a "blocked using ..." error. This affects all DNSBLs managed
by Spamhaus.
The reason is that Spamhaus stopped serving their DNSBL via public DNS
resolvers, such as Cloudflare's 1.1.1.1 or Google's 8.8.8.8 and 8.8.4.4
resolvers.
See:
https://www.spamhaus.com/resource-center/if-you-query-spamhaus-projects-dnsbls-via-cloudflares-dns-move-to-the-free-data-query-service/
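
For anyone who wants to check whether their own resolver is affected: a
DNSBL lookup is a plain DNS A query for the reversed IP address prepended
to the list's zone. Below is a minimal Python sketch of such a lookup. The
function names are my own, and the meanings attached to the 127.255.255.x
answers reflect my understanding of the Spamhaus error return codes, so
please verify them against the current Spamhaus documentation.

```python
import socket

# Error answers as I understand the Spamhaus documentation (assumption,
# verify before relying on it). They mean "your query was refused", not
# "this IP is listed".
ERROR_CODES = {
    "127.255.255.252": "typing error in DNSBL name",
    "127.255.255.254": "query came through a public/open resolver",
    "127.255.255.255": "excessive number of queries",
}


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets + DNSBL zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return ".".join(reversed(octets)) + "." + zone


def classify_answer(answer):
    """Interpret the A record returned by the DNSBL (None = no record)."""
    if answer is None:
        return "not listed"
    if answer in ERROR_CODES:
        return "error: " + ERROR_CODES[answer]
    return "listed"


def check(ip: str, zone: str) -> str:
    """Resolve through the system resolver and classify the result."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No A record. Note that a failing resolver looks the same here.
        answer = None
    return classify_answer(answer)
```

Running check("199.185.178.25", "cbl.abuseat.org") on the mail server
itself goes through the system resolver, which is normally the one Postfix
uses as well. An "error: ..." result means the DNSBL refused the query, so
a reject based on it says nothing about the sending host.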

The solution, according to Spamhaus support, is to register for a free
account and use their DQS (Data Query Service).
I just did that this morning and although I was assured this would be a
free-of-charge account, I have now received a quote of USD 450 per year.
I guess that was it then with Spamhaus.

cheers,
ck


Re: wtf just happened to my local staging web server

2022-05-04 Thread Claudio Kuenzler
On Wed, May 4, 2022 at 7:18 PM Gary Dale  wrote:

> May 04 12:16:55 TheLibrarian systemd[1]: Starting The Apache HTTP
> Server...
> May 04 12:16:55 TheLibrarian apachectl[7935]: (98)Address already in use:
> AH00072: make_sock: could not bind to addre>
> May 04 12:16:55 TheLibrarian apachectl[7935]: (98)Address already in use:
> AH00072: make_sock: could not bind to addre>
> May 04 12:16:55 TheLibrarian apachectl[7935]: no listening sockets
> available, shutting down
> May 04 12:16:55 TheLibrarian apachectl[7935]: AH00015: Unable to open logs
> May 04 12:16:55 TheLibrarian apachectl[7932]: Action 'start' failed.
> May 04 12:16:55 TheLibrarian apachectl[7932]: The Apache error log may
> have more information.
> May 04 12:16:55 TheLibrarian systemd[1]: apache2.service: Control process
> exited, code=exited, status=1/FAILURE
> May 04 12:16:55 TheLibrarian systemd[1]: apache2.service: Failed with
> result 'exit-code'.
> May 04 12:16:55 TheLibrarian systemd[1]: Failed to start The Apache HTTP
> Server.
>
The errors show that Apache was unable to bind to the listener port
(Address already in use).

Check for other services (maybe Nginx?) which are listening on the same
port that Apache tries to bind to.
Run: netstat -lntup
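
If netstat is not at hand, a quick check from Python can at least tell
whether anything accepts connections on the port. This is only a sketch:
the helper name is made up, it only probes the IPv4 loopback address by
default, and unlike netstat it cannot tell you which process owns the
socket.

```python
import socket


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
```

With Apache stopped, is_listening(80) returning True would mean some other
service already holds port 80 (same idea for 443).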

Also check /etc/apache2/ports.conf for possible misconfigurations.

Are you using HTTP (Port 80) only or also HTTPS (Port 443)?

Just to rule a config error out, run "apache2ctl configtest".

> As I said, I do get the default Apache2 page saying "It works" but that
> appears to be optimistic. ps aux | grep apache2 fails to show the service,
> which confirms the systemctl message that it isn't running.
>
That could be your browser cache tricking you. You can verify with "curl
localhost".


Re: Bullseye (mostly) not booting on Proliant DL380 G7

2021-09-16 Thread Claudio Kuenzler
On Wed, Jun 30, 2021 at 9:51 AM Paul Wise  wrote:

> Claudio Kuenzler wrote:
>
> > I currently suspect a Kernel bug in 5.10.
>

Thanks to everyone for hints and suggestions!
In the end it turned out to be an issue with the hpwdt module. After
blacklisting this module, no more boot or stability issues were detected
with Bullseye.
Findings documented in my blog:
https://www.claudiokuenzler.com/blog/1125/debian-11-bullseye-boot-freeze-kernel-panic-hp-proliant-dl380


Re: Bullseye (mostly) not booting on Proliant DL380 G7

2021-06-29 Thread Claudio Kuenzler
> I tend to suspect it's unrelated, but if you add "nomodeset nofb" to
> your boot command line it will turn off the graphics drivers.
>

Yes, I guess it is indeed unrelated. With buster I can see the same
messages during boot:
- *ERROR* Failed to load firmware! on drm
- pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled

But no crash happens, Buster boots correctly (every time) and I can use the
system.
I currently suspect a Kernel bug in 5.10. I might be dead wrong, but that's
my current guess.


Re: Bullseye (mostly) not booting on Proliant DL380 G7

2021-06-29 Thread Claudio Kuenzler
>
> Trace dump suggests that crash occurs while executing cpuidle module.
> Try to boot with "intel_pstate=force" kernel parameter [1] to force
> different CPU driver (if CPU supports it) and\or "cpuidle.off=1" to disable
> cpuidle subsystem.
>
>
Thank you Alexander and Georgi (thanks for the link!) for your answers.
Highly appreciated!

I have tried the additional kernel parameters intel_pstate=force and
cpuidle.off=1 but unfortunately this didn't solve the problem. The freeze
still happened at around 50% of the boots.

I now wiped Bullseye and installed Buster. The very same server was
rebooted at least 10 times without any hiccup/freeze/crash.
Seems there are indeed some major issues which are not solved yet. If they
come from Debian Installer, they may be related to bug #987441 (
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987441).
Or if they are Kernel related (Buster uses 4.19, Bullseye 5.10) it might be
a completely different problem.


Re: Bullseye (mostly) not booting on Proliant DL380 G7

2021-06-29 Thread Claudio Kuenzler
Hi Georgi

> I noticed that kernel logs you posted are between 62nd - 64th second
> after kernel loading. Why is the boot process so slow?
>

Due to a disabled SATA device in BIOS, the kernel tries to do an ERST and
SRST and does this until 60s after boot.
That's OK, it's been the same on Buster, too.


> If you think that video driver can be an issue then you can try to
> configure the system not to use framebuffer (if the system
>  doesn't use GUI).
>

Could you tell me how? Or a reference to it?

In the meantime I re-configured grub to boot with the following parameters:

debug rootwait earlyprintk=vga,keep earlycon pause_on_oops=5 panic=60
no_console_suspend


The last two boots now show a crash in the console - even with
firmware-amd-graphics and firmware-linux-nonfree installed.

[ 69.546005 ] asm_call_irq_on_stack+0x12/0x20
[ 69.546005 ] 
[ 69.546006 ] common_interrupt+0xb0/0x130
[ 69.546006 ] asm_common_interrupt+0x1e/0x40
[ 69.546006 ] RIP: 0010:cpuidle_enter_state+0xc4/0x350
[ 69.546007 ] Code: a2 ff 65 8b 3d 6d 39 f7 7b e8 78 2e a2 ff 49 89 c5 66 66 66 66 90 31 ff e8 09 39 a2 ff 45 84 ff 0f 85 fa 00 00 00 fb 66 66 90 <66> 66 90 45 85 f6 0f 88 06 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14
[ 69.546008 ] RSP: 0018:9dec062cfea8 EFLAGS: 0246
[ 69.546008 ] RAX: 89266fa2bc00 RBX: 0004 RCX: 001f
[ 69.546009 ] RDX:  RSI: 24eefefa RDI:
[ 69.546009 ] RBP: bdebff218e00 R08: 001022521b20 R09: 0018
[ 69.546010 ] R10: 04fa R11: 006cd R12: 851ae680
[ 69.546010 ] R13: 001022521b20 R14: 4 R15:

[ 69.546010 ] ? cpuidle_enter_state+0xb7/0x350
[ 69.546011 ] cpuidle_enter+0x29/0x40
[ 69.546011 ] do_idle+0x1ef/0x2b0
[ 69.546011 ] cpu_startup_entry+0x19/0x20
[ 69.546012 ] secondary_startup_64_no_verify+0xb0/0xbb
[ 69.546012 ] ---[ end trace 96fbf4be0200356d ]---

And on another crash almost the same but slightly different:

[ 69.331313 ] ? mwait_idle_with_hints.constprop.0+0x4b/0x90
[ 69.331313 ] ? mwait_idle_with_hints.constprop.0+0x4b/0x90
[ 69.331313 ] 
[ 69.331314 ] intel_idle+0x1f/0x30
[ 69.331314 ] cpuidle_enter_state+0x89/0x350
[ 69.331314 ] cpuidle_enter+0x29/0x40
[ 69.331315 ] do_idle+0x1ef/0x2b0
[ 69.331315 ] cpu_startup_entry+0x19/0x20
[ 69.331315 ] start_kernel+0x587/0x5a8
[ 69.331315 ] secondary_startup_64_no_verify+0xb0/0xbb
[ 69.511534 ] DMAR: [DMA Read] Request device [00:1e.0] PASID fault addr 3000 [fault reason 06] PTE Read access is not set
[ 69.511541 ] Kernel Offset: 0x28a0 from 0x8100 (relocation range: 0x8000-0xbfff)

I have recorded the boot crash: https://youtu.be/TIfX-isjM3E (see between
second 47 and 48).


Re: Bullseye (mostly) not booting on Proliant DL380 G7

2021-06-29 Thread Claudio Kuenzler
Sorry for replying to myself all the time ;-)
I was just able to catch a "freeze" followed by a successful boot
afterwards.

The successful boot continues with these lines:

[   62.922169] systemd[1]: Finished Create System Users.
[   62.923633] systemd[1]: Starting Create Static Device Nodes in /dev...
[   62.941753] systemd[1]: Finished Create Static Device Nodes in /dev.
[   62.944691] systemd[1]: Starting Rule-based Manager for Device Events
and Files...
[   62.953082] systemd[1]: modprobe@drm.service: Succeeded.
[   62.953539] systemd[1]: Finished Load Kernel Module drm.
[   62.983630] systemd[1]: Started Rule-based Manager for Device Events and
Files.
[   62.991307] systemd[1]: Finished Set the console keyboard layout.
[   63.015898] input: Power Button as
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input5
[   63.016490] systemd[1]: Finished Coldplug All udev Devices.
[   63.018250] systemd[1]: Starting Helper to synchronize boot up for
ifupdown...
[   63.020119] power_meter ACPI000D:00: Found ACPI power meter.
[   63.020214] power_meter ACPI000D:00: Ignoring unsafe software power cap!
[   63.020280] power_meter ACPI000D:00: hwmon_device_register() is
deprecated. Please convert the driver to use
hwmon_device_register_with_info().
[   63.029971] systemd[1]: Finished Monitoring of LVM2 mirrors, snapshots
etc. using dmeventd or progress polling.
[   63.030392] systemd[1]: Reached target Local File Systems (Pre).
[   63.031784] IPMI message handler: version 39.2
[   63.035060] ipmi device interface
[   63.036149] ACPI: Power Button [PWRF]
[   63.038539] EDAC MC1: Giving out device to module i7core_edac.c
controller i7 core #1: DEV :3e:03.0 (INTERRUPT)
[   63.038670] EDAC PCI0: Giving out device to module i7core_edac
controller EDAC PCI controller: DEV :3e:03.0 (POLLED)
[   63.039204] EDAC MC0: Giving out device to module i7core_edac.c
controller i7 core #0: DEV :3f:03.0 (INTERRUPT)
[   63.039315] EDAC PCI1: Giving out device to module i7core_edac
controller EDAC PCI controller: DEV :3f:03.0 (POLLED)
[   63.039405] EDAC i7core: Driver loaded, 2 memory controller(s) found.
[   63.044910] ipmi_si: IPMI System Interface driver
[   63.044996] ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
[   63.045059] ipmi_platform: ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1
irq 0
[   63.045134] ipmi_si: Adding SMBIOS-specified kcs state machine
[   63.045263] ipmi_si IPI0001:00: ipmi_platform: probing via ACPI
[   63.045393] ipmi_si IPI0001:00: ipmi_platform: [io  0x0ca2-0x0ca3]
regsize 1 spacing 1 irq 0
[   63.045652] iTCO_vendor_support: vendor-support=0
[   63.046504] hpwdt :02:00.0: HPE Watchdog Timer Driver: NMI decoding
initialized

This line catches my attention:

[   62.953082] systemd[1]: modprobe@drm.service: Succeeded.

This is missing (doesn't show) when the freeze happens.

FYI: in the meantime I also installed firmware-amd-graphics; however, the
behaviour (sometimes freeze, sometimes boot) is still the same.

I continue to troubleshoot but if anyone has experienced something similar
or has some hints or can point to existing bugs please let me know.

On Tue, Jun 29, 2021 at 10:04 AM Claudio Kuenzler 
wrote:

> Meanwhile I was able to identify more by removing "quiet" from the grub
> loader.
> The pcc_cpufreq_init does not seem to hurt the booting - these are just
> warnings popping up.
>
> The following messages appear on the console before the server freezes:
>
> [ OK ] Finished Load Kernel Module fuse.
> [ 62.887855] systemd[1]: Mounting FUSE Control File System...
>Mounting FUSE Control File System...
> [ 62.891852] systemd[1]: Finished Apply Kernel Variables.
> [ OK ] Finished Apply Kernel Variables.
> [ 62.892237] systemd[1]: Mounted FUSE Control File System.
> [ OK ] Mounted FUSE Control File System.
> [ 62.900668] systemd[1]: Finished Create System Users.
> [ OK ] Finished Create System Users.
> [ 62.902224] systemd[1]: Starting Create Static Device Nodes in /dev...
>   Starting Create Static Device Nodes in /dev...
> [ 62.920767] systemd[1]: modprobe@drm.service: Succeeded.
> [ 62.921202] systemd[1]: Finished Load Kernel Module drm.
> [ OK ] Finished Load Kernel Module drm.
> [ 62.921979] systemd[1]: Finished Create Static Device Nodes in /dev.
> [ OK ] Finished Create Static Device Nodes in /dev.
> [ 62.925007] systemd[1]: Starting Rule-based Manager for Device Events and
> Files...
>Starting Rule-based Manager for Device Events and Files...
> [ 62.955322] systemd[1]: Finished Monitoring of LVM2 mirrors, snapshots
> etc. using dmeventd or progress polling.
> [ OK ] Finished Monitoring of LVM2 mirrors, snapshots etc. using dmeventd
> or progress polling.
> [ 62.962186] systemd[1]: Started Rule-based Manager for Device Events and
> Files.
>
> After this, no further messages, no login prompt, server does not react to
> keyboard input an

Re: Bullseye (mostly) not booting on Proliant DL380 G7

2021-06-29 Thread Claudio Kuenzler
 0x60 0x60 0x60 0x60
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207000] [drm]   Encoders:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207001] [drm] CRT1:
INTERNAL_DAC1
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207002] [drm] Connector 1:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207003] [drm]   VGA-2
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207004] [drm]   DDC: 0x6c 0x6c
0x6c 0x6c 0x6c 0x6c 0x6c 0x6c
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207004] [drm]   Encoders:
Jun 28 16:15:05 irczsrvp08 kernel: [   63.207005] [drm] CRT2:
INTERNAL_DAC2
Jun 28 16:15:05 irczsrvp08 kernel: [   63.236242] kvm:
VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
Jun 28 16:15:05 irczsrvp08 kernel: [   63.245005] EXT4-fs (dm-0): mounted
filesystem with ordered data mode. Opts: (null)
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250269] [drm] fb mappable at
0xE804
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250270] [drm] vram apper at
0xE800
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250271] [drm] size 1572864
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250271] [drm] fb depth is 16
Jun 28 16:15:05 irczsrvp08 kernel: [   63.250272] [drm]pitch is 2048

Maybe related to the known bullseye errata
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989863 ?



On Mon, Jun 28, 2021 at 8:32 PM Claudio Kuenzler 
wrote:

> Hello!
>
> Currently testing the new Bullseye release (using
> firmware-bullseye-DI-rc2-amd64-netinst.iso) and see a strange phenomenon on
> a HP Proliant DL380 G7 server.
>
> During boot, the following messages show up in the console:
>
> [63.063844] pcc_cpufreq_init: Too many CPUs, dynamic performance scaling
> disabled
> [63.063895] pcc_cpufreq_init: Try to enable another scaling driver through
> BIOS settings
> [63.063943] pcc_cpufreq_init: and complain to the system vendor
>
> According to
> https://patchwork.kernel.org/project/linux-pm/patch/5423012.zznfdyd...@aspire.rjw.lan/
> this is a Kernel patch from July 2018.
> According to Andreas Herrmann, the settings can be defined in the HP
> server BIOS:
>
> Power Management -> Advanced Power Options -> Collaborative Power Control
> = enabled
>
> This is active (is the default I believe). The Power Regulator is set to
> "Dynamic Power Savings Mode".
>
> After these messages show up on the console, no login prompt appears. No
> network started. The server seems frozen - doesn't even react to
> CTRL+ALT+DEL on the console anymore. Not sure if this is caused by cpufreq
> or something else though.
>
> This boot problem happened on 2 out of 3 server boots.
>
> Is this a bug in Bullseye?
>
> thx for any hints.
>
>


Bullseye (mostly) not booting on Proliant DL380 G7

2021-06-28 Thread Claudio Kuenzler
Hello!

I am currently testing the new Bullseye release (using
firmware-bullseye-DI-rc2-amd64-netinst.iso) and see a strange phenomenon on
an HP ProLiant DL380 G7 server.

During boot, the following messages show up in the console:

[63.063844] pcc_cpufreq_init: Too many CPUs, dynamic performance scaling
disabled
[63.063895] pcc_cpufreq_init: Try to enable another scaling driver through
BIOS settings
[63.063943] pcc_cpufreq_init: and complain to the system vendor

According to
https://patchwork.kernel.org/project/linux-pm/patch/5423012.zznfdyd...@aspire.rjw.lan/
this is a Kernel patch from July 2018.
According to Andreas Herrmann, the settings can be defined in the HP server
BIOS:

Power Management -> Advanced Power Options -> Collaborative Power Control =
enabled

This is active (is the default I believe). The Power Regulator is set to
"Dynamic Power Savings Mode".

After these messages show up on the console, no login prompt appears. No
network started. The server seems frozen - doesn't even react to
CTRL+ALT+DEL on the console anymore. Not sure if this is caused by cpufreq
or something else though.

This boot problem happened on 2 out of 3 server boots.

Is this a bug in Bullseye?

thx for any hints.


Re: How to make dhclient reread its config? (Debian 10)

2020-08-25 Thread Claudio Kuenzler
On Wed, Aug 26, 2020 at 5:56 AM Victor Sudakov  wrote:

> Dear Colleagues,
>
> I've made some changes to /etc/dhcp/dhclient.conf, now I need to make
> dhclient reread it (and apply the changes to /etc/resolv.conf).
>
> There seems to be no dhclient service in systemd, and I don't find
> any info about signalling dhclient with "kill -HUP" etc.
>
> "dhclient -x" or "dhclient -r" bring the network down (I've actually
> confirmed that on a test host), and I loathe to risk it on a remote box.
> A reboot seems too radical.
>
> Any ideas please?
>

A simple "dhclient" as root (or "sudo dhclient") without parameters should
be enough.
In syslog you should be able to see something like this afterwards:

Aug 26 06:10:41 mailtest dhclient[899]: DHCPDISCOVER on eth0 to
255.255.255.255 port 67 interval 7
Aug 26 06:10:41 mailtest dhclient[899]: DHCPOFFER of 192.168.15.23 from
192.168.15.1
Aug 26 06:10:41 mailtest dhclient[899]: DHCPREQUEST for 192.168.15.23 on
eth0 to 255.255.255.255 port 67
Aug 26 06:10:41 mailtest dhclient[899]: DHCPACK of 192.168.15.23 from
192.168.15.1
Aug 26 06:10:41 mailtest dhclient[899]: bound to 192.168.15.23 -- renewal
in 73845 seconds.


Re: lost dig

2019-02-19 Thread Claudio Kuenzler
On Tue, Feb 19, 2019 at 12:55 PM tony  wrote:

>
> > Isn't the alias defined in '~/.bashrc' or '~/.bash_aliases'?
> >
> no...
>

Maybe it's not an alias at all but rather an "alternative". Check
"update-alternatives --get-selections" if there is an entry for dig.


Re: lost dig

2019-02-19 Thread Claudio Kuenzler
On 2/19/2019 12:10 PM, tony wrote:
> > In my fiddling with DNS, I installed (as su) a python package from pypi
> > called 'dig'. It turned out to not be what I expected, so I abandoned it.
> >
> > However, now when I enter 'dig' on the command line, it runs this python
> > thing. So I uninstalled dig from python, using 'pip3 uninstall dig'.
> > That seemed to work fine, but now when I type 'dig' at the terminal, I
> > get bash: /usr/local/bin/dig: No such file or directory. Well, that's OK
> > because dig - the proper one - is at /usr/bin/dig.
> >
> > 'which dig' gives me '/usr/bin/dig/
> >
> > So, how do I now get the alias (if that's what it is) to point at the
> > right file?
>

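First check with "alias" whether there is really still an alias defined
which points to /usr/local/bin/dig.
You might also have to log out and log in again to clear your environment:
bash remembers the location of commands it has already run, and a fresh
shell (or "hash -r" in the current one) resets that cache.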


Status of LXC in Stretch?

2019-02-18 Thread Claudio Kuenzler
Dear all, LXC maintainers,

It seems that there hasn't been much going on concerning the LXC package(s)
in Debian 9 Stretch. The version is stuck at 2.0.7 without any patches
backported since Jan 2018. Yet there are known (important) bugs which break
LXC on Stretch.
For example, when using cgroup resource limits, bug
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=888647 occurs, which in
the end is a bug in the libpam-cgfs package.

Even in backports there is the following note in the changelog of the lxcfs
package:

lxcfs (2.0.8-1~bpo9+1) stretch-backports; urgency=medium

  * Team upload
  * Rebuild for stretch-backports.
  * This backport release is an alternative to 2.0.7-1 that has a couple of
issues, and shouldn't have reached stable.
See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=867619 for more
intel.

 -- Pierre-Elliott Bécue   Sat, 17 Nov 2018 09:01:07 +0100

In bug #888647 as well as in a discussion on linuxcontainers.org (
https://discuss.linuxcontainers.org/t/failed-creating-cgroups/272/10) a
possible solution is to remove the Debian package of libpam-cgfs and
instead install the Ubuntu package. Really?!
Although this workaround seems to work for some, it doesn't work for others
including the author of the last comment in bug #888647.

Meanwhile LXC 2.0.9 has been out since October 2017 (yes, 2017). Instead of
keeping a buggy 2.0.7, wouldn't it be better to include the latest
upstream version of the 2.0 LTS branch?

Note: LXC itself works fine for privileged containers _without_ resource
limits. But as soon as resource limits are used, this bug kicks in and
breaks LXC.

What's the current status with LXC and its related packages in Debian
Stretch? Can we expect a new upstream release, a bugfix or a new version
(3.0 LTS) made available in backports?

Thanks in advance for letting us know.


Re: OT: Current_Pending_Sector on /dev/sd?

2019-02-14 Thread Claudio Kuenzler
> ./check_smart.pl -g /dev/sd[a-z] -i ata
> OK: [/dev/sda] - Device is clean|
>

> Is it ok that is only return one drive?
>

You have to use double quotes so that the shell passes the pattern through
unexpanded; it is evaluated as an expression within the Perl plugin:

./check_smart -g "/dev/sd[a-z]" -i ata
OK: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean ---
[/dev/sdc] - Device is clean --- [/dev/sdd] - Device is clean|

But this is the variant for "lazy admins". ;-)
I suggest you use it on a single drive to obtain more information
(performance data).
Example:

./check_smart -d "/dev/sda" -i ata
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=4000
Start_Stop_Count=30 Reallocated_Sector_Ct=0 Seek_Error_Rate=0
Power_On_Hours=18183 Spin_Retry_Count=0 Calibration_Retry_Count=0
Power_Cycle_Count=30 Power-Off_Retract_Count=20 Load_Cycle_Count=249397
Temperature_Celsius=32 Reallocated_Event_Count=0 Current_Pending_Sector=0
Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

You can now parse these values and save them wherever you need to create a
history of the drive's SMART values.
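
A minimal Python sketch of that parsing step, assuming the output looks
like the single-drive example above (status text, then "|", then
space-separated label=value pairs). The function name is made up for
illustration, and only integer values are kept.

```python
def parse_perfdata(plugin_output: str) -> dict[str, int]:
    """Return the label=value pairs after the '|' separator as a dict."""
    _, _, perfdata = plugin_output.partition("|")
    values = {}
    for token in perfdata.split():
        label, sep, raw = token.partition("=")
        if not sep:
            continue  # not a label=value token
        # Perfdata may carry ';warn;crit;min;max' suffixes; keep the value.
        value = raw.split(";", 1)[0]
        try:
            values[label] = int(value)
        except ValueError:
            continue  # skip anything that is not a plain integer
    return values


SAMPLE = (
    "OK: no SMART errors detected. |Raw_Read_Error_Rate=0 "
    "Power_On_Hours=18183 Current_Pending_Sector=0"
)
smart_values = parse_perfdata(SAMPLE)
```

Here smart_values["Current_Pending_Sector"] is the number you would store
together with a timestamp on each run.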


Re: OT: Current_Pending_Sector on /dev/sd?

2019-02-13 Thread Claudio Kuenzler
On Wed, Feb 13, 2019 at 2:22 PM basti  wrote:

> hello,
> I have a raid6 with 4 disks. 2 of them show Current_Pending_Sector 1.
>

Hi Basti

Are you using mdadm for the raid-6 or a hardware raid controller?


> The disks has warranty till Apr. 2019 so I decide to replace them.
>

If there's only 1 current pending sector it could be difficult to get a
full replacement. HDDs have spare sectors which are used in such events
and are (or *should* be) capable of handling a few defective sectors. It
doesn't mean (yet) that the drive is defective.


>
> After I change the disk and install it on an other computer to overwrite
> with zero it the Current_Pending_Sector is gone.
>

Yes, I saw this too a couple of months ago on a remote NAS server. I
probably had the same reaction as you: I couldn't believe it. Especially as
the Current_Pending_Sector went to 0 and Reallocated_Sectors and
Offline_Uncorrectable stayed at the same value as before, too.


>
> What should I do? Whats our experience?
>

Continuously monitor your drive's SMART values and (if possible) store the
results in a database (RRD, Timeseries DB, you name it) to create graphs
from the values. You can use the check_smart.pl monitoring plugin as an
example. This will show you whether the number of defective sectors
increases or stays steady. If the bad sectors increase, it's just a matter
of time until the drive physically fails. You can see an example of such a
graph (rrd in this case) with increasing bad sectors over 5 weeks here:
https://www.claudiokuenzler.com/blog/469/multiple-several-ways-monitor-physical-hard-drive-disk
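
If you don't have RRD or a time series database at hand, even a small
SQLite file is enough to keep such a history. The following is only a
sketch under my own assumptions: the table layout and the function names
are invented for illustration, and the readings would come from whatever
tool you use to read the SMART attributes.

```python
import sqlite3
import time


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and if needed create) the table holding SMART samples."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS smart_history ("
        "ts REAL, device TEXT, attribute TEXT, value INTEGER)"
    )
    return conn


def record(conn, device, attribute, value, ts=None):
    """Store one reading; ts defaults to the current time."""
    conn.execute(
        "INSERT INTO smart_history VALUES (?, ?, ?, ?)",
        (time.time() if ts is None else ts, device, attribute, value),
    )
    conn.commit()


def growth(conn, device, attribute):
    """Newest value minus oldest value; > 0 means the counter increased."""
    rows = conn.execute(
        "SELECT value FROM smart_history WHERE device = ? AND attribute = ? "
        "ORDER BY ts",
        (device, attribute),
    ).fetchall()
    if len(rows) < 2:
        return 0  # not enough samples to say anything
    return rows[-1][0] - rows[0][0]
```

Calling record() from a cron job and alerting when growth() for
Current_Pending_Sector or Reallocated_Sector_Ct becomes positive gives
you the "is it getting worse?" signal without any graphing.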

As helpful as SMART is, never rely 100% on it, as drives may also fail
without any bad values in SMART.


Re: Bug with soft raid?

2019-02-13 Thread Claudio Kuenzler
Hello Steve,

As some of the other responders already said, check your drives' SMART
values.
But a disk may fail without any indication in the SMART table. I saw
this a couple of years ago and documented it here:
https://www.claudiokuenzler.com/blog/301/disk-failure-not-detected-by-smart-ata1-failed-command
The errors you've seen are also quite similar to the ones I saw (although
there are a couple of years in between, so Kernel messages might be a bit
different now).

Long story short: It was indeed a defective hard drive causing the problems
(and log entries).

On Wed, Feb 13, 2019 at 7:20 PM David Christensen 
wrote:

> On 2/12/19 12:48 PM, David Christensen wrote:
> > I had a Linux md RAID0 (mirror) ...
>
> Correction -- RAID1 is mirror.
>
>
> David
>
>


Re: hp server hardware monitoring

2014-07-30 Thread Claudio Kuenzler
On Wed, Jul 30, 2014 at 9:50 PM, Bonno Bloksma  wrote:

> Hi,
> >> [...]
> >> What may be relevant too is that on the g6 server Debian uses the
> >> CCISS drivers for the raid hardware, the volume shows up as
> >> /dev/cciss/c0d0
> >> On the g7 and g8 hardware the raid volume simply shows up as /dev/sda
> >
> > cciss has been superseded by hpsa, "The hpsa driver is intended to
> supplant the cciss driver for newer Smart Array controllers.", cf.
> > .
>
> Ok, thanks for the heads up.
>
> >> How can I get this to work under a g7 or g8 server. Do I need a newer
> >> version of the package of do I need a different package?
> >
> > I generally avoid installing third-party tools on the bulk of my
> servers, and for simple monitoring the packages cciss-vol-status and
> nagios-plugins-standard from Debian will suffice:
>
> I will try that too.  But
>
> >root@vz02:~# lsmod | grep -e cciss -e hpsa
> >hpsa   50787  2
>
> Ok, I have that too
> linein:~# lsmod | grep -e hpsa -e cciss
> hpsa   40765  2
> scsi_mod  162269  5 hpsa,libata,sd_mod,sg,sr_mod
>
> > root@vz02:~# cciss_vol_status /dev/sda
> > /dev/sda: (Smart Array P420i) RAID 5 Volume 0 status: OK.
>
> But I get...
> linein:~# cciss_vol_status /dev/sda
> cciss_vol_status: /dev/sda: Unknown SCSI device.
>
> Which is weird because I have (copy from iLO):
>   Model: HP Smart Array P420i Controller
>   Firmware: Version 3.22
>
> Now what?
>
> Bonno Bloksma
>
>
Did you give the plugin check_ilo2_health.pl a shot?
The plugin uses ILO to get the status of the hardware. It works fine for
servers running with at least ILO2.
Everything you see in ILO is exported and the plugin checks the state. On
some older server generations, some hardware parts were missing in ILO
(e.g. disks) but in recent servers/ILO versions (G7 and Gen8) the disks are
also being checked in ILO.

See
https://www.monitoringexchange.org/inventory/Check-Plugins/Hardware/Server/HP-%2528Compaq%2529/check_ilo2_health

I'm monitoring the hardware of 168 HP servers with this plugin.