Re: Kernel crash every day at 6:30am
On 2/15/22 2:10 PM, Alex wrote:
> Hi,
>
> Here's a bit of the kernel message from dmesg
>
> [ cut here ]
> WARNING: CPU: 4 PID: 633983 at kernel/exit.c:739 do_exit+0x37/0xa90
> general protection fault, probably for non-canonical address 0xcc2a8cfcb62a56a1: [#1] SMP PTI
> CPU: 4 PID: 633983 Comm: rsync Not tainted 5.14.18-200.fc34.x86_64 #1
> Hardware name: To be filled by O.E.M. To be filled by O.E.M./P8B-M Series, BIOS 6801 05/07/2018
> RIP: 0010:__bio_crypt_clone+0x28/0x60
>
> This morning I noticed it crashed again, but with a different kernel message. I also discovered there was a 224GB log file being backed up over the internet from a mail server with a misconfigured /etc/rsyslog.conf that rsync was copying when the kernel crashed.
>
> I've since removed the huge log file and disabled the log entry in rsyslog.conf, so I'll now continue to watch it, but it's still a legit kernel crash.
>
> aops:ext4_da_aops ino:bca118c dentry name:"rsyslog.log"
> flags: 0x17c0060010(lru|mappedtodisk|reclaim|node=0|zone=2|lastcpupid=0x1f)
> raw: 0017c0060010 e613d17e0208 e613d17e00c8 8e452d1070d0
> raw: 02fd3d40 0001 8e45031fd000
> page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
> [ cut here ]
> kernel BUG at mm/filemap.c:1516!
> invalid opcode: [#1] PREEMPT SMP PTI
> CPU: 3 PID: 922 Comm: md2_raid5 Not tainted 5.16.8-100.fc34.x86_64 #1

Is the hardware ok? An rsync job with encryption and RAID could stress the system (thermally too) and trigger instability. Try to test your hardware in other ways (repeated gzip and md5 checks, or memtest86+) to be sure.

Regards.
--
   Roberto Ragusa    mail at robertoragusa.it
___
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
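The "repeated gzip and md5 checks" idea from the message above could be scripted roughly as follows. This is only a sketch: the function name `stress_roundtrip`, the round count and the buffer size are assumptions made for illustration, and a clean run proves little compared with a long memtest86+ pass.

```python
import gzip
import hashlib
import os


def stress_roundtrip(rounds: int = 50, size: int = 1 << 20) -> int:
    """Compress and decompress one random buffer repeatedly.

    Returns how many rounds produced an MD5 digest that differs from the
    digest of the original buffer. Healthy hardware should give 0.
    (Function name and defaults are illustrative assumptions.)
    """
    data = os.urandom(size)
    reference = hashlib.md5(data).hexdigest()
    mismatches = 0
    for _ in range(rounds):
        # A bit flip in RAM or CPU during either step changes the digest.
        restored = gzip.decompress(gzip.compress(data))
        if hashlib.md5(restored).hexdigest() != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(f"{stress_roundtrip()} mismatching rounds")
```

Any non-zero result would point towards hardware rather than the kernel; zero mismatches only means this particular load did not trip anything.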
Re: Kernel crash every day at 6:30am
Hi,

> > >> Here's a bit of the kernel message from dmesg
> > >> [ cut here ]
> > >> WARNING: CPU: 4 PID: 633983 at kernel/exit.c:739 do_exit+0x37/0xa90
> > >> general protection fault, probably for non-canonical address
> > >> 0xcc2a8cfcb62a56a1: [#1] SMP PTI
> > >> CPU: 4 PID: 633983 Comm: rsync Not tainted 5.14.18-200.fc34.x86_64 #1
> > >> Hardware name: To be filled by O.E.M. To be filled by O.E.M./P8B-M
> > >> Series, BIOS 6801 05/07/2018
> > >> RIP: 0010:__bio_crypt_clone+0x28/0x60

This morning I noticed it crashed again, but with a different kernel message. I also discovered there was a 224GB log file being backed up over the internet from a mail server with a misconfigured /etc/rsyslog.conf that rsync was copying when the kernel crashed.

I've since removed the huge log file and disabled the log entry in rsyslog.conf, so I'll now continue to watch it, but it's still a legit kernel crash.

aops:ext4_da_aops ino:bca118c dentry name:"rsyslog.log"
flags: 0x17c0060010(lru|mappedtodisk|reclaim|node=0|zone=2|lastcpupid=0x1f)
raw: 0017c0060010 e613d17e0208 e613d17e00c8 8e452d1070d0
raw: 02fd3d40 0001 8e45031fd000
page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
[ cut here ]
kernel BUG at mm/filemap.c:1516!
invalid opcode: [#1] PREEMPT SMP PTI
CPU: 3 PID: 922 Comm: md2_raid5 Not tainted 5.16.8-100.fc34.x86_64 #1

Do I submit this to the Fedora bugzilla or the main kernel.org bugzilla?
Re: Kernel crash every day at 6:30am
Hi,

> >> Here's a bit of the kernel message from dmesg
> >> [ cut here ]
> >> WARNING: CPU: 4 PID: 633983 at kernel/exit.c:739 do_exit+0x37/0xa90
> >> general protection fault, probably for non-canonical address
> >> 0xcc2a8cfcb62a56a1: [#1] SMP PTI
> >> CPU: 4 PID: 633983 Comm: rsync Not tainted 5.14.18-200.fc34.x86_64 #1
> >> Hardware name: To be filled by O.E.M. To be filled by O.E.M./P8B-M
> >> Series, BIOS 6801 05/07/2018
> >> RIP: 0010:__bio_crypt_clone+0x28/0x60
> >
> >
> > bio_crypt_clone suggests something wrong in an encrypted block device.
> > Maybe corrupt data that rsync traverses during the backup?
> >
> > What does the output of "lsblk" look like for your system? What about
> > "lvs"?
>
> This is worth tossing into bugzilla, for "kernel", notwithstanding abrt's
> reluctance in dealing with it.

It's now gone two days without it having happened again, so I've rebooted and will watch it over the coming days.

Here's also the output from lsblk and an fsck run. I've also added a -v to the rsync backup script.
# fsck /dev/md2 -r -C
fsck from util-linux 2.36.2
e2fsck 1.45.6 (20-Mar-2020)
/dev/md2: clean, 26678858/274702336 files, 1534551536/2197600512 blocks
/dev/md2: status 0, rss 11532, real 2.183027, user 1.455968, sys 0.009868

# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0   2.7T  0 disk
└─sda1        8:1    0   2.7T  0 part
  └─md2       9:2    0   8.2T  0 raid5 /var/backup
sdb           8:16   0  55.9G  0 disk
├─sdb1        8:17   0  51.8G  0 part
│ └─md127     9:127  0  51.8G  0 raid1 /
├─sdb2        8:18   0   501M  0 part
│ └─md126     9:126  0 500.7M  0 raid1 /boot
├─sdb3        8:19   0    96M  0 part
│ └─md125     9:125  0  95.9M  0 raid1 /boot/efi
└─sdb4        8:20   0   3.5G  0 part  [SWAP]
sdc           8:32   0   2.7T  0 disk
└─sdc1        8:33   0   2.7T  0 part
  └─md2       9:2    0   8.2T  0 raid5 /var/backup
sdd           8:48   0  55.9G  0 disk
├─sdd1        8:49   0  51.8G  0 part
│ └─md127     9:127  0  51.8G  0 raid1 /
├─sdd2        8:50   0   501M  0 part
│ └─md126     9:126  0 500.7M  0 raid1 /boot
├─sdd3        8:51   0    96M  0 part
│ └─md125     9:125  0  95.9M  0 raid1 /boot/efi
└─sdd4        8:52   0   3.5G  0 part  [SWAP]
sde           8:64   0   2.7T  0 disk
└─sde1        8:65   0   2.7T  0 part
  └─md2       9:2    0   8.2T  0 raid5 /var/backup
sdf           8:80   0   3.6T  0 disk
└─sdf1        8:81   0   3.6T  0 part
  └─md2       9:2    0   8.2T  0 raid5 /var/backup
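Since the earlier suggestion was that `__bio_crypt_clone` might point at an encrypted block device, it may help to check mechanically whether any layer in the lsblk tree has TYPE `crypt`. The listing above shows only disk, part, raid1 and raid5. Here is a minimal sketch, assuming the JSON form produced by `lsblk --json`; the helper name `device_types` and the trimmed `SAMPLE` document are hand-written for illustration and are not output from this machine.

```python
import json


def device_types(lsblk_json: str) -> set:
    """Collect every "type" value from `lsblk --json` output, children included."""
    types = set()
    stack = list(json.loads(lsblk_json).get("blockdevices", []))
    while stack:
        dev = stack.pop()
        types.add(dev.get("type"))
        # Partitions, md arrays and dm-crypt mappings appear as nested children.
        stack.extend(dev.get("children", []))
    return types


# Hand-written, trimmed sample mirroring one branch of the listing (assumption).
SAMPLE = """
{"blockdevices": [
  {"name": "sda", "type": "disk", "children": [
    {"name": "sda1", "type": "part", "children": [
      {"name": "md2", "type": "raid5"}
    ]}
  ]}
]}
"""

if __name__ == "__main__":
    found = device_types(SAMPLE)
    print(sorted(found), "crypt" in found)
```

If `crypt` never shows up, the crash in a crypto-related block function is more likely memory corruption or a kernel bug than a damaged encrypted volume, though that is an inference rather than a diagnosis.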
Re: Kernel crash every day at 6:30am
Gordon Messmer writes:

> On 2/12/22 14:59, Alex wrote:
> > Here's a bit of the kernel message from dmesg
> > [ cut here ]
> > WARNING: CPU: 4 PID: 633983 at kernel/exit.c:739 do_exit+0x37/0xa90
> > general protection fault, probably for non-canonical address 0xcc2a8cfcb62a56a1: [#1] SMP PTI
> > CPU: 4 PID: 633983 Comm: rsync Not tainted 5.14.18-200.fc34.x86_64 #1
> > Hardware name: To be filled by O.E.M. To be filled by O.E.M./P8B-M Series, BIOS 6801 05/07/2018
> > RIP: 0010:__bio_crypt_clone+0x28/0x60
>
> bio_crypt_clone suggests something wrong in an encrypted block device. Maybe corrupt data that rsync traverses during the backup?
>
> What does the output of "lsblk" look like for your system? What about "lvs"?

This is worth tossing into bugzilla, for "kernel", notwithstanding abrt's reluctance in dealing with it.
Re: Kernel crash every day at 6:30am
On Sun, 13 Feb 2022 10:41:36 -0800 Gordon Messmer wrote:

> bio_crypt_clone suggests something wrong in an encrypted block device.
> Maybe corrupt data that rsync traverses during the backup?

Perhaps run the same rsync command with a -v option in a terminal and see if the crash happens on the same file every time.
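If the verbose output of each run is saved to a file, the "same file every time" question can be answered by comparing the last path printed before each crash. A rough sketch, assuming plain `rsync -v` output with one path per line; the helper names and the example paths in the usage note are hypothetical.

```python
# Lines that plain `rsync -v` prints around the file list (assumed format).
NOISE_PREFIXES = (
    "sending incremental file list",
    "receiving incremental file list",
    "sent ",
    "total size is ",
)


def last_file(log_text: str):
    """Return the last path-like line of one verbose rsync log, or None."""
    last = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line.startswith(NOISE_PREFIXES):
            continue
        last = line
    return last


def same_culprit(logs):
    """Return the shared final path if every log ends on the same file."""
    finals = {last_file(text) for text in logs}
    return finals.pop() if len(finals) == 1 else None
```

For example, two logs that both end on a made-up path such as `var/log/rsyslog.log` would make `same_culprit` return that path, while logs ending on different files give `None`, which would argue against one bad file being the trigger.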
Re: Kernel crash every day at 6:30am
On 2/12/22 14:59, Alex wrote:
> Here's a bit of the kernel message from dmesg
> [ cut here ]
> WARNING: CPU: 4 PID: 633983 at kernel/exit.c:739 do_exit+0x37/0xa90
> general protection fault, probably for non-canonical address 0xcc2a8cfcb62a56a1: [#1] SMP PTI
> CPU: 4 PID: 633983 Comm: rsync Not tainted 5.14.18-200.fc34.x86_64 #1
> Hardware name: To be filled by O.E.M. To be filled by O.E.M./P8B-M Series, BIOS 6801 05/07/2018
> RIP: 0010:__bio_crypt_clone+0x28/0x60

bio_crypt_clone suggests something wrong in an encrypted block device. Maybe corrupt data that rsync traverses during the backup?

What does the output of "lsblk" look like for your system? What about "lvs"?
Re: Kernel crash every day at 6:30am
Hi,

On Sat, Feb 12, 2022 at 9:26 PM Joe Zeff wrote:
>
> On 2/12/22 15:59, Alex wrote:
> > Anyone else experiencing similar problems with the latest kernels?
>
> The fact that it happens at the same time every day makes me wonder if
> there's some job that's in process causing it.

Yes, there is an automated backup using rsync around that time, but there should be no userland process that could ever cause a kernel crash.

CPU: 4 PID: 633983 Comm: rsync Not tainted 5.14.18-200.fc34.x86_64 #1

Thanks,
Alex
Re: Kernel crash every day at 6:30am
On 2/12/22 15:59, Alex wrote:
> Anyone else experiencing similar problems with the latest kernels?

The fact that it happens at the same time every day makes me wonder if there's some job that's in process causing it.
Re: kernel crash
On Wed, 2010-08-18 at 12:03 -0400, Steve Blackwell wrote:
> sensors indicated that the CPU fan and the power supply rails are monitored. Does anyone know if I can put something in smartd.conf to get reports on them and if so what?

SMART is for hard drives. If you want to monitor other things, you need to use another tool.

--
[...@localhost ~]$ uname -r
2.6.27.25-78.2.56.fc9.i686

Don't send private replies to my address, the mailbox is ignored. I read messages from the public lists.

--
users mailing list
users@lists.fedoraproject.org
To unsubscribe or change subscription options:
https://admin.fedoraproject.org/mailman/listinfo/users
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
Re: kernel crash
On Wed, 18 Aug 2010 08:44:16 +0300 Gilboa Davara gilb...@gmail.com wrote:

> On Tue, 2010-08-17 at 13:08 -0400, Steve Blackwell wrote:
> > On Wed, 18 Aug 2010 02:12:16 +0930 Tim ignored_mail...@yahoo.com.au wrote:
> > > On Tue, 2010-08-17 at 12:05 -0400, Steve Blackwell wrote:
> > > > I've been looking at my logs some more. I don't understand these messages:
> > > >
> > > > Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)
> > > > Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)
> > > > Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
> > > > Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal
> > >
> > > And the CPU overheating as well as your hard drive? Is the computer in a hot room? Are the fans working? Is the ventilation blocked? Is the computer wedged in between things that restrict airflow? Are things full of fluff and dust?
> >
> > Well, it would seem so but I don't trust the messages. It doesn't seem reasonable that the CPUs go overtemp and then immediately cool down enough to be OK.
>
> Actually it is possible. Your CPU has auto-throttle support. Read: when the CPU passes a certain temperature threshold, it automatically clocks down (or inserts NOPs) in order to prevent it from burning out. Nevertheless, if your machine's cooling is sufficient you shouldn't see this message.
> If your CPU's high and low water marks are the same (e.g. 90C), the CPU will reach 90C, throttle, and drop to 89C, all in one second.
>
> I'd suggest you configure lm_sensors and monitor the CPU and board temperature.
> $ sensors-detect
> $ /etc/init.d/lm_sensors restart
> $ sensors -s
> $ sensors
>
> - Gilboa
> P.S. can you post your hardware configuration?

Running sensors-detect produced the same /etc/sysconfig/lm_sensors file that I already had.

Running sensors shows this:

# sensors
atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage:      +1.42 V  (min =  +1.45 V, max =  +1.75 V)
+3.3 Voltage:       +1.68 V  (min =  +3.00 V, max =  +3.60 V)
+5.0 Voltage:       +1.62 V  (min =  +4.50 V, max =  +5.50 V)
+12.0 Voltage:     +11.98 V  (min = +11.20 V, max = +13.20 V)
CPU FAN Speed:     56250 RPM  (min =    0 RPM)
CHASSIS FAN Speed:     0 RPM  (min =    0 RPM)
POWER FAN Speed:       0 RPM  (min =    0 RPM)
CPU Temperature:    +62.0°C  (high = +90.0°C, crit = +125.0°C)
MB Temperature:     +49.0°C  (high = +70.0°C, crit = +125.0°C)
Power Temperature:  +24.0°C  (high = +80.0°C, crit = +125.0°C)

The first thing that jumps out at me is that I think I need a new PSU! How is this machine even running if 3 of the 4 voltages are low?

The second thing is that the temps are just fine. So why do I keep getting these messages in the logs? Perhaps because the power rails are low?

The chassis and power fans are 2-wire, so no data.

lshw dumps a lot of information. Anything in particular you are looking for?

I think I have solved my lockup problem. I'll write a separate post about that.

Thanks,
Steve
--
Changing lives one card at a time
http://www.send1cardnow.com
Re: kernel crash
On 19 August 2010 00:41, Steve Blackwell zep...@cfl.rr.com wrote:
> Running sensors-detect produced the same /etc/sysconfig/lm_sensors file that I already had. Running sensors shows this:
>
> +3.3 Voltage: +1.68 V (min = +3.00 V, max = +3.60 V)
> +5.0 Voltage: +1.62 V (min = +4.50 V, max = +5.50 V)
> CPU FAN Speed: 56250 RPM (min = 0 RPM)

The fan rpm value looks much higher than typical. The motherboard cannot possibly be running with the above reported voltages. More likely the sensors raw-to-reported scaling is incorrectly calibrated for your system.
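The comparison being done by eye here, each reading against the min/max printed next to it, can be sketched in a few lines. This assumes the voltage line layout shown in this thread (`label: value V (min = x V, max = y V)`); the regular expression and the function name `out_of_range` are illustrative only. A flagged rail means the reported number is outside its limits, which says nothing about whether the fault lies in the hardware or in the sensor scaling.

```python
import re

# Matches lines such as "+3.3 Voltage: +1.68 V (min = +3.00 V, max = +3.60 V)".
VOLT_RE = re.compile(
    r"^(?P<label>.+?):\s*(?P<value>[+-]?\d+\.\d+) V\s+"
    r"\(min =\s*(?P<lo>[+-]?\d+\.\d+) V, max =\s*(?P<hi>[+-]?\d+\.\d+) V\)"
)


def out_of_range(sensors_text: str):
    """Return (label, value, min, max) for each voltage outside its limits."""
    flagged = []
    for line in sensors_text.splitlines():
        match = VOLT_RE.match(line.strip())
        if match is None:
            continue  # fan, temperature and header lines are ignored
        value, lo, hi = (float(match.group(k)) for k in ("value", "lo", "hi"))
        if not lo <= value <= hi:
            flagged.append((match.group("label"), value, lo, hi))
    return flagged
```

Fed the four voltage lines posted earlier in the thread, this would flag Vcore, +3.3 and +5.0 and leave +12.0 alone, matching the "3 of the 4" count.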
Re: kernel crash [SOLVED]
On Tue, 17 Aug 2010 18:07:18 +0300 Gilboa Davara gilb...@gmail.com wrote:

> On Tue, 2010-08-17 at 09:44 -0400, Steve Blackwell wrote:
> > I leave my computer on 24/7 so that my backups can run at night. Lately, it has been crashing during the night usually leaving no trace of what happened. Last night it crashed but left this in /var/log/messages:
> >
> > Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> > Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >
> > Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done? Any other comments or suggestions?
>
> Hello Steve,
>
> This is not a crash. The kjournald kernel process (which handles various file-system tasks) was blocked for more than 120 seconds.
> Your assumption that the HD went into some type of sleep/suspend mode during write sounds reasonable to me.

My machine locked up again yesterday evening and my inner Grissom had been nagging me all day. Something just didn't make sense. If it was a temperature problem, then why did it always happen after long periods of inactivity and usually at night when the temperatures should be at their coolest?

I went back and looked at the logs from yesterday. This time I found that the last log written was pm-suspend.log. (Ah-ha!) At the bottom of this log are these lines:

...
/usr/lib/pm-utils/sleep.d/98smart-kernel-video suspend suspend: success.
/usr/lib/pm-utils/sleep.d/99hd-apm-restore.hook suspend suspend: Advanced Power Management not supported by device sdc.
Advanced Power Management not supported by device sda.
Advanced Power Management not supported by device sdb.
success.
/usr/lib/pm-utils/sleep.d/99video suspend suspend: kernel.acpi_video_flags = 0
success.
Tue Aug 17 21:42:23 EDT 2010: performing suspend

I found that I have 2 screensavers enabled, an X screensaver and a GNOME screensaver. One of them had power management enabled and set to suspend after 2 hrs of inactivity. I disabled the X screensaver and disabled power management. Last night the machine ran all night with no problems.

It looks like the power management was able to put the disks to sleep but not wake them up because power management is not supported by the disks. This problem started a few weeks ago, so I suspect an update of the screensaver on Aug 6 started it.

So nothing to do with overheating, but I am learning more about smartd.

Steve.
Re: kernel crash
On Tue, 2010-08-17 at 09:44 -0400, Steve Blackwell wrote:
> I leave my computer on 24/7 so that my backups can run at night. Lately, it has been crashing during the night usually leaving no trace of what happened. Last night it crashed but left this in /var/log/messages:
>
> Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done? Any other comments or suggestions?

Hello Steve,

This is not a crash. The kjournald kernel process (which handles various file-system tasks) was blocked for more than 120 seconds.
Your assumption that the HD went into some type of sleep/suspend mode during write sounds reasonable to me.

124C seems -very- hot. Even during heavy I/O.
Two things spring to mind:
A. Is it a normal desktop SATA drive or high-speed SCSI/SAS drive?
B. Please post the SMART log of the drive. (smartctl -a /dev/sdX).

- Gilboa
Re: kernel crash
On Tue, 17 Aug 2010 18:07:18 +0300 Gilboa Davara gilb...@gmail.com wrote:

> On Tue, 2010-08-17 at 09:44 -0400, Steve Blackwell wrote:
> > I leave my computer on 24/7 so that my backups can run at night. Lately, it has been crashing during the night usually leaving no trace of what happened. Last night it crashed but left this in /var/log/messages:
> >
> > Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> > Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >
> > Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done? Any other comments or suggestions?
>
> Hello Steve,
>
> This is not a crash. The kjournald kernel process (which handles various file-system tasks) was blocked for more than 120 seconds.
> Your assumption that the HD went into some type of sleep/suspend mode during write sounds reasonable to me.
>
> 124C seems -very- hot. Even during heavy I/O.
> Two things spring to mind:
> A. Is it a normal desktop SATA drive or high-speed SCSI/SAS drive?
> B. Please post the SMART log of the drive. (smartctl -a /dev/sdX).
>
> - Gilboa

Hello Gilboa,

Yes, I realize that it was not a crash. When I first saw the kernel messages I thought it was and started writing the e-mail. I neglected to correct the subject line after I actually read the messages. Sorry about that.

I had already run the command:

smartctl -t long /dev/sdb

before I got your reply. The results should be ready soon.

I've been looking at my logs some more. I don't understand these messages:

Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal

These messages are repeated every hour or so. It seems unlikely that every time the threshold is exceeded, it immediately (within one second) drops back again. What is going on here?

The drive is an old IDE drive: WDC WD1600JB-00F

Thanks,
Steve
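To see how often the throttling really happens, the "total events" counters in these kernel lines can be pulled out of /var/log/messages and compared between two points in time. A small sketch, assuming the exact message wording quoted above; the function name `throttle_totals` is made up for illustration.

```python
import re

# Wording taken from the log lines quoted in this thread.
THROTTLE_RE = re.compile(
    r"kernel: (CPU\d+): Temperature above threshold, "
    r"cpu clock throttled \(total events = (\d+)\)"
)


def throttle_totals(log_text: str) -> dict:
    """Return the highest 'total events' counter seen for each CPU."""
    totals = {}
    for match in THROTTLE_RE.finditer(log_text):
        cpu, count = match.group(1), int(match.group(2))
        totals[cpu] = max(count, totals.get(cpu, 0))
    return totals
```

Running it over the log at two different times and subtracting the counters gives a throttle rate, which is easier to reason about than the individual messages.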
Re: kernel crash
On 08/17/2010 06:44 AM, Steve Blackwell wrote:
> I leave my computer on 24/7 so that my backups can run at night. Lately, it has been crashing during the night usually leaving no trace of what happened. Last night it crashed but left this in /var/log/messages:
>
> Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 17 01:04:56 steve kernel: kjournald D 2743 0 1960 2 0x0080
> Aug 17 01:04:56 steve kernel: cf98fd9c 0046 ff2f442e 2743 00032558 f15c756c cf82d400
> Aug 17 01:04:56 steve kernel: c0a5e6ac c0a63140 f15c756c c0a63140 c0a63140 cf98fd74 c05b61ef f1714e18
> Aug 17 01:04:56 steve kernel: 0001 2743 f15c72c0 b39690c0 1b48082c f6630a60 c2208140
> Aug 17 01:04:56 steve kernel: Call Trace:
> Aug 17 01:04:56 steve kernel: [c05b61ef] ? cfq_may_queue+0x48/0xa8
> Aug 17 01:04:56 steve kernel: [c0793ef7] io_schedule+0x5f/0x98
> Aug 17 01:04:56 steve kernel: [c05ac02f] get_request_wait+0xc7/0x13c
> Aug 17 01:04:56 steve kernel: [c0454641] ? autoremove_wake_function+0x0/0x34
> Aug 17 01:04:56 steve kernel: [c05ac4a4] __make_request+0x27f/0x386
> Aug 17 01:04:56 steve kernel: [c04cebd4] ? __slab_alloc+0x269/0x3f6
> Aug 17 01:04:56 steve kernel: [c05ab011] generic_make_request+0x286/0x2d0
> Aug 17 01:04:56 steve kernel: [c04a77e5] ? mempool_alloc_slab+0x13/0x15
> Aug 17 01:04:56 steve kernel: [c04a78b1] ? mempool_alloc+0x5c/0xf2
> Aug 17 01:04:56 steve kernel: [c05ab122] submit_bio+0xc7/0xe0
> Aug 17 01:04:56 steve kernel: [c04fc9d3] ? bio_alloc_bioset+0x2a/0xb9
> Aug 17 01:04:56 steve kernel: [c04f9038] submit_bh+0xf4/0x114
> Aug 17 01:04:56 steve kernel: [c0562f74] journal_commit_transaction+0x38b/0xcc7
> Aug 17 01:04:56 steve kernel: [c044747a] ? lock_timer_base+0x26/0x45
> Aug 17 01:04:56 steve kernel: [c0447696] ? try_to_del_timer_sync+0x5e/0x66
> Aug 17 01:04:56 steve kernel: [c0565f1d] kjournald+0xb8/0x1cc
> Aug 17 01:04:56 steve kernel: [c0454641] ? autoremove_wake_function+0x0/0x34
> Aug 17 01:04:56 steve kernel: [c0565e65] ? kjournald+0x0/0x1cc
> Aug 17 01:04:56 steve kernel: [c0454409] kthread+0x64/0x69
> Aug 17 01:04:56 steve kernel: [c04543a5] ? kthread+0x0/0x69
> Aug 17 01:04:56 steve kernel: [c04041e7] kernel_thread_helper+0x7/0x10
>
> This happened in the middle of the backup which started at 1:00am and finished (successfully) at 1:28am, so perhaps the backup blocked the kjournald process, but it didn't crash the computer because there are later messages in the backup log and the messages file. The last entry in the messages file is:
>
> Aug 17 02:03:55 steve smartd[2347]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 167 to 168
> Aug 17 02:03:55 steve smartd[2347]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 122 to 124
>
> Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done? Any other comments or suggestions?
>
> Thanks
> Steve

Hi Steve,

REPLACE THE DRIVE IMMEDIATELY!! Otherwise, you are courting disaster! See if it is still under warranty and ask the manufacturer for an RMA.
Re: kernel crash
On Tue, 2010-08-17 at 12:05 -0400, Steve Blackwell wrote:
> I've been looking at my logs some more. I don't understand these messages:
>
> Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)
> Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)
> Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
> Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal

And the CPU overheating as well as your hard drive? Is the computer in a hot room? Are the fans working? Is the ventilation blocked? Is the computer wedged in between things that restrict airflow? Are things full of fluff and dust?
Re: kernel crash
On Tue, 17 Aug 2010 12:05:44 -0400 Steve Blackwell zep...@cfl.rr.com wrote:

> On Tue, 17 Aug 2010 18:07:18 +0300 Gilboa Davara gilb...@gmail.com wrote:
> > On Tue, 2010-08-17 at 09:44 -0400, Steve Blackwell wrote:
> > > I leave my computer on 24/7 so that my backups can run at night. Lately, it has been crashing during the night usually leaving no trace of what happened. Last night it crashed but left this in /var/log/messages:
> > >
> > > Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> > > Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >
> > > Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done? Any other comments or suggestions?
> >
> > Hello Steve,
> >
> > This is not a crash. The kjournald kernel process (which handles various file-system tasks) was blocked for more than 120 seconds.
> > Your assumption that the HD went into some type of sleep/suspend mode during write sounds reasonable to me.
> >
> > 124C seems -very- hot. Even during heavy I/O.
> > Two things spring to mind:
> > A. Is it a normal desktop SATA drive or high-speed SCSI/SAS drive?
> > B. Please post the SMART log of the drive. (smartctl -a /dev/sdX).
> >
> > - Gilboa
>
> Hello Gilboa,
>
> Yes, I realize that it was not a crash. When I first saw the kernel messages I thought it was and started writing the e-mail. I neglected to correct the subject line after I actually read the messages. Sorry about that.
>
> I had already run the command:
>
> smartctl -t long /dev/sdb
>
> before I got your reply. The results should be ready soon.
>
> I've been looking at my logs some more. I don't understand these messages:
>
> Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)
> Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)
> Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
> Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal
>
> These messages are repeated every hour or so. It seems unlikely that every time the threshold is exceeded, it immediately (within one second) drops back again. What is going on here?
>
> The drive is an old IDE drive: WDC WD1600JB-00F
>
> Thanks,
> Steve

Well, the long self test passed. Here is the result of

# smartctl -a /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE family
Device Model:     WDC WD1600JB-00FUA0
Serial Number:    WD-WCAES1024695
Firmware Version: 15.05R15
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Aug 17 12:36:35 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85)
    Offline data collection activity was aborted by an interrupting command from host.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (5073) seconds.
Offline data collection capabilities: (0x79)
    SMART execute Offline immediate.
    No Auto Offline data collection support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    No General Purpose Logging support.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 67)
Re: kernel crash
On Wed, 18 Aug 2010 02:12:16 +0930 Tim ignored_mail...@yahoo.com.au wrote:
On Tue, 2010-08-17 at 12:05 -0400, Steve Blackwell wrote:

I've been looking at my logs some more. I don't understand these messages:

Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal

And the CPU overheating as well as your hard drive? Is the computer in a hot room? Are the fans working? Is the ventilation blocked? Is the computer wedged in between things that restrict airflow? Are things full of fluff and dust?

Well, it would seem so, but I don't trust the messages. It doesn't seem reasonable that the CPUs go overtemp and then immediately cool down enough to be OK.

As for your other questions, I spent the weekend replacing a broken cooling fan, removing the dust build-up, rearranging the internal components to maximize the space between them and rearranging my office to place the computer in a more open space. None of these actions appear to have helped.

Steve

--
Changing lives one card at a time http://www.send1cardnow.com

--
users mailing list users@lists.fedoraproject.org
To unsubscribe or change subscription options: https://admin.fedoraproject.org/mailman/listinfo/users
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
Re: kernel crash
On 18 August 2010 09:22, Bill Davidsen david...@tmr.com wrote:

If this line is for real:

194 Temperature_Celsius 0x0022 116 253 000 Old_age Always - 34

Then your drive is running hotter than boiling water and has been close to melting point of solder. In spite of that the error count is fine, but holding your hand an inch or so from the drive should tell you if this is that hot.

Having rtfm, I think the values reported there are normalised values, not degrees Celsius. See http://sourceforge.net/apps/trac/smartmontools/wiki/FAQ#Whyismydisktemperaturesreportedbysmartdas150Celsius and read 'man smartctl' under option -A
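To illustrate the normalised-versus-raw point in the reply above, here is a small Python sketch that splits the quoted attribute line into the ten columns that `smartctl -A` prints (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The `SmartAttribute` type and `parse_attribute_line` helper are hypothetical names, not part of smartmontools, and reading the raw value as degrees Celsius is an assumption that holds for many drives but is vendor-specific.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    """One row of the `smartctl -A` attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    flag: str
    value: int        # normalised current value, NOT a physical unit
    worst: int        # normalised worst value seen so far
    thresh: int       # normalised failure threshold
    attr_type: str
    updated: str
    when_failed: str
    raw_value: str    # vendor-specific raw reading


def parse_attribute_line(line: str) -> SmartAttribute:
    # Split into at most 10 fields so a raw value containing spaces stays whole.
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"expected 10 columns, got {len(parts)}: {line!r}")
    return SmartAttribute(
        int(parts[0]), parts[1], parts[2],
        int(parts[3]), int(parts[4]), int(parts[5]),
        parts[6], parts[7], parts[8], parts[9].strip(),
    )


# The line quoted in the message above.
line = "194 Temperature_Celsius 0x0022 116 253 000 Old_age Always - 34"
attr = parse_attribute_line(line)

# 116 and 253 are normalised VALUE/WORST columns; the raw column reads 34.
# Assumption: for this attribute the raw value is the temperature in Celsius.
print(f"{attr.name}: normalised={attr.value} worst={attr.worst} raw={attr.raw_value}")
```

Read this way, the 116 and 253 figures are the normalised columns, and only the last column is a candidate for an actual temperature.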
Re: kernel crash
On Tue, 2010-08-17 at 13:08 -0400, Steve Blackwell wrote:
On Wed, 18 Aug 2010 02:12:16 +0930 Tim ignored_mail...@yahoo.com.au wrote:
On Tue, 2010-08-17 at 12:05 -0400, Steve Blackwell wrote:

I've been looking at my logs some more. I don't understand these messages:

Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal

And the CPU overheating as well as your hard drive? Is the computer in a hot room? Are the fans working? Is the ventilation blocked? Is the computer wedged in between things that restrict airflow? Are things full of fluff and dust?

Well it would seem so but I don't trust the messages. It doesn't seem reasonable that the CPUs go overtemp and then immediately cool down enough to be OK.

Actually it is possible. Your CPU has auto-throttle support. Read: when the CPU passes a certain temperature threshold, it automatically clocks down (or inserts NOPs) in order to prevent it from burning out. Nevertheless, if your machine's cooling is sufficient you shouldn't see this message.

If your CPU's high and low water marks are the same (e.g. 90C), the CPU will reach 90C, throttle, and drop to 89C - all in one second.

I'd suggest you configure lm_sensors and monitor the CPU and board temperature.

$ sensors-detect
$ /etc/init.d/lm_sensors restart
$ sensors -s
$ sensors

- Gilboa

P.S. Can you post your hardware configuration?
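As a rough way to check the "throttle and recover within one second" explanation against a log, the following Python sketch scans syslog-style lines for the two kernel messages quoted above and reports each CPU that logged both a throttle and a recovery under the same timestamp. The regular expression and the helper names (`parse_events`, `same_second_recoveries`) are illustrative assumptions based only on the message format shown in this thread; other kernel versions may word these messages differently.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, ... (total events = 455)
#   Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
THROTTLE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"CPU(?P<cpu>\d+):\s+"
    r"(?P<event>Temperature above threshold|Temperature/speed normal)"
    r"(?:.*\(total events = (?P<total>\d+)\))?"
)


def parse_events(lines):
    """Return (timestamp, cpu, event, total_events_or_None) for each matching line."""
    events = []
    for line in lines:
        m = THROTTLE_RE.match(line)
        if m:
            total = int(m["total"]) if m["total"] else None
            events.append((m["stamp"], int(m["cpu"]), m["event"], total))
    return events


def same_second_recoveries(events):
    """List (timestamp, cpu) pairs that both throttled and recovered in one second."""
    seen = defaultdict(set)
    for stamp, cpu, event, _total in events:
        seen[(stamp, cpu)].add(event)
    return sorted(key for key, kinds in seen.items() if len(kinds) == 2)


log = [
    "Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)",
    "Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)",
    "Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal",
    "Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal",
]

events = parse_events(log)
recoveries = same_second_recoveries(events)
for stamp, cpu in recoveries:
    print(f"CPU{cpu} throttled and recovered within the same second at {stamp}")
```

Running this over a full /var/log/messages would show how often the pattern repeats, which can then be compared with the `sensors` readings.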
Re: kernel crash
On Tue, 2010-08-17 at 12:48 -0400, Steve Blackwell wrote:
On Tue, 17 Aug 2010 12:05:44 -0400 Steve Blackwell zep...@cfl.rr.com wrote:
On Tue, 17 Aug 2010 18:07:18 +0300 Gilboa Davara gilb...@gmail.com wrote:
On Tue, 2010-08-17 at 09:44 -0400, Steve Blackwell wrote:

I leave my computer on 24/7 so that my backups can run at night. Lately, it has been crashing during the night, usually leaving no trace of what happened. Last night it crashed but left this in /var/log/messages:

Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done? Any other comments or suggestions?

Hello Steve,

This is not a crash. The kjournald kernel process (which handles various file-system tasks) was blocked for more than 120 seconds. Your assumption that the HD went into some type of sleep/suspend mode during a write sounds reasonable to me.

124C seems -very- hot, even during heavy I/O. Two things spring to mind:

A. Is it a normal desktop SATA drive or a high-speed SCSI/SAS drive?
B. Please post the SMART log of the drive (smartctl -a /dev/sdX).

- Gilboa

Hello Gilboa,

Yes, I realize that it was not a crash. When I first saw the kernel messages I thought it was and started writing the e-mail. I neglected to correct the subject line after I actually read the messages. Sorry about that.

I had already run the command:

smartctl -t long /dev/sdb

before I got your reply. The results should be ready soon.

I've been looking at my logs some more.
I don't understand these messages:

Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal

These messages are repeated every hour or so. It seems unlikely that every time the threshold is exceeded, it immediately (within one second) drops back again. What is going on here?

The drive is an old IDE drive: WDC WD1600JB-00F

Thanks,
Steve

Well, the long self test passed. Here is the result of

# smartctl -a /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE family
Device Model:     WDC WD1600JB-00FUA0
Serial Number:    WD-WCAES1024695
Firmware Version: 15.05R15
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Aug 17 12:36:35 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85)
    Offline data collection activity was aborted by an interrupting command from host.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (5073) seconds.
Offline data collection capabilities: (0x79)
    SMART execute Offline immediate.
    No Auto Offline data collection support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    No
Re: Kernel crash message in F12 - anyone explain?
As far as I know this is not a crash, but rather a suboptimal (slower) path taken somewhere. I don't think there's need to worry about this...

- Clemens

2010/1/30 Mike Cloaked mike.cloa...@gmail.com:

Today I got the following after abrt popped up - I don't know what this refers to or why the crash happened!

Jan 30 14:17:03 home1 kernel: [ cut here ]
Jan 30 14:17:03 home1 kernel: WARNING: at fs/notify/inotify/inotify_fsnotify.c:129 idr_callback+0x32/0x56() (Not tainted)
Jan 30 14:17:03 home1 kernel: Hardware name: OptiPlex 960
Jan 30 14:17:03 home1 kernel: inotify closing but id=0 for entry=ef4832c0 in group=f006f480 still in idr. Probably leaking memory
Jan 30 14:17:03 home1 kernel: Modules linked in: fuse ipt_MASQUERADE bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq iptable_nat nf_nat iptable_mangle ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath kvm uinput usblp snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep iTCO_wdt snd_seq iTCO_vendor_support e1000e snd_seq_device snd_pcm snd_timer serio_raw ppdev snd parport_pc i2c_i801 soundcore parport dcdbas snd_page_alloc wmi pata_acpi ata_generic nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: microcode]
Jan 30 14:17:03 home1 kernel: Pid: 2413, comm: imap Not tainted 2.6.31.12-174.2.3.fc12.i686.PAE #1
Jan 30 14:17:03 home1 kernel: Call Trace:
Jan 30 14:17:03 home1 kernel: [c043db4b] warn_slowpath_common+0x70/0x87
Jan 30 14:17:03 home1 kernel: [c04ede48] ? idr_callback+0x32/0x56
Jan 30 14:17:03 home1 kernel: [c043dba0] warn_slowpath_fmt+0x29/0x2c
Jan 30 14:17:03 home1 kernel: [c04ede48] idr_callback+0x32/0x56
Jan 30 14:17:03 home1 kernel: [c059ee84] idr_for_each+0x5c/0x97
Jan 30 14:17:03 home1 kernel: [c04ede16] ? idr_callback+0x0/0x56
Jan 30 14:17:03 home1 kernel: [c04ec4fc] ? fsnotify_put_event+0x48/0x4b
Jan 30 14:17:03 home1 kernel: [c04ede05] inotify_free_group_priv+0x1a/0x2b
Jan 30 14:17:03 home1 kernel: [c04ec5db] fsnotify_final_destroy_group+0x1e/0x28
Jan 30 14:17:03 home1 kernel: [c04ec69f] fsnotify_put_group+0x75/0x78
Jan 30 14:17:03 home1 kernel: [c04edfbd] inotify_release+0x1e/0x28
Jan 30 14:17:03 home1 kernel: [c04c9dbf] __fput+0xed/0x184
Jan 30 14:17:03 home1 kernel: [c04c9e6e] fput+0x18/0x1a
Jan 30 14:17:03 home1 kernel: [c04c7311] filp_close+0x56/0x60
Jan 30 14:17:03 home1 kernel: [c04c737b] sys_close+0x60/0x8f
Jan 30 14:17:03 home1 kernel: [c0408fbb] sysenter_do_call+0x12/0x28
Jan 30 14:17:03 home1 kernel: ---[ end trace 934d4ab904d334bf ]---
Jan 30 14:17:03 home1 kernel: entry-group=(null) inode=(null) wd=1024
Jan 30 14:17:52 home1 named[1064]: client 127.0.0.1#36927: RFC 1918 response from Internet for 1.122.168.192.in-addr.arpa
Jan 30 14:18:17 home1 abrt: Kerneloops: Reported 1 kernel oopses to Abrt
Jan 30 14:18:17 home1 abrtd: Directory 'kerneloops-1264861097-1' creation detected
Jan 30 14:18:17 home1 abrtd: Getting local universal unique identification
Jan 30 14:18:17 home1 abrtd: New crash, saving
Jan 30 14:18:17 home1 abrtd: RunApp('/var/cache/abrt/kerneloops-1264861097-1','test x`cat component` = xxorg-x11-server-Xorg cp /var/log/Xorg.0.log .')
Jan 30 14:18:17 home1 abrtd: Getting local universal unique identification
Jan 30 14:18:54 home1 named[1064]: client 127.0.0.1#53596: RFC 1918 response from Internet for 1.122.168.192.in-addr.arpa
Jan 30 14:19:55 home1 abrtd: Getting crash infos...
Jan 30 14:19:56 home1 named[1064]: client 127.0.0.1#44889: RFC 1918 response from Internet for 1.122.168.192.in-addr.arpa
Jan 30 14:20:12 home1 abrtd: Creating report...
Jan 30 14:20:12 home1 abrtd: Getting local universal unique identification
Jan 30 14:20:12 home1 abrtd: Getting local universal unique identification

Anyone interpret this for me?

Thanks

--
View this message in context: http://n3.nabble.com/Kernel-crash-message-in-F12-anyone-explain-tp178722p178722.html
Sent from the Fedora Users mailing list archive at Nabble.com.