Re: [CentOS] Server locking up everyday around 3:30 AM
PJ wrote:
> This may or may not be CentOS related, but am out of ideas at this point
> and wanted to bounce this off the list.
>
> I'm running a CentOS 5.5 server, running the latest kernel
> 2.6.18-194.32.1.el5.
>
> Almost everyday around 3:30 AM the server completely locks up and has to
> be power cycled before it will come back online.
> (this means someone has to wake up and reboot the server, oh how I love
> being an internet janitor! :)

Please log off the Internet. We are cleaning it. You may log back on later.

> I was able to pull this from /var/log/messages, this happens just seconds
> before locking up completely...
>
> Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.
> Mar 8 03:33:19 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 8 03:33:19 web1 kernel: wget D 810001004420 0 13608 13607 (NOTLB)
> Mar 8 03:33:19 web1 kernel: 81007bc7bc78 0086 81007bc7bd88 81000100d3f8
> Mar 8 03:33:19 web1 kernel: 81007bc7bbf0 0007 8100849db0c0 80308b60
> Mar 8 03:33:19 web1 kernel: 00013a2964cdf439 3237 8100849db2a8 64c82eae
> Mar 8 03:33:19 web1 kernel: Call Trace:
> Mar 8 03:33:20 web1 kernel: [] __mutex_lock_slowpath+0x60/0x9b

Anyone else smell an OOM killer? But it's clearly whatever the wget's after that's killing the system.

mark
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
Re: [CentOS] Server locking up everyday around 3:30 AM
On Fri, Mar 11, 2011 at 10:06 AM, mark wrote:
> Anyone else smell an OOM killer? But it's clearly whatever the wget's
> after that's killing the system.

What makes no sense to me is that this runs every 5 minutes all day, but only around 3:30 AM does it lock up.

There is nothing in the log that suggests the kernel is having to kill processes because it is out of resources. No "httpd invoked oom-killer" etc., which I have seen before in other situations.

http://bugs.centos.org/view.php?id=4515 sounds like what I have going on, but not with kjournald of course...

Thanks,
--
PJ
Re: [CentOS] Server locking up everyday around 3:30 AM
centos-boun...@centos.org wrote:
> What makes no sense to me is this runs every 5 minutes all day, but
> only around 3:30 AM does it lock up.
>
> http://bugs.centos.org/view.php?id=4515 sounds like what I have going
> on, but not with kjournald of course...
>
> Thanks,

Can you skip the runs from 3 to 4 AM (or 3:15 to 3:45)? It might be what the target of wget is doing on *their* end that makes your access fatal to your machine.

Insert spiffy .sig here:
Life is complex: it has both real and imaginary parts.
//me
Re: [CentOS] Server locking up everyday around 3:30 AM
On 3/11/2011 12:20 PM, PJ wrote:
>
> What makes no sense to me is this runs every 5 minutes all day, but
> only around 3:30 AM does it lock up.

When did it start happening? Did it correspond to any hardware change or software update that you can pin down?

> http://bugs.centos.org/view.php?id=4515 sounds like what I have going
> on, but not with kjournald of course...

Is anything else happening that might make the disks busy around then? Maybe the raid-check job in cron.weekly is still running?

--
Les Mikesell
lesmikes...@gmail.com
Re: [CentOS] Server locking up everyday around 3:30 AM
PJ wrote:
> What makes no sense to me is this runs every 5 minutes all day, but
> only around 3:30 AM does it lock up.
>
> There is nothing in the log that suggests the kernel is having to kill
> processes because it is out of resources.
>
> No "httpd invoked oom-killer" etc... which I have seen before in other
> situations.
>
> http://bugs.centos.org/view.php?id=4515 sounds like what I have going
> on, but not with kjournald of course...

Couple things: a few weeks ago, we were getting OOM Killer running with no log entries, but that was due to someone starting a parallel processing job that wanted all the cores... and near the end, wanted half again the memory, and *all* the threads hit that point apparently so fast OOM Killer didn't have time or memory to run.

Another thing: it may be running every five minutes, but you might want to look at what it gets at 03:30 that might be different from the rest of the day, such as a major backup, or an entire day's reconciliations, complete with gigabytes of scans.

mark
Re: [CentOS] Server locking up everyday around 3:30 AM
Les Mikesell wrote:
> On 3/11/2011 12:20 PM, PJ wrote:
>>
>> What makes no sense to me is this runs every 5 minutes all day, but
>> only around 3:30 AM does it lock up.
>
> When did it start happening? Did it correspond to any hardware change
> or software update that you can pin down?
>
>> http://bugs.centos.org/view.php?id=4515 sounds like what I have going
>> on, but not with kjournald of course...
>
> Is anything else happening that might make the disks busy around then?
> Maybe the raid-check job in cron.weekly is still running?

Any chance of something *not* on your server, such as a router or firewall, doing something then?

mark
Re: [CentOS] Server locking up everyday around 3:30 AM
> PJ wrote:
>> Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than
>> 120 seconds.

Check the number of dirty pages:

    grep Dirty /proc/meminfo

relative to the dirty_ratio setting:

    cat /proc/sys/vm/dirty_ratio

to see if the system is going into synchronous flush mode around that time (especially if dirty_ratio is large and you have a lot of physical memory). This is what I usually see as the cause of the "blocked for more than" message. I've also found that it can be several minutes, and up to 20 minutes, before the system recovers (but recover it always does).

-Steve
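One rough way to follow Steve's suggestion is to record both numbers side by side in the minutes before the hang. The sketch below is not from the thread: the function names `dirty_percent` and `dirty_report` are made up for illustration, and using MemTotal as the base is a simplifying assumption (the kernel derives its real threshold from a smaller pool of dirtyable memory), so the figure is only an indicator.

```shell
#!/bin/bash
# Hypothetical helpers: express Dirty as a percentage of MemTotal so it can
# be eyeballed against vm.dirty_ratio. MemTotal is an approximation of the
# memory the kernel actually uses for its threshold.

# dirty_percent MEMINFO_FILE -> integer percent of total memory that is dirty
dirty_percent() {
    awk '/^MemTotal:/ {total = $2}
         /^Dirty:/    {dirty = $2}
         END { if (total > 0) printf "%d\n", (dirty * 100) / total }' "$1"
}

# dirty_report MEMINFO_FILE RATIO_FILE -> one-line summary for a log file
dirty_report() {
    local pct ratio
    pct=$(dirty_percent "$1")
    ratio=$(cat "$2")
    echo "dirty=${pct}% dirty_ratio=${ratio}%"
}
```

Called as `dirty_report /proc/meminfo /proc/sys/vm/dirty_ratio` once a minute from about 03:00, with the output appended to a file, this should show whether Dirty climbs toward the ratio before the lockup.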
Re: [CentOS] Server locking up everyday around 3:30 AM
On Fri, Mar 11, 2011 at 11:05 AM, Steve Thompson wrote:
> Check the number of dirty pages:
>
> grep Dirty /proc/meminfo
>
> relative to the dirty_ratio setting:
>
> cat /proc/sys/vm/dirty_ratio
>
> to see if the system is going into synchronous flush mode around that time
> (especially if dirty_ratio is large and you have a lot of physical
> memory). This is what I usually see as the cause of the "blocked for more
> than" message. I've also found that it can be several minutes, and up to
> 20 minutes, before the system recovers (but recover it always does).
>
> -Steve

Great replies from everyone, I really appreciate the feedback.

Interesting entries in /var/log/cron:

-snip-
(this runs 24/7 every 5 minutes as normal...)

Mar 11 02:20:01 web1 crond[12919]: (webuser) CMD (wget -q www.domain.com/cron.php >/dev/null 2>&1)
Mar 11 02:25:01 web1 crond[12950]: (webuser) CMD (wget -q www.domain.com/cron.php >/dev/null 2>&1)
Mar 11 02:30:01 web1 crond[12969]: (webuser) CMD (wget -q www.domain.com/cron.php >/dev/null 2>&1)
Mar 11 02:35:01 web1 crond[12992]: (webuser) CMD (wget -q www.domain.com/cron.php >/dev/null 2>&1)
Mar 11 02:40:01 web1 crond[13014]: (webuser) CMD (wget -q www.domain.com/cron.php >/dev/null 2>&1)
Mar 11 02:45:01 web1 crond[13218]: (webuser) CMD (wget -q www.domain.com/cron.php >/dev/null 2>&1)

-snip-
(fast forward to 3 AM: the same cron job starts getting delayed, and by 3:27 the server was non-responsive. Never seen this before.)

Mar 11 03:01:01 web1 crond[13613]: (root) CMD (run-parts /etc/cron.hourly)
Mar 11 03:07:20 web1 crond[13727]: (webuser) error: Job execution of per-minute job scheduled for 03:05 delayed into subsequent minute 03:07. Skipping job run.
Mar 11 03:07:20 web1 crond[13727]: CRON (webuser) ERROR: cannot set security context
Mar 11 03:13:00 web1 crond[13825]: (webuser) error: Job execution of per-minute job scheduled for 03:10 delayed into subsequent minute 03:13. Skipping job run.
Mar 11 03:13:00 web1 crond[13825]: CRON (webuser) ERROR: cannot set security context
Mar 11 03:19:29 web1 crond[13854]: (webuser) error: Job execution of per-minute job scheduled for 03:15 delayed into subsequent minute 03:19. Skipping job run.
Mar 11 03:20:16 web1 crond[13890]: (webuser) CMD (wget -q www.domain.com/cron.php >/dev/null 2>&1)
Mar 11 03:21:01 web1 crond[13854]: CRON (webuser) ERROR: cannot set security context
Mar 11 03:27:41 web1 crond[13912]: (webuser) error: Job execution of per-minute job scheduled for 03:25 delayed into subsequent minute 03:27. Skipping job run.
Mar 11 03:27:42 web1 crond[13912]: CRON (webuser) ERROR: cannot set security context
Mar 11 03:32:05 web1 crond[13930]: (webuser) error: Job execution of per-minute job scheduled for 03:30 delayed into subsequent minute 03:32. Skipping job run.
Mar 11 03:32:05 web1 crond[13930]: CRON (webuser) ERROR: cannot set security context
Mar 11 03:36:23 web1 crond[13948]: (webuser) error: Job execution of per-minute job scheduled for 03:35 delayed into subsequent minute 03:36. Skipping job run.
Mar 11 03:36:23 web1 crond[13948]: CRON (webuser) ERROR: cannot set security context

(rebooted)

Mar 11 03:41:15 web1 crond[4776]: (CRON) STARTUP (V5.0)
-snip-

I don't think it is a coincidence that I'm seeing "CRON (webuser) ERROR: cannot set security context" around the same time the server stops responding.

I'm not familiar with this message; has anyone here seen it?

cron.daily fires off at 4:02, after all this stuff... nothing in cron.hourly...

Getting warmer I think, but still can't figure it out!

--
PJ
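The delayed-job lines above follow a fixed pattern, so the slowdown can be summarised per run rather than read by eye. A minimal sketch, assuming the log keeps exactly the message wording shown above; `cron_delays` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/bash
# cron_delays LOGFILE -> one line per delayed job: date, scheduled time,
# actual time, and how many minutes late crond got around to it.
# Assumes the "scheduled for HH:MM delayed into subsequent minute HH:MM."
# wording seen in the log excerpt; other cron versions may word it differently.
cron_delays() {
    awk '/delayed into subsequent minute/ {
        sched = ""; actual = ""
        for (i = 1; i < NF; i++) {
            if ($i == "for")    sched  = $(i + 1)
            if ($i == "minute") actual = $(i + 1)
        }
        sub(/\.$/, "", actual)          # drop the trailing full stop
        split(sched, s, ":"); split(actual, a, ":")
        late = (a[1] * 60 + a[2]) - (s[1] * 60 + s[2])
        printf "%s %s scheduled=%s ran=%s late=%dmin\n", $1, $2, sched, actual, late
    }' "$1"
}
```

Run as `cron_delays /var/log/cron`, this would make it easy to see on which nights the delays begin and whether they always start shortly after 03:00.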
Re: [CentOS] Server locking up everyday around 3:30 AM
PJ wrote:
> Interesting entries in /var/log/cron:
>
> Mar 11 03:01:01 web1 crond[13613]: (root) CMD (run-parts /etc/cron.hourly)
> Mar 11 03:07:20 web1 crond[13727]: (webuser) error: Job execution of per-minute job scheduled for 03:05 delayed into subsequent minute 03:07. Skipping job run.
> Mar 11 03:07:20 web1 crond[13727]: CRON (webuser) ERROR: cannot set security context

SELINUX! Look at /var/log/messages for an selinux error: if you don't have sealert, install the package, then use it on /var/log/audit.

Or put selinux in permissive mode.

mark
Re: [CentOS] Server locking up everyday around 3:30 AM
On Fri, Mar 11, 2011 at 2:33 PM, mark wrote:
> SELINUX! Look at /var/log/messages for an selinux error: if you don't have
> sealert, install the package, then use it on /var/log/audit.
>
> Or put selinux in permissive mode.

I thought about that :)

[root@web1 ~]# sestatus
SELinux status: disabled

It's already disabled and always has been.

Thanks,
--
PJ
Re: [CentOS] Server locking up everyday around 3:30 AM
PJ wrote:
> I thought about that :)
>
> [root@web1 ~]# sestatus
> SELinux status: disabled
>
> It's already disabled and always has been.

ACLs?

mark
Re: [CentOS] Server locking up everyday around 3:30 AM
Here's a thought: see what directory the wget's trying to put the file in.

mark
Re: [CentOS] Server locking up everyday around 3:30 AM
On 03/11/2011 04:06 PM, PJ wrote:
> Interesting entries in /var/log/cron:
>
> -snip-
> (this runs 24/7 every 5 minutes as normal...)
>
> Mar 11 02:20:01 web1 crond[12919]: (webuser) CMD (wget -q www.domain.com/cron.php >/dev/null 2>&1)
>
> (fast forward to 3 AM, the same cron job starts getting delayed by
> 3:27 the server was non responsive. Never seen this before)
>
> Mar 11 03:01:01 web1 crond[13613]: (root) CMD (run-parts /etc/cron.hourly)
> Mar 11 03:07:20 web1 crond[13727]: (webuser) error: Job execution of per-minute job scheduled for 03:05 delayed into subsequent minute 03:07. Skipping job run.
> Mar 11 03:07:20 web1 crond[13727]: CRON (webuser) ERROR: cannot set security context
>
> I don't think it is a coincidence I'm seeing "CRON (webuser) ERROR:
> cannot set security context" around the same time the server stops
> responding.
>
> I'm not familiar with this message, anyone here seen it?
>
> cron daily fires off at 4:02, after all this stuff...
>
> nothing in cron.hourly..
>
> Getting warmer I think, but still cant figure it out!

OK, did the webuser job run at 03:00:01, or does the trouble start at 03:05:01? The system likely thinks that cron job is still running from the last time it was initiated (be it at 03:00:01 or 02:55:01).

Is there anything that the php file called by the cron (via wget) is supposed to do at 03:00 that is different from other times?

You might try something like this:

*/5 0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 * * *
15,30,45 3 * * *

(those entries should run the cron normally except starting at 03:00, where it should kick off at 03:15 instead of 03:05)

If it also fails to start at 03:15, that would suggest that something is happening to the cron job the last time it is run to make it hang (or make the system think it is hung).
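To double-check which slots the two suggested crontab lines actually cover, the schedule can be mirrored in a few lines of shell. This is only an illustration of the time fields quoted above; `would_run` is a hypothetical helper and does not read any crontab.

```shell
#!/bin/bash
# would_run HOUR MINUTE -> exit status 0 if the two-line schedule would fire.
# Line 1: every 5 minutes in hours 0-2 and 4-23.
# Line 2: minutes 15, 30 and 45 in hour 3.
would_run() {
    local hour=$((10#$1)) min=$((10#$2))   # 10# avoids octal parsing of 08/09
    if [ "$hour" -ne 3 ]; then
        [ $((min % 5)) -eq 0 ]
    else
        case "$min" in
            15|30|45) return 0 ;;
            *)        return 1 ;;
        esac
    fi
}
```

Looping over the twelve five-minute slots of hour 3 with `would_run 03 "$m" && echo "03:$m"` leaves only 03:15, 03:30 and 03:45, so 03:00 is skipped along with 03:05 and 03:10.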
Re: [CentOS] Server locking up everyday around 3:30 AM
> (those commands should run the cron normally except starting at 03:00,
> where it should kick off at 3:15 instead of 03:05)
>
> If it also fails to start at 03:15 then that would suggest that
> something is happening to the cron job the last time it is run to make
> it hang (or make the system think it is hung).

I wonder if the other end has some file(s) locked for writing, or something else going on, that is choking the local wget.
Re: [CentOS] Server locking up everyday around 3:30 AM
On 3/14/2011 3:24 PM, Scott Silva wrote:
> I wonder if the other end has some file(s) locked for writing or other reason
> and that is choking the local wget.

But nothing the other end does should cause a kernel task hang. Shouldn't that only be inside a device driver?

--
Les Mikesell
lesmikes...@gmail.com
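One way to see where a task is stuck before the box stops responding is to list processes in uninterruptible sleep (state D, as in the "wget D" line of the trace) together with their kernel wait channel. A small sketch; `blocked_tasks` is a made-up name, and it assumes the four-column `ps` layout shown in the comment.

```shell
#!/bin/bash
# blocked_tasks: filter `ps -eo pid,stat,wchan:32,comm` output on stdin and
# print only tasks whose state starts with D (uninterruptible sleep).
# Output columns: PID, command name, kernel function the task is waiting in.
blocked_tasks() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $4, "waiting in", $3 }'
}
```

Usage would be `ps -eo pid,stat,wchan:32,comm | blocked_tasks`, sampled every few seconds from just after 03:00 and written to a file that survives the reboot.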
Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
On Fri, Mar 11, 2011 at 12:33 PM, PJ wrote:
> This may or may not be CentOS related, but am out of ideas at this
> point and wanted to bounce this off the list.
>
> I'm running a CentOS 5.5 server, running the latest kernel
> 2.6.18-194.32.1.el5.
>
> Almost everyday around 3:30 AM the server completely locks up and has
> to be power cycled before it will come back online.
> (this means someone has to wake up and reboot the server, oh how I
> love being an internet janitor! :)
>
> Smells like a hardware issue to me too, but I went through all of the
> dell diagnostics, updated the firmware, everything checks out as being
> okay, RAID, disks, RAM, etc... Spent an hour on the phone with a Dell
> tech. No hardware issues, at least that we were able to find.
>
> There are no cron jobs that run at 3:30, no backups, the server has a
> load of 0, nothing is scheduled around that time...
>
> The only crontab entry at all is "*/5 * * * * wget -q
> www.websitedomain.com/cron.php >/dev/null 2>&1"
> They are running Magento for commerce purposes and this runs every 5 minutes.
>
> Why does the server only lock up around 3:30 AM? Because it knows I
> am fast asleep?
>
> I was able to pull this from /var/log/messages, this happens just
> seconds before locking up completely...
>
> Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.
> Mar 8 03:33:19 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 8 03:33:19 web1 kernel: wget D 810001004420 0 13608 13607 (NOTLB)
> Mar 8 03:33:19 web1 kernel: 81007bc7bc78 0086 81007bc7bd88 81000100d3f8
> Mar 8 03:33:19 web1 kernel: 81007bc7bbf0 0007 8100849db0c0 80308b60
> Mar 8 03:33:19 web1 kernel: 00013a2964cdf439 3237 8100849db2a8 64c82eae
> Mar 8 03:33:19 web1 kernel: Call Trace:
> Mar 8 03:33:20 web1 kernel: [] __mutex_lock_slowpath+0x60/0x9b
> Mar 8 03:33:20 web1 kernel: [] .text.lock.mutex+0xf/0x14
> Mar 8 03:33:20 web1 kernel: [] do_lookup+0x90/0x1e6
> Mar 8 03:33:20 web1 kernel: [] __link_path_walk+0xa01/0xf5b
> Mar 8 03:33:20 web1 kernel: [] link_path_walk+0x42/0xb2
> Mar 8 03:33:20 web1 kernel: [] do_path_lookup+0x275/0x2f1
> Mar 8 03:33:23 web1 kernel: [] getname+0x15b/0x1c2
> Mar 8 03:33:23 web1 kernel: [] __user_walk_fd+0x37/0x4c
> Mar 8 03:33:23 web1 kernel: [] vfs_stat_fd+0x1b/0x4a
> Mar 8 03:33:23 web1 kernel: [] sys_newstat+0x19/0x31
> Mar 8 03:33:23 web1 kernel: [] system_call+0x7e/0x83
>
> If anyone has some advice on where to go from here it would be greatly
> appreciated.
>
> Thanks in advance.
>
> --
> PJF

Have you tried disabling the cron job you think is at fault to see if the lockup goes away?

Also, have you checked all the users' crontabs?

Boris.
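Checking every user's crontab can be done in one pass as root. The sketch below is an assumption about how one might do it, not something posted in the thread; `list_users` and `all_crontabs` are hypothetical names, and it relies only on `crontab -u USER -l`.

```shell
#!/bin/bash
# list_users PASSWD_FILE -> one login name per line (first colon-separated field)
list_users() {
    cut -d: -f1 "$1"
}

# all_crontabs PASSWD_FILE -> print each user's crontab under a header line.
# Needs root, since only root may read other users' crontabs.
all_crontabs() {
    local user entries
    for user in $(list_users "$1"); do
        # crontab -l exits non-zero when the user has no crontab; skip those
        if entries=$(crontab -u "$user" -l 2>/dev/null); then
            printf '### %s\n%s\n' "$user" "$entries"
        fi
    done
}
```

`all_crontabs /etc/passwd` only covers per-user crontabs; /etc/crontab and the files under /etc/cron.d are separate and still need a look.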
Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
> -----Original Message-----
> From: centos-boun...@centos.org [mailto:centos-boun...@centos.org] On Behalf Of PJ
> Sent: Friday, March 11, 2011 12:34
> To: centos@centos.org
> Subject: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
>
> There are no cron jobs that run at 3:30, no backups, the server has a
> load of 0, nothing is scheduled around that time...

Are you sure the stuff in /etc/cron.daily/ is done by then, or not started yet? Could be something like the mlocate or makewhatis jobs chewing up CPU/memory.

IIRC the stuff in /etc/cron.daily/ runs in alphabetical order, so: are you (root) getting the logwatch messages, and at what time?
Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
On Mar 11, 2011, at 12:33 PM, PJ wrote:
> If anyone has some advice on where to go from here it would be greatly
> appreciated.

Do a fsck of the file system wget is writing to, as there might be a corruption it hits only on the 3:30am run, as that's when the other vendor dumps data to be downloaded.

You could also check to see if a RAID patrol read (scrub/predictive failure detection) is happening around this time as well, and disable/reschedule it.

-Ross
Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
> Almost everyday around 3:30 AM the server completely locks up and has
> to be power cycled before it will come back online.
> (this means someone has to wake up and reboot the server, oh how I
> love being an internet janitor! :)
>
> Smells like a hardware issue to me too, but I went through all of the
> dell diagnostics, updated the firmware, everything checks out as being
> okay, RAID, disks, RAM, etc... Spent an hour on the phone with a Dell
> tech. No hardware issues, at least that we were able to find.
>
> There are no cron jobs that run at 3:30, no backups, the server has a
> load of 0, nothing is scheduled around that time...

Do you have smartd set to run short/long hard disk checks during the night? That is done via /etc/smartd.conf, not via cron.
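Scheduled self-tests in smartd.conf are set with the `-s` directive, so a quick filter shows whether any are configured. A minimal sketch; `smartd_schedules` is a hypothetical helper name.

```shell
#!/bin/bash
# smartd_schedules CONF_FILE -> print active (non-comment) lines that carry
# a -s self-test schedule. The match is case-sensitive, so the unrelated
# -S (autosave) directive is not reported.
smartd_schedules() {
    grep -v '^[[:space:]]*#' "$1" | grep -E -- '(^|[[:space:]])-s[[:space:]]'
}
```

Running `smartd_schedules /etc/smartd.conf` and getting no output would suggest no self-tests are scheduled there. A line such as `/dev/sda -a -s (S/../.././02|L/../../6/03)` would mean a short test daily in the 02:00 hour and a long test on Saturdays in the 03:00 hour.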
Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
Updating the kernel will probably fix your problem.

Best Regards
Sinux

On 2011-3-12, at 10:08, "Ross Walker" wrote:
> Do a fsck of the file system wget is writing to as there might be a
> corruption it hits only on the 3:30am run as that's when the other vendor
> dumps data to be downloaded.
>
> You could also check to see if a RAID patrol read (scrub/predictive failure
> detection) is happening around this time as well and disable/reschedule it.
>
> -Ross