Re: BUG: rcu_sched detected stalls on CPUs
On Mon, May 22, 2017 at 02:56:50PM +0300, mi...@stz-bg.com wrote: > On 2017-05-10 17:32, Paul E. McKenney wrote: > >On Wed, May 10, 2017 at 02:00:32PM +0300, mi...@stz-bg.com wrote: > >>On 2017-05-09 17:49, Paul E. McKenney wrote: > >>>On Tue, May 09, 2017 at 10:53:25AM +0300, mi...@stz-bg.com wrote: > On 2017-05-08 16:24, Paul E. McKenney wrote: > >On Mon, May 08, 2017 at 09:43:15AM +0300, mi...@stz-bg.com wrote: > >>On 2017-01-20 17:19, Steven Rostedt wrote: > >>>On Fri, 20 Jan 2017 10:43:50 +0200 > >>>mi...@stz-bg.com wrote: > >>> > [1.] One line summary of the problem: > > rcu_sched detected stalls on CPUs and few minutes server not respond. > >>> > >>>Is this reproducible? Or was this a one time ordeal? > >>> > > [2.] Full description of the problem/report: > > Load of my server (postgres database) isnt big less then 0.50 and when > error occured rcu_sched detected stalls on CPUs > server freeze and nothing is work for 3-5 minute. > No network, no video signal, no keyboard, no mouse. Nothing is worked. > After these few minutes everything continue normal. > This usual is happend once per day. When I check in google find a lots > of ppl complain of this error, but no solution. > Do any one know can help me to resolve it ? I spoke with few > friends and > they trying to convince me the problem is in CPU. > I did not believe after a 3 years working CPU suddenly stop working > correctly, but I might be wrong. > > [3.] Keywords (i.e., modules, networking, kernel): > > kernel > > [4.] Kernel information > [4.1.] Kernel version (from /proc/version): > > Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 > SMP Sun > Dec 11 16:11:02 CST 2016 > > >>> > >>>Have you tried a newer version of the kernel? > >>> > >>>-- Steve > >> > >> > >>Hello, > >>yesterday I change to new kernel: 4.9.26 and still no effect. > >> > >>I trying to figure out what I need to buy because I read on google a > >>lots of posts about that problem, > >>some ppl suggest is BIOS firmware bug, some ppl tell that when they > >>swap CPU problem is resolved. > >>May be problem was started when I first time boot 4.x kernel and > >>there have cpu microcode updates. > >>Im 3.x kernels this feature was not in kernel, but I don't know, > >>only guess. > >> > >>Can some one point me clearly: You need to change this one and > >>problem will be solved ? > > > >Looking back at your earlier email, I see the following: > > > >[Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 > >jiffies! g61967937 c61967936 f0x0 s3 ->state=0x1 > > > >That indicates that the task "rcu_sched" isn't being allowed to run. > >The "->state=0x1" indicates that this task is currently asleep and the > >"s3" indicates that it is looking to start a grace period, but has not > >progressed through this task. My guess is that your system get very > >busy, and that rcu_sched's normal scheduling priority is not sufficient > >for it to be allowed to run. > > > >I therefore suggest that you try raising the rcu_sched task's priority. > > > >You can find this task like this: > > > > ps -ef | grep rcu_sched > > > >On my laptop, I get this: > > > > $ ps -ef | grep rcu_sched > > root 7 2 0 Apr11 ?00:19:19 [rcu_sched] > > paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto > >rcu_sche > > > >You can use the chrt command to set the priority. For example, > >given that > >on my laptop, rcu_sched is PID 7: > > > > sudo chrt -f -p 1 7 > > > >You can double-check as follows: > > > > $ chrt -p 7 > > pid 7's current scheduling policy: SCHED_FIFO > > pid 7's current scheduling priority: 1 > > > >Does that help? > > > > Thanx, Paul > > > > > Hi, > interesting ... I change priority as you suggest and will keep you > in touch > if the problem happens again because I cant reproduce it. Replay on > your next email, I don't have any > virtual servers / guest os on the server. It's server with only one > process - redis server > and in this time when error it's happened actually one of other > servers start loading data > into redis sever. Data is small 3-4 mil. keys every key contain > around 150 symbols. > I have that problem on two servers, next server is database server > with postgresql > and that problem happened more often, usual once per day or for few > days. > I never play before with changing prior
Re: BUG: rcu_sched detected stalls on CPUs
On 2017-05-10 17:32, Paul E. McKenney wrote: On Wed, May 10, 2017 at 02:00:32PM +0300, mi...@stz-bg.com wrote: On 2017-05-09 17:49, Paul E. McKenney wrote: >On Tue, May 09, 2017 at 10:53:25AM +0300, mi...@stz-bg.com wrote: >>On 2017-05-08 16:24, Paul E. McKenney wrote: >>>On Mon, May 08, 2017 at 09:43:15AM +0300, mi...@stz-bg.com wrote: On 2017-01-20 17:19, Steven Rostedt wrote: >On Fri, 20 Jan 2017 10:43:50 +0200 >mi...@stz-bg.com wrote: > >>[1.] One line summary of the problem: >> >>rcu_sched detected stalls on CPUs and few minutes server not respond. > >Is this reproducible? Or was this a one time ordeal? > >> >>[2.] Full description of the problem/report: >> >>Load of my server (postgres database) isnt big less then 0.50 and when >>error occured rcu_sched detected stalls on CPUs >>server freeze and nothing is work for 3-5 minute. >>No network, no video signal, no keyboard, no mouse. Nothing is worked. >>After these few minutes everything continue normal. >>This usual is happend once per day. When I check in google find a lots >>of ppl complain of this error, but no solution. >>Do any one know can help me to resolve it ? I spoke with few >>friends and >>they trying to convince me the problem is in CPU. >>I did not believe after a 3 years working CPU suddenly stop working >>correctly, but I might be wrong. >> >>[3.] Keywords (i.e., modules, networking, kernel): >> >>kernel >> >>[4.] Kernel information >>[4.1.] Kernel version (from /proc/version): >> >>Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 >>SMP Sun >>Dec 11 16:11:02 CST 2016 >> > >Have you tried a newer version of the kernel? > >-- Steve Hello, yesterday I change to new kernel: 4.9.26 and still no effect. I trying to figure out what I need to buy because I read on google a lots of posts about that problem, some ppl suggest is BIOS firmware bug, some ppl tell that when they swap CPU problem is resolved. May be problem was started when I first time boot 4.x kernel and there have cpu microcode updates. Im 3.x kernels this feature was not in kernel, but I don't know, only guess. Can some one point me clearly: You need to change this one and problem will be solved ? >>> >>>Looking back at your earlier email, I see the following: >>> >>>[Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 >>>jiffies! g61967937 c61967936 f0x0 s3 ->state=0x1 >>> >>>That indicates that the task "rcu_sched" isn't being allowed to run. >>>The "->state=0x1" indicates that this task is currently asleep and the >>>"s3" indicates that it is looking to start a grace period, but has not >>>progressed through this task. My guess is that your system get very >>>busy, and that rcu_sched's normal scheduling priority is not sufficient >>>for it to be allowed to run. >>> >>>I therefore suggest that you try raising the rcu_sched task's priority. >>> >>>You can find this task like this: >>> >>>ps -ef | grep rcu_sched >>> >>>On my laptop, I get this: >>> >>>$ ps -ef | grep rcu_sched >>>root 7 2 0 Apr11 ?00:19:19 [rcu_sched] >>>paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto >>>rcu_sche >>> >>>You can use the chrt command to set the priority. For example, >>>given that >>>on my laptop, rcu_sched is PID 7: >>> >>>sudo chrt -f -p 1 7 >>> >>>You can double-check as follows: >>> >>>$ chrt -p 7 >>>pid 7's current scheduling policy: SCHED_FIFO >>>pid 7's current scheduling priority: 1 >>> >>>Does that help? >>> >>>Thanx, Paul >>> >> >> >>Hi, >>interesting ... I change priority as you suggest and will keep you >>in touch >>if the problem happens again because I cant reproduce it. Replay on >>your next email, I don't have any >>virtual servers / guest os on the server. It's server with only one >>process - redis server >>and in this time when error it's happened actually one of other >>servers start loading data >>into redis sever. Data is small 3-4 mil. keys every key contain >>around 150 symbols. >>I have that problem on two servers, next server is database server >>with postgresql >>and that problem happened more often, usual once per day or for few >>days. >>I never play before with changing priority of processes. When I check >>iLO log I saw logged error: power lost but server does not lost the >>power actually, >>it's just stop respond for 2-3 minutes and then continue normal >>operation with >>uptime more then 100 days. > >Does redis keep any useful log? If so, it would be worth looking into >what it thought that it was doing. No, only standard logs: 1333:M 06 May 04:36:43.019 * 1 changes in 60 seconds. Saving... 1333:M 06 May 04:36:43.025 * Background saving started by pid 4325 4325:C 06 May 04:36:4
Re: BUG: rcu_sched detected stalls on CPUs
On Wed, May 10, 2017 at 02:00:32PM +0300, mi...@stz-bg.com wrote: > On 2017-05-09 17:49, Paul E. McKenney wrote: > >On Tue, May 09, 2017 at 10:53:25AM +0300, mi...@stz-bg.com wrote: > >>On 2017-05-08 16:24, Paul E. McKenney wrote: > >>>On Mon, May 08, 2017 at 09:43:15AM +0300, mi...@stz-bg.com wrote: > On 2017-01-20 17:19, Steven Rostedt wrote: > >On Fri, 20 Jan 2017 10:43:50 +0200 > >mi...@stz-bg.com wrote: > > > >>[1.] One line summary of the problem: > >> > >>rcu_sched detected stalls on CPUs and few minutes server not respond. > > > >Is this reproducible? Or was this a one time ordeal? > > > >> > >>[2.] Full description of the problem/report: > >> > >>Load of my server (postgres database) isnt big less then 0.50 and when > >>error occured rcu_sched detected stalls on CPUs > >>server freeze and nothing is work for 3-5 minute. > >>No network, no video signal, no keyboard, no mouse. Nothing is worked. > >>After these few minutes everything continue normal. > >>This usual is happend once per day. When I check in google find a lots > >>of ppl complain of this error, but no solution. > >>Do any one know can help me to resolve it ? I spoke with few > >>friends and > >>they trying to convince me the problem is in CPU. > >>I did not believe after a 3 years working CPU suddenly stop working > >>correctly, but I might be wrong. > >> > >>[3.] Keywords (i.e., modules, networking, kernel): > >> > >>kernel > >> > >>[4.] Kernel information > >>[4.1.] Kernel version (from /proc/version): > >> > >>Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 > >>SMP Sun > >>Dec 11 16:11:02 CST 2016 > >> > > > >Have you tried a newer version of the kernel? > > > >-- Steve > > > Hello, > yesterday I change to new kernel: 4.9.26 and still no effect. > > I trying to figure out what I need to buy because I read on google a > lots of posts about that problem, > some ppl suggest is BIOS firmware bug, some ppl tell that when they > swap CPU problem is resolved. > May be problem was started when I first time boot 4.x kernel and > there have cpu microcode updates. > Im 3.x kernels this feature was not in kernel, but I don't know, > only guess. > > Can some one point me clearly: You need to change this one and > problem will be solved ? > >>> > >>>Looking back at your earlier email, I see the following: > >>> > >>>[Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 > >>>jiffies! g61967937 c61967936 f0x0 s3 ->state=0x1 > >>> > >>>That indicates that the task "rcu_sched" isn't being allowed to run. > >>>The "->state=0x1" indicates that this task is currently asleep and the > >>>"s3" indicates that it is looking to start a grace period, but has not > >>>progressed through this task. My guess is that your system get very > >>>busy, and that rcu_sched's normal scheduling priority is not sufficient > >>>for it to be allowed to run. > >>> > >>>I therefore suggest that you try raising the rcu_sched task's priority. > >>> > >>>You can find this task like this: > >>> > >>> ps -ef | grep rcu_sched > >>> > >>>On my laptop, I get this: > >>> > >>> $ ps -ef | grep rcu_sched > >>> root 7 2 0 Apr11 ?00:19:19 [rcu_sched] > >>> paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto > >>>rcu_sche > >>> > >>>You can use the chrt command to set the priority. For example, > >>>given that > >>>on my laptop, rcu_sched is PID 7: > >>> > >>> sudo chrt -f -p 1 7 > >>> > >>>You can double-check as follows: > >>> > >>> $ chrt -p 7 > >>> pid 7's current scheduling policy: SCHED_FIFO > >>> pid 7's current scheduling priority: 1 > >>> > >>>Does that help? > >>> > >>> Thanx, Paul > >>> > >> > >> > >>Hi, > >>interesting ... I change priority as you suggest and will keep you > >>in touch > >>if the problem happens again because I cant reproduce it. Replay on > >>your next email, I don't have any > >>virtual servers / guest os on the server. It's server with only one > >>process - redis server > >>and in this time when error it's happened actually one of other > >>servers start loading data > >>into redis sever. Data is small 3-4 mil. keys every key contain > >>around 150 symbols. > >>I have that problem on two servers, next server is database server > >>with postgresql > >>and that problem happened more often, usual once per day or for few > >>days. > >>I never play before with changing priority of processes. When I check > >>iLO log I saw logged error: power lost but server does not lost the > >>power actually, > >>it's just stop respond for 2-3 minutes and then continue normal > >>operation with > >>uptime more then 100 days. > > > >Does redis keep any useful log? If so, it would be worth looking into > >what it thought th
Re: BUG: rcu_sched detected stalls on CPUs
On 2017-05-09 17:49, Paul E. McKenney wrote: On Tue, May 09, 2017 at 10:53:25AM +0300, mi...@stz-bg.com wrote: On 2017-05-08 16:24, Paul E. McKenney wrote: >On Mon, May 08, 2017 at 09:43:15AM +0300, mi...@stz-bg.com wrote: >>On 2017-01-20 17:19, Steven Rostedt wrote: >>>On Fri, 20 Jan 2017 10:43:50 +0200 >>>mi...@stz-bg.com wrote: >>> [1.] One line summary of the problem: rcu_sched detected stalls on CPUs and few minutes server not respond. >>> >>>Is this reproducible? Or was this a one time ordeal? >>> [2.] Full description of the problem/report: Load of my server (postgres database) isnt big less then 0.50 and when error occured rcu_sched detected stalls on CPUs server freeze and nothing is work for 3-5 minute. No network, no video signal, no keyboard, no mouse. Nothing is worked. After these few minutes everything continue normal. This usual is happend once per day. When I check in google find a lots of ppl complain of this error, but no solution. Do any one know can help me to resolve it ? I spoke with few friends and they trying to convince me the problem is in CPU. I did not believe after a 3 years working CPU suddenly stop working correctly, but I might be wrong. [3.] Keywords (i.e., modules, networking, kernel): kernel [4.] Kernel information [4.1.] Kernel version (from /proc/version): Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 SMP Sun Dec 11 16:11:02 CST 2016 >>> >>>Have you tried a newer version of the kernel? >>> >>>-- Steve >> >> >>Hello, >>yesterday I change to new kernel: 4.9.26 and still no effect. >> >>I trying to figure out what I need to buy because I read on google a >>lots of posts about that problem, >>some ppl suggest is BIOS firmware bug, some ppl tell that when they >>swap CPU problem is resolved. >>May be problem was started when I first time boot 4.x kernel and >>there have cpu microcode updates. >>Im 3.x kernels this feature was not in kernel, but I don't know, >>only guess. >> >>Can some one point me clearly: You need to change this one and >>problem will be solved ? > >Looking back at your earlier email, I see the following: > >[Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 >jiffies! g61967937 c61967936 f0x0 s3 ->state=0x1 > >That indicates that the task "rcu_sched" isn't being allowed to run. >The "->state=0x1" indicates that this task is currently asleep and the >"s3" indicates that it is looking to start a grace period, but has not >progressed through this task. My guess is that your system get very >busy, and that rcu_sched's normal scheduling priority is not sufficient >for it to be allowed to run. > >I therefore suggest that you try raising the rcu_sched task's priority. > >You can find this task like this: > >ps -ef | grep rcu_sched > >On my laptop, I get this: > >$ ps -ef | grep rcu_sched >root 7 2 0 Apr11 ?00:19:19 [rcu_sched] >paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto >rcu_sche > >You can use the chrt command to set the priority. For example, >given that >on my laptop, rcu_sched is PID 7: > >sudo chrt -f -p 1 7 > >You can double-check as follows: > >$ chrt -p 7 >pid 7's current scheduling policy: SCHED_FIFO >pid 7's current scheduling priority: 1 > >Does that help? > >Thanx, Paul > Hi, interesting ... I change priority as you suggest and will keep you in touch if the problem happens again because I cant reproduce it. Replay on your next email, I don't have any virtual servers / guest os on the server. It's server with only one process - redis server and in this time when error it's happened actually one of other servers start loading data into redis sever. Data is small 3-4 mil. keys every key contain around 150 symbols. I have that problem on two servers, next server is database server with postgresql and that problem happened more often, usual once per day or for few days. I never play before with changing priority of processes. When I check iLO log I saw logged error: power lost but server does not lost the power actually, it's just stop respond for 2-3 minutes and then continue normal operation with uptime more then 100 days. Does redis keep any useful log? If so, it would be worth looking into what it thought that it was doing. No, only standard logs: 1333:M 06 May 04:36:43.019 * 1 changes in 60 seconds. Saving... 1333:M 06 May 04:36:43.025 * Background saving started by pid 4325 4325:C 06 May 04:36:46.745 * DB saved on disk 4325:C 06 May 04:36:46.749 * RDB: 40 MB of memory used by copy-on-write 1333:M 06 May 04:36:46.825 * Background saving terminated with success 1333:M 06 May 04:37:47.018 * 1 changes in 60 seconds. Saving... 1333:M 06 May 04:37:47.026 * Background saving started by pid 4326 4326:C 06 May 04:37:50.902 * DB saved on disk 4326:C 06 Ma
Re: BUG: rcu_sched detected stalls on CPUs
On Tue, May 09, 2017 at 10:53:25AM +0300, mi...@stz-bg.com wrote: > On 2017-05-08 16:24, Paul E. McKenney wrote: > >On Mon, May 08, 2017 at 09:43:15AM +0300, mi...@stz-bg.com wrote: > >>On 2017-01-20 17:19, Steven Rostedt wrote: > >>>On Fri, 20 Jan 2017 10:43:50 +0200 > >>>mi...@stz-bg.com wrote: > >>> > [1.] One line summary of the problem: > > rcu_sched detected stalls on CPUs and few minutes server not respond. > >>> > >>>Is this reproducible? Or was this a one time ordeal? > >>> > > [2.] Full description of the problem/report: > > Load of my server (postgres database) isnt big less then 0.50 and when > error occured rcu_sched detected stalls on CPUs > server freeze and nothing is work for 3-5 minute. > No network, no video signal, no keyboard, no mouse. Nothing is worked. > After these few minutes everything continue normal. > This usual is happend once per day. When I check in google find a lots > of ppl complain of this error, but no solution. > Do any one know can help me to resolve it ? I spoke with few > friends and > they trying to convince me the problem is in CPU. > I did not believe after a 3 years working CPU suddenly stop working > correctly, but I might be wrong. > > [3.] Keywords (i.e., modules, networking, kernel): > > kernel > > [4.] Kernel information > [4.1.] Kernel version (from /proc/version): > > Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 > SMP Sun > Dec 11 16:11:02 CST 2016 > > >>> > >>>Have you tried a newer version of the kernel? > >>> > >>>-- Steve > >> > >> > >>Hello, > >>yesterday I change to new kernel: 4.9.26 and still no effect. > >> > >>I trying to figure out what I need to buy because I read on google a > >>lots of posts about that problem, > >>some ppl suggest is BIOS firmware bug, some ppl tell that when they > >>swap CPU problem is resolved. > >>May be problem was started when I first time boot 4.x kernel and > >>there have cpu microcode updates. > >>Im 3.x kernels this feature was not in kernel, but I don't know, > >>only guess. > >> > >>Can some one point me clearly: You need to change this one and > >>problem will be solved ? > > > >Looking back at your earlier email, I see the following: > > > >[Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 > >jiffies! g61967937 c61967936 f0x0 s3 ->state=0x1 > > > >That indicates that the task "rcu_sched" isn't being allowed to run. > >The "->state=0x1" indicates that this task is currently asleep and the > >"s3" indicates that it is looking to start a grace period, but has not > >progressed through this task. My guess is that your system get very > >busy, and that rcu_sched's normal scheduling priority is not sufficient > >for it to be allowed to run. > > > >I therefore suggest that you try raising the rcu_sched task's priority. > > > >You can find this task like this: > > > > ps -ef | grep rcu_sched > > > >On my laptop, I get this: > > > > $ ps -ef | grep rcu_sched > > root 7 2 0 Apr11 ?00:19:19 [rcu_sched] > > paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto > >rcu_sche > > > >You can use the chrt command to set the priority. For example, > >given that > >on my laptop, rcu_sched is PID 7: > > > > sudo chrt -f -p 1 7 > > > >You can double-check as follows: > > > > $ chrt -p 7 > > pid 7's current scheduling policy: SCHED_FIFO > > pid 7's current scheduling priority: 1 > > > >Does that help? > > > > Thanx, Paul > > > > > Hi, > interesting ... I change priority as you suggest and will keep you > in touch > if the problem happens again because I cant reproduce it. Replay on > your next email, I don't have any > virtual servers / guest os on the server. It's server with only one > process - redis server > and in this time when error it's happened actually one of other > servers start loading data > into redis sever. Data is small 3-4 mil. keys every key contain > around 150 symbols. > I have that problem on two servers, next server is database server > with postgresql > and that problem happened more often, usual once per day or for few > days. > I never play before with changing priority of processes. When I check > iLO log I saw logged error: power lost but server does not lost the > power actually, > it's just stop respond for 2-3 minutes and then continue normal > operation with > uptime more then 100 days. Does redis keep any useful log? If so, it would be worth looking into what it thought that it was doing. Thanx, Paul > Thanks for your suggestion. > Regards, > Mitko > > > > > >>[239940.067938] clocksource: timekeeping watchdog on CPU26: Marking > >>clocksource 'tsc' as unstable because the skew is too large: > >>[239940.067943] clocksource: 'hpet' wd_now:
Re: BUG: rcu_sched detected stalls on CPUs
On 2017-05-08 16:24, Paul E. McKenney wrote: On Mon, May 08, 2017 at 09:43:15AM +0300, mi...@stz-bg.com wrote: On 2017-01-20 17:19, Steven Rostedt wrote: >On Fri, 20 Jan 2017 10:43:50 +0200 >mi...@stz-bg.com wrote: > >>[1.] One line summary of the problem: >> >>rcu_sched detected stalls on CPUs and few minutes server not respond. > >Is this reproducible? Or was this a one time ordeal? > >> >>[2.] Full description of the problem/report: >> >>Load of my server (postgres database) isnt big less then 0.50 and when >>error occured rcu_sched detected stalls on CPUs >>server freeze and nothing is work for 3-5 minute. >>No network, no video signal, no keyboard, no mouse. Nothing is worked. >>After these few minutes everything continue normal. >>This usual is happend once per day. When I check in google find a lots >>of ppl complain of this error, but no solution. >>Do any one know can help me to resolve it ? I spoke with few >>friends and >>they trying to convince me the problem is in CPU. >>I did not believe after a 3 years working CPU suddenly stop working >>correctly, but I might be wrong. >> >>[3.] Keywords (i.e., modules, networking, kernel): >> >>kernel >> >>[4.] Kernel information >>[4.1.] Kernel version (from /proc/version): >> >>Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 >>SMP Sun >>Dec 11 16:11:02 CST 2016 >> > >Have you tried a newer version of the kernel? > >-- Steve Hello, yesterday I change to new kernel: 4.9.26 and still no effect. I trying to figure out what I need to buy because I read on google a lots of posts about that problem, some ppl suggest is BIOS firmware bug, some ppl tell that when they swap CPU problem is resolved. May be problem was started when I first time boot 4.x kernel and there have cpu microcode updates. Im 3.x kernels this feature was not in kernel, but I don't know, only guess. Can some one point me clearly: You need to change this one and problem will be solved ? Looking back at your earlier email, I see the following: [Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 jiffies! g61967937 c61967936 f0x0 s3 ->state=0x1 That indicates that the task "rcu_sched" isn't being allowed to run. The "->state=0x1" indicates that this task is currently asleep and the "s3" indicates that it is looking to start a grace period, but has not progressed through this task. My guess is that your system get very busy, and that rcu_sched's normal scheduling priority is not sufficient for it to be allowed to run. I therefore suggest that you try raising the rcu_sched task's priority. You can find this task like this: ps -ef | grep rcu_sched On my laptop, I get this: $ ps -ef | grep rcu_sched root 7 2 0 Apr11 ?00:19:19 [rcu_sched] paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto rcu_sche You can use the chrt command to set the priority. For example, given that on my laptop, rcu_sched is PID 7: sudo chrt -f -p 1 7 You can double-check as follows: $ chrt -p 7 pid 7's current scheduling policy: SCHED_FIFO pid 7's current scheduling priority: 1 Does that help? Thanx, Paul Hi, interesting ... I change priority as you suggest and will keep you in touch if the problem happens again because I cant reproduce it. Replay on your next email, I don't have any virtual servers / guest os on the server. It's server with only one process - redis server and in this time when error it's happened actually one of other servers start loading data into redis sever. Data is small 3-4 mil. keys every key contain around 150 symbols. I have that problem on two servers, next server is database server with postgresql and that problem happened more often, usual once per day or for few days. I never play before with changing priority of processes. When I check iLO log I saw logged error: power lost but server does not lost the power actually, it's just stop respond for 2-3 minutes and then continue normal operation with uptime more then 100 days. Thanks for your suggestion. Regards, Mitko [239940.067938] clocksource: timekeeping watchdog on CPU26: Marking clocksource 'tsc' as unstable because the skew is too large: [239940.067943] clocksource: 'hpet' wd_now: ecb521ce wd_last: ca45912d mask: [239940.067946] clocksource: 'tsc' cs_now: 2d12df5f88d08 cs_last: 29fffbe2d36d6 mask: [239940.068357] clocksource: Switched to clocksource hpet [24.066457] INFO: rcu_sched detected stalls on CPUs/tasks: [24.066488] 2-...: (38 GPs behind) idle=f1a/0/0 softirq=2417028/2417028 fqs=0 [24.066491] 3-...: (27 GPs behind) idle=0bc/0/0 softirq=3076046/3076047 fqs=0 [24.066494] 4-...: (1006 GPs behind) idle=308/0/0 softirq=1474922/1474922 fqs=0 [24.066497] 5-...: (8034 GPs behind) idle=7b4/0/0 softirq=69
Re: BUG: rcu_sched detected stalls on CPUs
On Mon, May 08, 2017 at 06:24:23AM -0700, Paul E. McKenney wrote: > On Mon, May 08, 2017 at 09:43:15AM +0300, mi...@stz-bg.com wrote: > > On 2017-01-20 17:19, Steven Rostedt wrote: [ . . . ] > > Hello, > > yesterday I change to new kernel: 4.9.26 and still no effect. > > > > I trying to figure out what I need to buy because I read on google a > > lots of posts about that problem, > > some ppl suggest is BIOS firmware bug, some ppl tell that when they > > swap CPU problem is resolved. > > May be problem was started when I first time boot 4.x kernel and > > there have cpu microcode updates. > > Im 3.x kernels this feature was not in kernel, but I don't know, > > only guess. > > > > Can some one point me clearly: You need to change this one and > > problem will be solved ? > > Looking back at your earlier email, I see the following: > > [Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 jiffies! > g61967937 c61967936 f0x0 s3 ->state=0x1 > > That indicates that the task "rcu_sched" isn't being allowed to run. > The "->state=0x1" indicates that this task is currently asleep and the > "s3" indicates that it is looking to start a grace period, but has not > progressed through this task. My guess is that your system get very > busy, and that rcu_sched's normal scheduling priority is not sufficient > for it to be allowed to run. > > I therefore suggest that you try raising the rcu_sched task's priority. > > You can find this task like this: > > ps -ef | grep rcu_sched > > On my laptop, I get this: > > $ ps -ef | grep rcu_sched > root 7 2 0 Apr11 ?00:19:19 [rcu_sched] > paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto > rcu_sche > > You can use the chrt command to set the priority. For example, given that > on my laptop, rcu_sched is PID 7: > > sudo chrt -f -p 1 7 > > You can double-check as follows: > > $ chrt -p 7 > pid 7's current scheduling policy: SCHED_FIFO > pid 7's current scheduling priority: 1 > > Does that help? And I should add that this assumes that the interference is from within this same OS instance. If you are instead experiencing massive interference due to other guest OSes on this same system, well, we are working on it. Thanx, Paul > > [239940.067938] clocksource: timekeeping watchdog on CPU26: Marking > > clocksource 'tsc' as unstable because the skew is too large: > > [239940.067943] clocksource: 'hpet' wd_now: > > ecb521ce wd_last: ca45912d mask: > > [239940.067946] clocksource: 'tsc' cs_now: > > 2d12df5f88d08 cs_last: 29fffbe2d36d6 mask: > > [239940.068357] clocksource: Switched to clocksource hpet > > [24.066457] INFO: rcu_sched detected stalls on CPUs/tasks: > > [24.066488] 2-...: (38 GPs behind) idle=f1a/0/0 > > softirq=2417028/2417028 fqs=0 > > [24.066491] 3-...: (27 GPs behind) idle=0bc/0/0 > > softirq=3076046/3076047 fqs=0 > > [24.066494] 4-...: (1006 GPs behind) idle=308/0/0 > > softirq=1474922/1474922 fqs=0 > > [24.066497] 5-...: (8034 GPs behind) idle=7b4/0/0 > > softirq=69165/69166 fqs=0 > > [24.066499] 6-...: (1665 GPs behind) idle=716/0/0 > > softirq=16465/16465 fqs=0 > > [24.066502] 7-...: (8034 GPs behind) idle=bd0/0/0 > > softirq=15923/15924 fqs=0 > > [24.066505] 8-...: (1665 GPs behind) idle=a62/0/0 > > softirq=93861/93861 fqs=0 > > [24.066507] 9-...: (1258 GPs behind) idle=a8c/0/0 > > softirq=21608/21608 fqs=0 > > [24.066510] 11-...: (1665 GPs behind) idle=8e6/0/0 > > softirq=18844/18844 fqs=0 > > [24.066512] 13-...: (1665 GPs behind) idle=ab8/0/0 > > softirq=21278/21278 fqs=0 > > [24.066515] 14-...: (4854 GPs behind) idle=578/0/0 > > softirq=25624/25626 fqs=0 > > [24.066517] 15-...: (8034 GPs behind) idle=f5e/0/0 > > softirq=18114/18115 fqs=0 > > [24.066520] 16-...: (1665 GPs behind) idle=b66/0/0 > > softirq=19511/19511 fqs=0 > > [24.066528] 17-...: (0 ticks this GP) idle=e18/0/0 > > softirq=16163/16163 fqs=0 > > [24.066531] 18-...: (1665 GPs behind) idle=a18/0/0 > > softirq=16424/16424 fqs=0 > > [24.066533] 19-...: (1665 GPs behind) idle=944/0/0 > > softirq=17920/17920 fqs=0 > > [24.066549] 20-...: (20991 GPs behind) idle=750/0/0 > > softirq=17549/17552 fqs=0 > > [24.066552] 21-...: (0 ticks this GP) idle=9e0/0/0 > > softirq=18937/18937 fqs=0 > > [24.066555] 22-...: (14519 GPs behind) idle=95e/0/0 > > softirq=16122/16124 fqs=0 > > [24.066557] 23-...: (1665 GPs behind) idle=bbc/0/0 > > softirq=16316/16316 fqs=0 > > [24.066560] 24-...: (0 ticks this GP) idle=c56/0/0 > > softirq=20070/20070 fqs=0 > > [24.066563] 27-...: (8034 GPs behind) idle=bdc/0/0 > > softirq=18291/1
Re: BUG: rcu_sched detected stalls on CPUs
On Mon, May 08, 2017 at 09:43:15AM +0300, mi...@stz-bg.com wrote: > On 2017-01-20 17:19, Steven Rostedt wrote: > >On Fri, 20 Jan 2017 10:43:50 +0200 > >mi...@stz-bg.com wrote: > > > >>[1.] One line summary of the problem: > >> > >>rcu_sched detected stalls on CPUs and few minutes server not respond. > > > >Is this reproducible? Or was this a one time ordeal? > > > >> > >>[2.] Full description of the problem/report: > >> > >>Load of my server (postgres database) isnt big less then 0.50 and when > >>error occured rcu_sched detected stalls on CPUs > >>server freeze and nothing is work for 3-5 minute. > >>No network, no video signal, no keyboard, no mouse. Nothing is worked. > >>After these few minutes everything continue normal. > >>This usual is happend once per day. When I check in google find a lots > >>of ppl complain of this error, but no solution. > >>Do any one know can help me to resolve it ? I spoke with few > >>friends and > >>they trying to convince me the problem is in CPU. > >>I did not believe after a 3 years working CPU suddenly stop working > >>correctly, but I might be wrong. > >> > >>[3.] Keywords (i.e., modules, networking, kernel): > >> > >>kernel > >> > >>[4.] Kernel information > >>[4.1.] Kernel version (from /proc/version): > >> > >>Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 > >>SMP Sun > >>Dec 11 16:11:02 CST 2016 > >> > > > >Have you tried a newer version of the kernel? > > > >-- Steve > > > Hello, > yesterday I change to new kernel: 4.9.26 and still no effect. > > I trying to figure out what I need to buy because I read on google a > lots of posts about that problem, > some ppl suggest is BIOS firmware bug, some ppl tell that when they > swap CPU problem is resolved. > May be problem was started when I first time boot 4.x kernel and > there have cpu microcode updates. > Im 3.x kernels this feature was not in kernel, but I don't know, > only guess. > > Can some one point me clearly: You need to change this one and > problem will be solved ? Looking back at your earlier email, I see the following: [Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 jiffies! g61967937 c61967936 f0x0 s3 ->state=0x1 That indicates that the task "rcu_sched" isn't being allowed to run. The "->state=0x1" indicates that this task is currently asleep and the "s3" indicates that it is looking to start a grace period, but has not progressed through this task. My guess is that your system get very busy, and that rcu_sched's normal scheduling priority is not sufficient for it to be allowed to run. I therefore suggest that you try raising the rcu_sched task's priority. You can find this task like this: ps -ef | grep rcu_sched On my laptop, I get this: $ ps -ef | grep rcu_sched root 7 2 0 Apr11 ?00:19:19 [rcu_sched] paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto rcu_sche You can use the chrt command to set the priority. For example, given that on my laptop, rcu_sched is PID 7: sudo chrt -f -p 1 7 You can double-check as follows: $ chrt -p 7 pid 7's current scheduling policy: SCHED_FIFO pid 7's current scheduling priority: 1 Does that help? Thanx, Paul > [239940.067938] clocksource: timekeeping watchdog on CPU26: Marking > clocksource 'tsc' as unstable because the skew is too large: > [239940.067943] clocksource: 'hpet' wd_now: > ecb521ce wd_last: ca45912d mask: > [239940.067946] clocksource: 'tsc' cs_now: > 2d12df5f88d08 cs_last: 29fffbe2d36d6 mask: > [239940.068357] clocksource: Switched to clocksource hpet > [24.066457] INFO: rcu_sched detected stalls on CPUs/tasks: > [24.066488] 2-...: (38 GPs behind) idle=f1a/0/0 > softirq=2417028/2417028 fqs=0 > [24.066491] 3-...: (27 GPs behind) idle=0bc/0/0 > softirq=3076046/3076047 fqs=0 > [24.066494] 4-...: (1006 GPs behind) idle=308/0/0 > softirq=1474922/1474922 fqs=0 > [24.066497] 5-...: (8034 GPs behind) idle=7b4/0/0 > softirq=69165/69166 fqs=0 > [24.066499] 6-...: (1665 GPs behind) idle=716/0/0 > softirq=16465/16465 fqs=0 > [24.066502] 7-...: (8034 GPs behind) idle=bd0/0/0 > softirq=15923/15924 fqs=0 > [24.066505] 8-...: (1665 GPs behind) idle=a62/0/0 > softirq=93861/93861 fqs=0 > [24.066507] 9-...: (1258 GPs behind) idle=a8c/0/0 > softirq=21608/21608 fqs=0 > [24.066510] 11-...: (1665 GPs behind) idle=8e6/0/0 > softirq=18844/18844 fqs=0 > [24.066512] 13-...: (1665 GPs behind) idle=ab8/0/0 > softirq=21278/21278 fqs=0 > [24.066515] 14-...: (4854 GPs behind) idle=578/0/0 > softirq=25624/25626 fqs=0 > [24.066517] 15-...: (8034 GPs behind) idle=f5e/0/0 > softirq=18114/18115 fqs=0 > [24.066520] 16-...: (1665 GPs behind) idle=b66/0/0 > sof
Re: BUG: rcu_sched detected stalls on CPUs
On 2017-01-20 17:19, Steven Rostedt wrote: On Fri, 20 Jan 2017 10:43:50 +0200 mi...@stz-bg.com wrote: [1.] One line summary of the problem: rcu_sched detected stalls on CPUs and few minutes server not respond. Is this reproducible? Or was this a one time ordeal? [2.] Full description of the problem/report: Load of my server (postgres database) isnt big less then 0.50 and when error occured rcu_sched detected stalls on CPUs server freeze and nothing is work for 3-5 minute. No network, no video signal, no keyboard, no mouse. Nothing is worked. After these few minutes everything continue normal. This usual is happend once per day. When I check in google find a lots of ppl complain of this error, but no solution. Do any one know can help me to resolve it ? I spoke with few friends and they trying to convince me the problem is in CPU. I did not believe after a 3 years working CPU suddenly stop working correctly, but I might be wrong. [3.] Keywords (i.e., modules, networking, kernel): kernel [4.] Kernel information [4.1.] Kernel version (from /proc/version): Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 SMP Sun Dec 11 16:11:02 CST 2016 Have you tried a newer version of the kernel? -- Steve Hello, yesterday I change to new kernel: 4.9.26 and still no effect. I trying to figure out what I need to buy because I read on google a lots of posts about that problem, some ppl suggest is BIOS firmware bug, some ppl tell that when they swap CPU problem is resolved. May be problem was started when I first time boot 4.x kernel and there have cpu microcode updates. Im 3.x kernels this feature was not in kernel, but I don't know, only guess. Can some one point me clearly: You need to change this one and problem will be solved ? [239940.067938] clocksource: timekeeping watchdog on CPU26: Marking clocksource 'tsc' as unstable because the skew is too large: [239940.067943] clocksource: 'hpet' wd_now: ecb521ce wd_last: ca45912d mask: [239940.067946] clocksource: 'tsc' cs_now: 2d12df5f88d08 cs_last: 29fffbe2d36d6 mask: [239940.068357] clocksource: Switched to clocksource hpet [24.066457] INFO: rcu_sched detected stalls on CPUs/tasks: [24.066488] 2-...: (38 GPs behind) idle=f1a/0/0 softirq=2417028/2417028 fqs=0 [24.066491] 3-...: (27 GPs behind) idle=0bc/0/0 softirq=3076046/3076047 fqs=0 [24.066494] 4-...: (1006 GPs behind) idle=308/0/0 softirq=1474922/1474922 fqs=0 [24.066497] 5-...: (8034 GPs behind) idle=7b4/0/0 softirq=69165/69166 fqs=0 [24.066499] 6-...: (1665 GPs behind) idle=716/0/0 softirq=16465/16465 fqs=0 [24.066502] 7-...: (8034 GPs behind) idle=bd0/0/0 softirq=15923/15924 fqs=0 [24.066505] 8-...: (1665 GPs behind) idle=a62/0/0 softirq=93861/93861 fqs=0 [24.066507] 9-...: (1258 GPs behind) idle=a8c/0/0 softirq=21608/21608 fqs=0 [24.066510] 11-...: (1665 GPs behind) idle=8e6/0/0 softirq=18844/18844 fqs=0 [24.066512] 13-...: (1665 GPs behind) idle=ab8/0/0 softirq=21278/21278 fqs=0 [24.066515] 14-...: (4854 GPs behind) idle=578/0/0 softirq=25624/25626 fqs=0 [24.066517] 15-...: (8034 GPs behind) idle=f5e/0/0 softirq=18114/18115 fqs=0 [24.066520] 16-...: (1665 GPs behind) idle=b66/0/0 softirq=19511/19511 fqs=0 [24.066528] 17-...: (0 ticks this GP) idle=e18/0/0 softirq=16163/16163 fqs=0 [24.066531] 18-...: (1665 GPs behind) idle=a18/0/0 softirq=16424/16424 fqs=0 [24.066533] 19-...: (1665 GPs behind) idle=944/0/0 softirq=17920/17920 fqs=0 [24.066549] 20-...: (20991 GPs behind) idle=750/0/0 softirq=17549/17552 fqs=0 [24.066552] 21-...: (0 ticks this GP) idle=9e0/0/0 softirq=18937/18937 fqs=0 [24.066555] 22-...: (14519 GPs behind) idle=95e/0/0 softirq=16122/16124 fqs=0 [24.066557] 23-...: (1665 GPs behind) idle=bbc/0/0 softirq=16316/16316 fqs=0 [24.066560] 24-...: (0 ticks this GP) idle=c56/0/0 softirq=20070/20070 fqs=0 [24.066563] 27-...: (8034 GPs behind) idle=bdc/0/0 softirq=18291/18292 fqs=0 [24.066566] 28-...: (8034 GPs behind) idle=4f8/0/0 softirq=22313/22314 fqs=0 [24.066568] 29-...: (3443 GPs behind) idle=22a/0/0 softirq=24975/24976 fqs=0 [24.066571] 30-...: (1665 GPs behind) idle=940/0/0 softirq=89382/89382 fqs=0 [24.066572] (detected by 10, t=60002 jiffies, g=4843721, c=4843720, q=659) [24.066578] Task dump for CPU 2: [24.066581] swapper/2 R running task0 0 1 0x0020 [24.066588] 0083 a5cecd79 8aca3eea1e00 a64cab40 [24.066593] 0005 a64cad38 a96c80073eb0 a5ba9d93 [24.066597] 8aca3eea1e00 0e94fec6 8aca38535940 0002
Re: BUG: rcu_sched detected stalls on CPUs
On 2017-01-20 17:19, Steven Rostedt wrote: On Fri, 20 Jan 2017 10:43:50 +0200 mi...@stz-bg.com wrote: [1.] One line summary of the problem: rcu_sched detected stalls on CPUs and few minutes server not respond. Is this reproducible? Or was this a one time ordeal? It's happened usual once per day and can't be reproducible. If server load average is more then 0.70 it's will happened twice per day. [2.] Full description of the problem/report: Load of my server (postgres database) isnt big less then 0.50 and when error occured rcu_sched detected stalls on CPUs server freeze and nothing is work for 3-5 minute. No network, no video signal, no keyboard, no mouse. Nothing is worked. After these few minutes everything continue normal. This usual is happend once per day. When I check in google find a lots of ppl complain of this error, but no solution. Do any one know can help me to resolve it ? I spoke with few friends and they trying to convince me the problem is in CPU. I did not believe after a 3 years working CPU suddenly stop working correctly, but I might be wrong. [3.] Keywords (i.e., modules, networking, kernel): kernel [4.] Kernel information [4.1.] Kernel version (from /proc/version): Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 SMP Sun Dec 11 16:11:02 CST 2016 Have you tried a newer version of the kernel? No, for the moment I didnt because I can't find new one in slackware repos. My last kernel compile was almost 7 years ago and I do not feel safe now, I guess a lot of things have changed. -- Steve Regards, Mitko
Re: BUG: rcu_sched detected stalls on CPUs
On Fri, 20 Jan 2017 10:43:50 +0200 mi...@stz-bg.com wrote: > [1.] One line summary of the problem: > > rcu_sched detected stalls on CPUs and few minutes server not respond. Is this reproducible? Or was this a one time ordeal? > > [2.] Full description of the problem/report: > > Load of my server (postgres database) isnt big less then 0.50 and when > error occured rcu_sched detected stalls on CPUs > server freeze and nothing is work for 3-5 minute. > No network, no video signal, no keyboard, no mouse. Nothing is worked. > After these few minutes everything continue normal. > This usual is happend once per day. When I check in google find a lots > of ppl complain of this error, but no solution. > Do any one know can help me to resolve it ? I spoke with few friends and > they trying to convince me the problem is in CPU. > I did not believe after a 3 years working CPU suddenly stop working > correctly, but I might be wrong. > > [3.] Keywords (i.e., modules, networking, kernel): > > kernel > > [4.] Kernel information > [4.1.] Kernel version (from /proc/version): > > Linux version 4.4.38 (root@hive64) (gcc version 5.4.0 (GCC) ) #2 SMP Sun > Dec 11 16:11:02 CST 2016 > Have you tried a newer version of the kernel? -- Steve