usertime stale at about 371k seconds

2012-05-30 Thread Andrey Zonov

Hi,

I have long running process for which `ps -o usertime -p $pid' shows 
always the same time - 6190:07.65, `ps -o cputime -p $pid' for the same 
process continue to grow and now it's 21538:53.61.  It looks like 
overflow in resource usage code or something.


Any ideas?

--
Andrey Zonov
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: usertime stale at about 371k seconds

2012-05-31 Thread Andrey Zonov

On 5/30/12 11:27 PM, Andrey Zonov wrote:

Hi,

I have long running process for which `ps -o usertime -p $pid' shows
always the same time - 6190:07.65, `ps -o cputime -p $pid' for the same
process continue to grow and now it's 21538:53.61. It looks like
overflow in resource usage code or something.



I reproduced that problem with attached program.  I ran it with 23 
threads on machine with 24 CPUs and after night I see this:


$ ps -o usertime,time -p 24134 && sleep 60 && ps -o usertime,time -p 24134
  USERTIMETIME
6351:24.74 14977:35.19
  USERTIMETIME
6351:24.74 15000:34.53

Per thread user-time counts correct:

$ ps -H -o usertime,time -p 24134
 USERTIME  TIME
  0:00.00   0:00.00
652:35.84 652:38.59
652:34.75 652:37.97
652:50.46 652:51.97
652:38.93 652:43.08
652:39.73 652:43.36
652:44.09 652:47.36
652:56.49 652:57.94
652:51.84 652:54.41
652:37.48 652:41.57
652:36.61 652:40.90
652:39.41 652:42.52
653:03.72 653:06.72
652:49.96 652:53.25
652:45.92 652:49.03
652:40.33 652:42.05
652:46.53 652:49.31
652:44.77 652:47.33
653:00.54 653:02.24
652:33.31 652:36.13
652:51.03 652:52.91
652:50.73 652:52.71
652:41.32 652:44.64
652:59.86 653:03.25

(kgdb) p $my->p_rux
$14 = {rux_runtime = 2171421985692826, rux_uticks = 114886093, 
rux_sticks = 8353, rux_iticks = 0, rux_uu = 381084736784, rux_su = 
65773652, rux_tu = 904571706136}

(kgdb) p $my->p_rux
$15 = {rux_runtime = 2191831516209186, rux_uticks = 115966087, 
rux_sticks = 8444, rux_iticks = 0, rux_uu = 381084736784, rux_su = 
66458587, rux_tu = 913099969825}


As you can see rux_uu stale, but rux_uticks still ticks.  I think the 
problem is in calcru1().  This expression


uu = (tu * ut) / tt

overflows.

I applied the following patch:

Index: /usr/src/sys/kern/kern_resource.c
===
--- /usr/src/sys/kern/kern_resource.c   (revision 235394)
+++ /usr/src/sys/kern/kern_resource.c   (working copy)
@@ -885,7 +885,7 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s
 struct timeval *sp)
 {
/* {user, system, interrupt, total} {ticks, usec}: */
-   uint64_t ut, uu, st, su, it, tt, tu;
+   uint64_t ut, uu, st, su, it, tt, tu, tmp;

ut = ruxp->rux_uticks;
st = ruxp->rux_sticks;
@@ -909,10 +909,20 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s
 * The normal case, time increased.
 * Enforce monotonicity of bucketed numbers.
 */
-   uu = (tu * ut) / tt;
+   if (ut == 0)
+   uu = 0;
+   else {
+   tmp = tt / ut;
+   uu = tmp ? tu / tmp : 0;
+   }
if (uu < ruxp->rux_uu)
uu = ruxp->rux_uu;

and now ran test again.

--
Andrey Zonov
/*
 * Andrey Zonov (c) 2012
 */

#include 
#include 
#include 

void *func(void *arg);

int
main(int argc, char **argv)
{
int i;
int threads;
int *tid;
pthread_t *tds;

if (argc != 2)
errx(1, "usage: usertime ");

threads = atoi(argv[1]);
tid = malloc(sizeof(int) * threads);
tds = malloc(sizeof(pthread_t) * threads);

for (i = 0; i < threads; i++) {
tid[i] = i;
if (pthread_create(&tds[i], NULL, func, &tid[i]) != 0)
err(1, "pthread_create(%d)", i);
}

for (i = 0; i < threads; i++)
if (pthread_join(tds[i], NULL) != 0)
err(1, "pthread_join(%d)", i);

exit(0);
}

void *
func(void *arg __unused)
{
int i;

#define MAX (1<<20)

for (i = 0; i < MAX; i++) {
if ((i % (MAX - 1)) == 0) {
i = 0;
/*usleep(1);*/
}
}

pthread_exit(NULL);
}
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"

Re: usertime stale at about 371k seconds

2012-05-31 Thread Adrian Chadd
Hi,

Would you please file a PR with all of your work? :-)

Thanks,


Adrian
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: usertime stale at about 371k seconds

2012-06-11 Thread Eric van Gyzen

On 05/31/2012 02:34, Andrey Zonov wrote:

On 5/30/12 11:27 PM, Andrey Zonov wrote:

Hi,

I have long running process for which `ps -o usertime -p $pid' shows
always the same time - 6190:07.65, `ps -o cputime -p $pid' for the same
process continue to grow and now it's 21538:53.61. It looks like
overflow in resource usage code or something.



I reproduced that problem with attached program. I ran it with 23
threads on machine with 24 CPUs and after night I see this:

$ ps -o usertime,time -p 24134 && sleep 60 && ps -o usertime,time -p 24134
USERTIME TIME
6351:24.74 14977:35.19
USERTIME TIME
6351:24.74 15000:34.53

Per thread user-time counts correct:

$ ps -H -o usertime,time -p 24134
USERTIME TIME
0:00.00 0:00.00
652:35.84 652:38.59
652:34.75 652:37.97
652:50.46 652:51.97
652:38.93 652:43.08
652:39.73 652:43.36
652:44.09 652:47.36
652:56.49 652:57.94
652:51.84 652:54.41
652:37.48 652:41.57
652:36.61 652:40.90
652:39.41 652:42.52
653:03.72 653:06.72
652:49.96 652:53.25
652:45.92 652:49.03
652:40.33 652:42.05
652:46.53 652:49.31
652:44.77 652:47.33
653:00.54 653:02.24
652:33.31 652:36.13
652:51.03 652:52.91
652:50.73 652:52.71
652:41.32 652:44.64
652:59.86 653:03.25

(kgdb) p $my->p_rux
$14 = {rux_runtime = 2171421985692826, rux_uticks = 114886093,
rux_sticks = 8353, rux_iticks = 0, rux_uu = 381084736784, rux_su =
65773652, rux_tu = 904571706136}
(kgdb) p $my->p_rux
$15 = {rux_runtime = 2191831516209186, rux_uticks = 115966087,
rux_sticks = 8444, rux_iticks = 0, rux_uu = 381084736784, rux_su =
66458587, rux_tu = 913099969825}

As you can see rux_uu stale, but rux_uticks still ticks. I think the
problem is in calcru1(). This expression

uu = (tu * ut) / tt

overflows.

I applied the following patch:

Index: /usr/src/sys/kern/kern_resource.c
===
--- /usr/src/sys/kern/kern_resource.c (revision 235394)
+++ /usr/src/sys/kern/kern_resource.c (working copy)
@@ -885,7 +885,7 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s
struct timeval *sp)
{
/* {user, system, interrupt, total} {ticks, usec}: */
- uint64_t ut, uu, st, su, it, tt, tu;
+ uint64_t ut, uu, st, su, it, tt, tu, tmp;

ut = ruxp->rux_uticks;
st = ruxp->rux_sticks;
@@ -909,10 +909,20 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s
* The normal case, time increased.
* Enforce monotonicity of bucketed numbers.
*/
- uu = (tu * ut) / tt;
+ if (ut == 0)
+ uu = 0;
+ else {
+ tmp = tt / ut;
+ uu = tmp ? tu / tmp : 0;
+ }
if (uu < ruxp->rux_uu)
uu = ruxp->rux_uu;

and now ran test again.


This looks related to, and possibly identical to, PR kern/76972:

http://www.freebsd.org/cgi/query-pr.cgi?pr=76972

If you filed a PR, please submit a follow-up to both PRs so they 
reference each other.


Thanks,

Eric
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: usertime stale at about 371k seconds

2012-06-12 Thread Andrey Zonov

On 6/11/12 7:33 PM, Eric van Gyzen wrote:

On 05/31/2012 02:34, Andrey Zonov wrote:

On 5/30/12 11:27 PM, Andrey Zonov wrote:

Hi,

I have long running process for which `ps -o usertime -p $pid' shows
always the same time - 6190:07.65, `ps -o cputime -p $pid' for the same
process continue to grow and now it's 21538:53.61. It looks like
overflow in resource usage code or something.



I reproduced that problem with attached program. I ran it with 23
threads on machine with 24 CPUs and after night I see this:

$ ps -o usertime,time -p 24134 && sleep 60 && ps -o usertime,time -p
24134
USERTIME TIME
6351:24.74 14977:35.19
USERTIME TIME
6351:24.74 15000:34.53

Per thread user-time counts correct:

$ ps -H -o usertime,time -p 24134
USERTIME TIME
0:00.00 0:00.00
652:35.84 652:38.59
652:34.75 652:37.97
652:50.46 652:51.97
652:38.93 652:43.08
652:39.73 652:43.36
652:44.09 652:47.36
652:56.49 652:57.94
652:51.84 652:54.41
652:37.48 652:41.57
652:36.61 652:40.90
652:39.41 652:42.52
653:03.72 653:06.72
652:49.96 652:53.25
652:45.92 652:49.03
652:40.33 652:42.05
652:46.53 652:49.31
652:44.77 652:47.33
653:00.54 653:02.24
652:33.31 652:36.13
652:51.03 652:52.91
652:50.73 652:52.71
652:41.32 652:44.64
652:59.86 653:03.25

(kgdb) p $my->p_rux
$14 = {rux_runtime = 2171421985692826, rux_uticks = 114886093,
rux_sticks = 8353, rux_iticks = 0, rux_uu = 381084736784, rux_su =
65773652, rux_tu = 904571706136}
(kgdb) p $my->p_rux
$15 = {rux_runtime = 2191831516209186, rux_uticks = 115966087,
rux_sticks = 8444, rux_iticks = 0, rux_uu = 381084736784, rux_su =
66458587, rux_tu = 913099969825}

As you can see rux_uu stale, but rux_uticks still ticks. I think the
problem is in calcru1(). This expression

uu = (tu * ut) / tt

overflows.

I applied the following patch:

Index: /usr/src/sys/kern/kern_resource.c
===
--- /usr/src/sys/kern/kern_resource.c (revision 235394)
+++ /usr/src/sys/kern/kern_resource.c (working copy)
@@ -885,7 +885,7 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s
struct timeval *sp)
{
/* {user, system, interrupt, total} {ticks, usec}: */
- uint64_t ut, uu, st, su, it, tt, tu;
+ uint64_t ut, uu, st, su, it, tt, tu, tmp;

ut = ruxp->rux_uticks;
st = ruxp->rux_sticks;
@@ -909,10 +909,20 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s
* The normal case, time increased.
* Enforce monotonicity of bucketed numbers.
*/
- uu = (tu * ut) / tt;
+ if (ut == 0)
+ uu = 0;
+ else {
+ tmp = tt / ut;
+ uu = tmp ? tu / tmp : 0;
+ }
if (uu < ruxp->rux_uu)
uu = ruxp->rux_uu;

and now ran test again.


This looks related to, and possibly identical to, PR kern/76972:

http://www.freebsd.org/cgi/query-pr.cgi?pr=76972


Yes, that's the same.



If you filed a PR, please submit a follow-up to both PRs so they
reference each other.


No, I didn't.  I want to fix the problem not just file a PR and wait for 
years.




Thanks,



Thank you.


Eric



--
Andrey Zonov
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: usertime stale at about 371k seconds

2012-06-12 Thread Andrey Zonov

On 5/31/12 11:34 AM, Andrey Zonov wrote:

On 5/30/12 11:27 PM, Andrey Zonov wrote:

Hi,

I have long running process for which `ps -o usertime -p $pid' shows
always the same time - 6190:07.65, `ps -o cputime -p $pid' for the same
process continue to grow and now it's 21538:53.61. It looks like
overflow in resource usage code or something.



I reproduced that problem with attached program. I ran it with 23
threads on machine with 24 CPUs and after night I see this:

$ ps -o usertime,time -p 24134 && sleep 60 && ps -o usertime,time -p 24134
USERTIME TIME
6351:24.74 14977:35.19
USERTIME TIME
6351:24.74 15000:34.53

Per thread user-time counts correct:

$ ps -H -o usertime,time -p 24134
USERTIME TIME
0:00.00 0:00.00
652:35.84 652:38.59
652:34.75 652:37.97
652:50.46 652:51.97
652:38.93 652:43.08
652:39.73 652:43.36
652:44.09 652:47.36
652:56.49 652:57.94
652:51.84 652:54.41
652:37.48 652:41.57
652:36.61 652:40.90
652:39.41 652:42.52
653:03.72 653:06.72
652:49.96 652:53.25
652:45.92 652:49.03
652:40.33 652:42.05
652:46.53 652:49.31
652:44.77 652:47.33
653:00.54 653:02.24
652:33.31 652:36.13
652:51.03 652:52.91
652:50.73 652:52.71
652:41.32 652:44.64
652:59.86 653:03.25

(kgdb) p $my->p_rux
$14 = {rux_runtime = 2171421985692826, rux_uticks = 114886093,
rux_sticks = 8353, rux_iticks = 0, rux_uu = 381084736784, rux_su =
65773652, rux_tu = 904571706136}
(kgdb) p $my->p_rux
$15 = {rux_runtime = 2191831516209186, rux_uticks = 115966087,
rux_sticks = 8444, rux_iticks = 0, rux_uu = 381084736784, rux_su =
66458587, rux_tu = 913099969825}

As you can see rux_uu stale, but rux_uticks still ticks. I think the
problem is in calcru1(). This expression

uu = (tu * ut) / tt

overflows.

I applied the following patch:



I've made some explorations and found that this expression 
'(uint64_t)a*(uint64_t)b/(uint64_t)c' can be replaced with this '(a/c)*b 
+ (a%c)*(b/c) + (a%c)*(b%c)/c' and will be perfect for 0as 'c' is sum of ticks, overflow occurs after 
2^32/128(stathz)/60(sec)/60(min)/24(hours) = 388 days! or after 16 days 
on machine with 24 cores.  That's better than we got now.


In user-land I can use (__uint128_t)a*b/c for this purpose but kernel 
doesn't built with it.  If you know good algorithm how to calculate 
'(uint64_t)a*(uint64_t)b/(uint64_t)c' for 'c > 2^32' please let me know.


--
Andrey Zonov
Index: sys/kern/kern_resource.c
===
--- sys/kern/kern_resource.c(revision 234600)
+++ sys/kern/kern_resource.c(working copy)
@@ -880,6 +880,8 @@ rufetchtd(struct thread *td, struct rusage *ru)
calcru1(p, &td->td_rux, &ru->ru_utime, &ru->ru_stime);
 }
 
+#definemul_div(a, b, c)(a/c)*b + (a%c)*(b/c) + (a%c)*(b%c)/c
+
 static void
 calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
 struct timeval *sp)
@@ -909,10 +911,10 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s
 * The normal case, time increased.
 * Enforce monotonicity of bucketed numbers.
 */
-   uu = (tu * ut) / tt;
+   uu = mul_div(tu, ut, tt);
if (uu < ruxp->rux_uu)
uu = ruxp->rux_uu;
-   su = (tu * st) / tt;
+   su = mul_div(tu, st, tt);
if (su < ruxp->rux_su)
su = ruxp->rux_su;
} else if (tu + 3 > ruxp->rux_tu || 101 * tu > 100 * ruxp->rux_tu) {
@@ -941,8 +943,8 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s
"to %ju usec for pid %d (%s)\n",
(uintmax_t)ruxp->rux_tu, (uintmax_t)tu,
p->p_pid, p->p_comm);
-   uu = (tu * ut) / tt;
-   su = (tu * st) / tt;
+   uu = mul_div(tu, ut, tt);
+   su = mul_div(tu, st, tt);
}
 
ruxp->rux_uu = uu;
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"

Re: usertime stale at about 371k seconds

2012-06-12 Thread Andrey Zonov

On 6/13/12 1:21 AM, Mark Linimon wrote:

On Wed, Jun 13, 2012 at 12:30:08AM +0400, Andrey Zonov wrote:

No, I didn't.  I want to fix the problem not just file a PR and wait
for years.


I do understand your frustration, but we have some new people interested
in picking up and handling src-related PRs, so I see the situation as
improving a bit.



Hi Mark,

Please look at the date of PR/76972.  More than 7 years past since it 
was filed and I can't see any progress.  I've got more PRs (not only 
mine) that was filed but never touched for years.  That's about frustration.


On the other hand I can read and write in C, I can understand FreeBSD 
code and can solve some problems by myself.  My patches go forward to 
maintainers.  That works better than just file a PR and wait.


--
Andrey Zonov
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: usertime stale at about 371k seconds

2012-06-12 Thread Mark Linimon
On Wed, Jun 13, 2012 at 12:30:08AM +0400, Andrey Zonov wrote:
> No, I didn't.  I want to fix the problem not just file a PR and wait
> for years.

I do understand your frustration, but we have some new people interested
in picking up and handling src-related PRs, so I see the situation as
improving a bit.

mcl
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: usertime stale at about 371k seconds

2012-06-12 Thread Adrian Chadd
Hi,

The best thing IMHO to do is:

* file a PR with a description of the problem, how to reproduce it,
and include a patch;
* then poke the maintainer directly if you know who it is;
* then keep gently poking them if they forget.

Having a good description of a problem, including how to reproduce it,
makes it quite a bit more useful to me as a software archeologist when
I want to try and understand the why behind a fix. The what may be
easy to figure out, but not the original reason. Doubly so when I
decide to take over a subsystem and would like to understand some of
the history/evolution of the codebase.

2c, YMMV, etc


Adrian
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"