Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread James Miller
On 23 March 2012 17:53, Juan Manuel Cabo juanmanuel.c...@gmail.com wrote:
> But I think the most important change is that I'm now showing
> the 95% and 99% confidence intervals. (For the confidence intervals
> to mean anything, please everyone, remember to control
> your variables (don't defrag and benchmark :-) !!) so that apples
> are still apples and don't become oranges, and make sure N > 30).
>
> More info on histogram and confidence intervals in the
> usage help.

Dude, this is awesome. I tend to just use time, but if I was doing
anything more complicated, I'd use this. I would suggest changing the
name while you still can. avgtime is not that informative a name given
that it now does more than just Average times.

--
James Miller


Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Juan Manuel Cabo
On Friday, 23 March 2012 at 05:16:20 UTC, Andrei Alexandrescu wrote:
> [.]
>> (man, the gaussian curve is everywhere, it never ceases to
>> perplex me).
>
> I'm actually surprised. I'm working on benchmarking lately and
> the distributions I get are very concentrated around the minimum.
>
> Andrei



Well, the shape of the curve depends a lot on
how the random noise gets into the measurement.

I like 'ls -lR' because the randomness comes
from everywhere, and it's quite bell-shaped.
I guess there is a lot of I/O mess (even if
I/O is all cached, there are lots of opportunities
for kernel mutexes to mess everything up).

When testing /bin/sleep 0.5, you get
a pretty boring histogram.

And I guess that when testing something that's only
CPU bound and doesn't make too many syscalls,
the shape is more concentrated in a few values.


On the other hand, I'm getting some weird bimodal
(two peaks) curves sometimes, like the one I put in
the README.md.
It's definitely because of my laptop's CPU throttling,
because it went away when I disabled it. For the curious
ones, in ubuntu 64bit, here is a way to disable
throttling (WARNING: might get hot until you undo or reboot):

echo 1600000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo 1600000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq

(yes my cpu is 1.6GHz, but it rocks).


--jm





Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Juan Manuel Cabo

On Thursday, 22 March 2012 at 17:13:58 UTC, Manfred Nowak wrote:
> Juan Manuel Cabo wrote:
>
>> like the unix 'time' command
>
> `version linux' is missing.
>
> -manfred



Linux only for now. Will make it work in windows this weekend.

I hope that's what you meant.

--jm




Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Juan Manuel Cabo

On Friday, 23 March 2012 at 06:51:48 UTC, James Miller wrote:
> Dude, this is awesome. I tend to just use time, but if I was doing
> anything more complicated, I'd use this. I would suggest changing the
> name while you still can. avgtime is not that informative a name given
> that it now does more than just Average times.
>
> --
> James Miller

> Dude, this is awesome.

Thanks!! I appreciate your feedback!

> I would suggest changing the name while you still can.


Suggestions welcome!!

--jm



Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Juan Manuel Cabo

On Friday, 23 March 2012 at 05:51:40 UTC, Manfred Nowak wrote:

> | For samples, if it is known that they are drawn from a symmetric
> | distribution, the sample mean can be used as an estimate of the
> | population mode.


I'm not printing the population mode, I'm printing the 'sample mode'.
It has a very clear meaning: most frequent value. To have frequency,
I group into 'bins' by precision: 12.345 and 12.3111 will both
go to the 12.3 bin.
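
A minimal sketch of the binning described above, in D. It is not taken from avgtime's source: the name `sampleMode`, the `decimals` parameter, and the choice to truncate rather than round are all assumptions made for illustration. Ties go to the bin that reaches the top count first.

```d
import std.math : fabs, floor, pow;

/// Hypothetical helper (not avgtime's code): returns the lower edge of the
/// most populated bin after truncating each sample to `decimals` places.
double sampleMode(const(double)[] samples, int decimals)
{
    immutable double scale = pow(10.0, decimals);
    size_t[long] counts;      // bin index -> number of samples in that bin
    long bestBin = 0;
    size_t bestCount = 0;

    foreach (x; samples)
    {
        immutable long bin = cast(long) floor(x * scale);
        if (auto p = bin in counts)
            ++*p;
        else
            counts[bin] = 1;

        if (counts[bin] > bestCount)
        {
            bestCount = counts[bin];
            bestBin = bin;
        }
    }
    return bestBin / scale;
}

unittest
{
    // 12.345 and 12.3111 share the 12.3 bin, which makes it the mode.
    immutable m = sampleMode([12.345, 12.3111, 12.71, 13.02], 1);
    assert(fabs(m - 12.3) < 1e-9);
}
```

With one decimal of precision, both 12.345 and 12.3111 are counted under 12.3, so the sample mode depends on the chosen precision as much as on the data.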



> and the program computes the variance as if the values of the sample
> follow a normal distribution, which is symmetric.

This program doesn't compute the variance. Maybe you are talking
about another program. This program computes the standard deviation
of the sample. The sample doesn't need to be of any distribution
to have a standard deviation. It is not a distribution parameter,
it is a statistic.

> Therefore the mode of the sample is of interest only, when the variance
> is calculated wrongly.


???

The 'sample mode', 'median' and 'average' can quickly tell you
something about the shape of the histogram, without
looking at it.
If the three coincide, then maybe you are in normal distribution land.

The only place where I assume normal distribution is for the
confidence intervals. And it's in the usage help.

If you want to support estimating the parameters of weird
probability distributions, forking and pull requests are
welcome. Rewrites too. Good luck detecting distribution
shapes  ;-)

> -manfred


PS: I should use Student's t to make the confidence intervals,
and for computing that I should use the sample standard
deviation (/n-1), but that is a completely different story.
The z normal approximation with n > 30 is quite good.
(I would have to embed a table for the Student's t tail factors,
pull reqs welcome).

PS2: I have now fixed the confusion between the confidence interval
of the variable and the confidence interval of the mean (mu);
I simply show both now (release 0.4).

PS3: Statistics estimate distribution parameters.

--jm
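
To make the two intervals from the PS notes concrete, here is a hedged D sketch of the normal (z) approximation. The names `Interval`, `zFactor`, `variableInterval` and `meanInterval` are invented for this example and are not avgtime's API; `normalDistributionInverse` is the existing function from `std.mathspecial`. The sketch assumes roughly normal data and n > 30, as stated in the post.

```d
import std.algorithm : map, sum;
import std.math : fabs, sqrt;
import std.mathspecial : normalDistributionInverse;

struct Interval { double lo, hi; }

double mean(const(double)[] xs)
{
    return xs.sum / xs.length;
}

/// Sample standard deviation (divides by n-1).
double sampleStdDev(const(double)[] xs)
{
    immutable double m = mean(xs);
    immutable double ss = xs.map!(x => (x - m) * (x - m)).sum;
    return sqrt(ss / (xs.length - 1));
}

/// Two-sided z factor, about 1.96 for level = 0.95.
double zFactor(double level)
{
    return cast(double) normalDistributionInverse(0.5 + level / 2);
}

/// Interval for a single measurement, assuming normally distributed data.
Interval variableInterval(const(double)[] xs, double level)
{
    immutable double h = zFactor(level) * sampleStdDev(xs);
    return Interval(mean(xs) - h, mean(xs) + h);
}

/// Interval for the mean (mu); its width shrinks like 1/sqrt(n).
Interval meanInterval(const(double)[] xs, double level)
{
    immutable double h = zFactor(level) * sampleStdDev(xs)
                       / sqrt(cast(double) xs.length);
    return Interval(mean(xs) - h, mean(xs) + h);
}
```

For small samples, `zFactor` is where a Student's t factor with n-1 degrees of freedom would be substituted.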





Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread James Miller
On 23 March 2012 21:37, Juan Manuel Cabo juanmanuel.c...@gmail.com wrote:
> PS: I should use Student's t to make the confidence intervals,
> and for computing that I should use the sample standard
> deviation (/n-1), but that is a completely different story.
> The z normal approximation with n > 30 is quite good.
> (I would have to embed a table for the Student's t tail factors,
> pull reqs welcome).

If it's possible to calculate it, then you can generate a table at
compile-time using CTFE. Less error-prone, and controllable accuracy.

--
James Miller


Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Don Clugston

On 23/03/12 09:37, Juan Manuel Cabo wrote:

> [snip]
> PS: I should use Student's t to make the confidence intervals,
> and for computing that I should use the sample standard
> deviation (/n-1), but that is a completely different story.
> The z normal approximation with n > 30 is quite good.
> (I would have to embed a table for the Student's t tail factors,
> pull reqs welcome).


No, it's easy. Student t is in std.mathspecial.




> PS2: I have now fixed the confusion between the confidence interval
> of the variable and the confidence interval of the mean (mu);
> I simply show both now (release 0.4).
>
> PS3: Statistics estimate distribution parameters.
>
> --jm







Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Don Clugston

On 23/03/12 11:20, Don Clugston wrote:

> On 23/03/12 09:37, Juan Manuel Cabo wrote:
>> [snip]
>> PS: I should use Student's t to make the confidence intervals,
>> and for computing that I should use the sample standard
>> deviation (/n-1), but that is a completely different story.
>> The z normal approximation with n > 30 is quite good.
>> (I would have to embed a table for the Student's t tail factors,
>> pull reqs welcome).
>
> No, it's easy. Student t is in std.mathspecial.


Aargh, I didn't get around to copying it in. But this should do it.

import std.math : fabs, sqrt;
import std.mathspecial : betaIncompleteInv;

/** Inverse of Student's t distribution
 *
 * Given probability p and degrees of freedom nu,
 * finds the argument t such that the one-sided
 * studentsDistribution(nu,t) is equal to p.
 *
 * Params:
 * nu = degrees of freedom. Must be >1
 * p  = probability. 0 < p < 1
 */
real studentsTDistributionInv(int nu, real p )
in {
   assert(nu>0);
   assert(p>=0.0L && p<=1.0L);
}
body
{
    if (p==0) return -real.infinity;
    if (p==1) return real.infinity;

    real rk, z;
    rk =  nu;

    if ( p > 0.25L && p < 0.75L ) {
        if ( p == 0.5L ) return 0;
        z = 1.0L - 2.0L * p;
        z = betaIncompleteInv( 0.5L, 0.5L*rk, fabs(z) );
        real t = sqrt( rk*z/(1.0L-z) );
        if( p < 0.5L )
            t = -t;
        return t;
    }
    int rflg = -1; // sign of the result
    if (p >= 0.5L) {
        p = 1.0L - p;
        rflg = 1;
    }
    z = betaIncompleteInv( 0.5L*rk, 0.5L, 2.0L*p );

    if (z<0) return rflg * real.infinity;
    return rflg * sqrt( rk/z - rk );
}


Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Andrei Alexandrescu

On 3/23/12 12:51 AM, Manfred Nowak wrote:
> Andrei Alexandrescu wrote:
>
>> You may want to also print the mode of the distribution,
>> nontrivial but informative
>
> In case of this implementation and according to the given link: trivial
> and noninformative, because
>
> | For samples, if it is known that they are drawn from a symmetric
> | distribution, the sample mean can be used as an estimate of the
> | population mode.
>
> and the program computes the variance as if the values of the sample
> follow a normal distribution, which is symmetric.
>
> Therefore the mode of the sample is of interest only, when the variance
> is calculated wrongly.


Again, benchmarks I've seen are always asymmetric. Not sure why those 
shown here are symmetric. The mode should be very close to the minimum 
(and in fact I think taking the minimum is a pretty good approximation 
of the sought-after time).


Andrei




Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Andrei Alexandrescu

On 3/23/12 3:02 AM, Juan Manuel Cabo wrote:
> On Friday, 23 March 2012 at 05:16:20 UTC, Andrei Alexandrescu wrote:
>> [.]
>>> (man, the gaussian curve is everywhere, it never ceases to
>>> perplex me).
>>
>> I'm actually surprised. I'm working on benchmarking lately and the
>> distributions I get are very concentrated around the minimum.
>>
>> Andrei
>
> Well, the shape of the curve depends a lot on
> how the random noise gets inside the measurement.

[snip]

Hmm, well the way I see it, the observed measurements have the following 
composition:


X = T + Q + N

where T > 0 (a constant) is the real time taken by the processing,
Q > 0 is the quantization noise caused by the limited resolution of the
clock (can be considered 0 if the resolution is much smaller than the
actual time), and N is noise caused by a variety of factors (other
processes, throttling, interrupts, networking, memory hierarchy effects,
and many more). The challenge is estimating T given a bunch of X samples.


N can probably be approximated by a Gaussian, although for short timings
I noticed it's more like bursts that just cause outliers. But note that
N is always positive (therefore not 100% Gaussian), i.e. there's no way
to insert some noise that makes the code seem artificially faster. It's
all additive.


Taking the mode of the distribution will estimate T + mode(N), which is 
informative because after all there's no way to eliminate noise. 
However, if the focus is improving T, we want an estimate as close to T 
as possible. In the limit, taking the minimum over infinitely many 
measurements of X would yield T.



Andrei
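
The additive model X = T + Q + N can be explored with a small simulation. The D sketch below is only an illustration under stated assumptions: Q is taken as 0, and the noise N = u*u (u uniform in [0, 1)) is an arbitrary non-negative choice, not a model of real benchmark noise. The function name `simulate` is made up for this example.

```d
import std.algorithm : minElement, sum;
import std.random : Random, uniform;

/// Simulates n measurements X = T + N, with Q assumed to be 0.
/// The noise N = u*u is an arbitrary non-negative choice for illustration.
double[] simulate(double t, size_t n, uint seed)
{
    auto rng = Random(seed);
    auto xs = new double[n];
    foreach (ref x; xs)
    {
        immutable double u = uniform(0.0, 1.0, rng);  // u in [0, 1)
        x = t + u * u;                                // noise only adds time
    }
    return xs;
}

unittest
{
    immutable double t = 10.0;
    auto xs = simulate(t, 10_000, 42);
    immutable double lowest = xs.minElement;
    immutable double average = xs.sum / xs.length;

    assert(lowest >= t);        // additive noise never makes a run faster
    assert(average > lowest);   // the mean also carries the average noise
    assert(lowest - t < 0.01);  // the minimum closes in on T as n grows
}
```

Under this noise model the mean settles near T plus the average noise, while the minimum approaches T itself, which is the point made above about taking the minimum in the limit.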


Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Andrei Alexandrescu

On 3/23/12 5:51 AM, Don Clugston wrote:
>> No, it's easy. Student t is in std.mathspecial.
>
> Aargh, I didn't get around to copying it in. But this should do it.

[snip]

Shouldn't we put this stuff in std.numeric, or create a std.stat module? I
think some functions for the t-test would also be useful.


Andrei


Walter on reddit with an older article

2012-03-23 Thread Andrei Alexandrescu

http://www.reddit.com/r/programming/comments/r9p4c/walter_bright_on_c_compilation_speed/

Andrei


GSoC: Linear Algebra and the SciD library

2012-03-23 Thread Cullen Seaton

Hello,
I'm a third year undergraduate at the University of Chicago 
majoring in mathematics. I'm very interested in working on the 
Matrix library through Google summer of code. The ideas page 
mentions that progress has already been made but that goals 
weren't completely met. What kind of support is already in place? 
Are there any specific types of functions that you would like to 
see added to the library? Although I'm relatively new to coding, 
I have a strong background in mathematics (including linear 
algebra). I've coded mainly in C but also in Java, Python, and 
a little in Racket. Is this project appropriate for an 
enthusiastic participant who is not yet an expert hacker? Thanks 
for your time,


Cullen Seaton
University of Chicago
Class of 2013


Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Juan Manuel Cabo
On Friday, 23 March 2012 at 15:33:18 UTC, Andrei Alexandrescu 
wrote:

> [snip]


In general, I agree with your reasoning. And I appreciate you
taking the time to put it so eloquently!!

But I think that considering T as a constant, and
preferring the minimum, misses something. This might work
very well for benchmarking mostly CPU bound processes,
but all those other things that you consider noise
(disk I/O, network, memory hierarchy, etc.) are part
of the elements that make an algorithm or program faster
than another, and I would consider them inside T for
some applications.

Consider the case depicted in this wonderful (ranty) article
that was posted elsewhere in this thread:
http://zedshaw.com/essays/programmer_stats.html
In a part of the article, the guy talks about a
system that worked fast most of the time, but would halt
for a good 1 or 2 minutes sometimes.

The minimum time for such a system might be a few ms, but
the standard deviation would be big. This properly shifts
the average time away from the minimum.

If programA does the same task as programB with less I/O,
or with better memory layout, etc., its average will be
better, and maybe its timings won't be so spread out. But
the minimum will be the same.

So, in the end, I'm just happy that I could share this
little avgtime with you all, and as usual there is
no one-size-fits-all answer. For some applications, the
minimum will be enough. For others, it's essential to look
at how spread out the sample is.
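
A small D sketch of that point, using made-up timings (in milliseconds) rather than real measurements: two samples share the same minimum, but one contains a single long stall. `Summary` and `summarize` are hypothetical names introduced for this example.

```d
import std.algorithm : map, minElement, sum;
import std.math : sqrt;

struct Summary { double minimum, mean, stdDev; }

/// Summarises a sample of timings (sample standard deviation, n-1).
Summary summarize(const(double)[] xs)
{
    immutable double m = xs.sum / xs.length;
    immutable double ss = xs.map!(x => (x - m) * (x - m)).sum;
    return Summary(xs.minElement, m, sqrt(ss / (xs.length - 1)));
}

unittest
{
    // Made-up timings in milliseconds, not measurements from the thread.
    auto steady = summarize([5.0, 6.0, 5.0, 7.0, 5.0, 6.0]);
    auto stalls = summarize([5.0, 6.0, 5.0, 7.0, 5.0, 60_000.0]);

    assert(steady.minimum == stalls.minimum);  // the minimum can't tell them apart
    assert(stalls.mean > 1000 * steady.mean);  // the mean exposes the stall
    assert(stalls.stdDev > steady.stdDev);     // and so does the spread
}
```

Whether the stall belongs in the reported figure depends on the question being asked, which is the disagreement in this part of the thread.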


On the symmetry/asymmetry of the distribution topic:
I realize, as you said, that T never gets faster than
a certain point.
But, depending on the nature of the program under test,
the good utilization of disk I/O, network, memory,
motherboard buses, etc. is what you want inside the
test too, and those come with Gaussian-like noise
which might or might not dominate over T.

A program that avoids that other big noise is a better
program (all else being equal), so I would tend to consider
the whole.

Thanks for the eloquence and insight in your post!
I'll consider adding chi-squared confidence intervals
in the future (and I'm open to more info, or to whether another
distribution might be better).

--jm





Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Juan Manuel Cabo

On Friday, 23 March 2012 at 10:51:37 UTC, Don Clugston wrote:
>> No, it's easy. Student t is in std.mathspecial.
>
> Aargh, I didn't get around to copying it in. But this should do it.
>
> /** Inverse of Student's t distribution
>  *
>  [.]


Great!!! Thank you soo much Don!!!
--jm




Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Juan Manuel Cabo

On Friday, 23 March 2012 at 05:26:54 UTC, Nick Sabalausky wrote:
> Wow, that's just fantastic! Really, this should be a standard
> system tool.
>
> I think this guy would be proud:
> http://zedshaw.com/essays/programmer_stats.html

Thanks for the good vibes!

Hahahhah, that article is so hilarious!
I love the maddox tone.

--jm




Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Manfred Nowak
Andrei Alexandrescu wrote:

> In the limit, taking the minimum over infinitely many
> measurements of X would yield T.

True, if the theoretical variance of the distribution of T is close to
zero. But horribly wrong if T depends on an algorithm that is fast
only under amortized analysis, because the worst-case scenario will be
hidden.

-manfred



Re: avgtime - Small D util for your everyday benchmarking needs

2012-03-23 Thread Nick Sabalausky
Juan Manuel Cabo juanmanuel.c...@gmail.com wrote in message
news:bqrlhcggehbrzyuhz...@forum.dlang.org...
> On Friday, 23 March 2012 at 06:51:48 UTC, James Miller wrote:
>
>> Dude, this is awesome. I tend to just use time, but if I was doing
>> anything more complicated, I'd use this. I would suggest changing the
>> name while you still can. avgtime is not that informative a name given
>> that it now does more than just Average times.
>>
>> --
>> James Miller
>
>> Dude, this is awesome.
>
> Thanks!! I appreciate your feedback!
>
>> I would suggest changing the name while you still can.
>
> Suggestions welcome!!


timestats?