Re: Sampling algorithms for D

2012-04-18 Thread Somedude
Le 18/04/2012 05:15, Joseph Rushton Wakeling a écrit :
 On 13/04/12 10:04, Dmitry Olshansky wrote:
 OK, I'll see what I can do. I'd like to discuss and refine the design a
 bit further before making any pull request -- should I take things over
 to the Phobos mailing list for this ... ?

 I'm no authority but there is this d.D newsgroup which is perfectly
 fine for
 this kind of thing. A lot of nice people just don't (have time to)
 mess with
 unwashed masses in D.learn :)
 
 Tweaked version, an revision of RandomSample from std.random, is
 available from
 https://github.com/WebDrake/RandomSample
 
 Feedback on code details and on tests would be very welcome.  It seems
 about 10% slower than the original code I wrote, which I guess is down
 to factors in the design of the RandomSample struct.

This looks very C++ ish.

I can't comment very much, being myself quite noob, but I think you can
do this, for style as well as performance:
 - add pure nothrow everywhere you can.
 - move some imports within the static if when possible, as close to
their actual usage, and it's better to do:
import core.time:clock_t, clock;

You may also want to use a bit of contract programming for input and output.

Also not sure the _ convention for private attributes is recommended in D.


Re: Sampling algorithms for D

2012-04-18 Thread Joseph Rushton Wakeling

On 18/04/12 17:03, Somedude wrote:

This looks very C++ ish.


Yea, it's a rewrite of code originally in C (not even ++ ... !).


I can't comment very much, being myself quite noob, but I think you can
do this, for style as well as performance:
  - add pure nothrow everywhere you can.


I think there's very little if anything that could be marked pure -- too much of 
the code either calls random number generation, or calls functions that call 
random number generation.


Not sure about nothrow.  I kind of think that this stuff _should_ throw 
exceptions if something goes wrong.



  - move some imports within the static if when possible, as close to
their actual usage, and it's better to do:
import core.time:clock_t, clock;


Useful to know.  I'm not going to do it in this case because that's only 
temporary code to demo the speed differences.  The imports are pretty much a 
copy-paste from std.random as this stuff is aimed for incorporation into 
std.random once it's been tested properly.



You may also want to use a bit of contract programming for input and output.


Ahhh, good point, I'll look into that.


Also not sure the _ convention for private attributes is recommended in D.


I'm copying what's already in place in std.random


Re: Sampling algorithms for D

2012-04-18 Thread Dmitry Olshansky

On 18.04.2012 19:03, Somedude wrote:
[snip]


Tweaked version, an revision of RandomSample from std.random, is
available from
https://github.com/WebDrake/RandomSample

Feedback on code details and on tests would be very welcome.  It seems
about 10% slower than the original code I wrote, which I guess is down
to factors in the design of the RandomSample struct.


This looks very C++ ish.

I can't comment very much, being myself quite noob, but I think you can
do this, for style as well as performance:
  - add pure nothrow everywhere you can.


Yup, unless it's a template as its properties are deduced anyway.


  - move some imports within the static if when possible, as close to
their actual usage, and it's better to do:
import core.time:clock_t, clock;


Dubious.



You may also want to use a bit of contract programming for input and output.




Also not sure the _ convention for private attributes is recommended in D.


It's OK. The c++ is where it's questionable and primarily because it may 
collide with some internal macros of you compiler! (__cplusplus anyone?)


--
Dmitry Olshansky


Re: Sampling algorithms for D

2012-04-18 Thread Joseph Rushton Wakeling

On 18/04/12 18:54, Dmitry Olshansky wrote:

I can't comment very much, being myself quite noob, but I think you can
do this, for style as well as performance:
- add pure nothrow everywhere you can.


Yup, unless it's a template as its properties are deduced anyway.


I really don't see anywhere I can do this -- every function is either inherently 
impure (it calls random number generation), calls an impure function, or relies 
on data external to itself (though not external to the struct).



Also not sure the _ convention for private attributes is recommended in D.


It's OK. The c++ is where it's questionable and primarily because it may collide
with some internal macros of you compiler! (__cplusplus anyone?)


Re the _variables, I just copied the convention already present in std.random, 
which I presumed was how it was supposed to be done ... :-)


Re: Sampling algorithms for D

2012-04-13 Thread Dmitry Olshansky

On 13.04.2012 1:48, Joseph Rushton Wakeling wrote:

On 12/04/12 21:54, bearophile wrote:



for( t=recordsRemaining-1; t=limit; --t)
y2 *= top--/bottom--;


Generally packing mutation of variables inside expressions is quite
bad style. It makes code less easy to understand and translate, and
currently it's not defined, just as in C (it will be defined, but it
will keep being bad style).


OK. There's one other that should be a point of concern, where I have

return currentRecord++;

... in one function; I've added a selectedRecord variable to get round
this.


I believe it's something that reasonable people may disagree on. To me 
it's perfectly easy to see what return x++; does. So is arr[i++] = ..., 
so is arr1[i++] = arr2[j++]. But the downward loops look hairy all the 
time ;)





Also in D there is (I hope it will not be deprecated)
foreach_reverse(), maybe to replace this for().




or use some std.range primitives ( I think iota does a [begin, end) range)
foreach( x ; iota(recordRemaining-1, limit+1, -1)){
y2 *= top--/bottom--;
}

[snip]


--
Dmitry Olshansky


Re: Sampling algorithms for D

2012-04-13 Thread Dmitry Olshansky

On 13.04.2012 2:50, Joseph Rushton Wakeling wrote:

On 12/04/12 23:34, Dmitry Olshansky wrote:

Aye, and in general community does appreciate any enhancements via
pull requests
on github:
https://github.com/D-Programming-Language


OK, I'll see what I can do. I'd like to discuss and refine the design a
bit further before making any pull request -- should I take things over
to the Phobos mailing list for this ... ?



I'm no authority but there is this d.D newsgroup which is perfectly fine 
for this kind of thing. A lot of nice people just don't (have time to) 
mess with unwashed masses in D.learn :)


--
Dmitry Olshansky


Re: Sampling algorithms for D

2012-04-13 Thread Joseph Rushton Wakeling

On 13/04/12 01:44, bearophile wrote:

final size_t select(ref UniformRNG urng)
in {
 assert(_recordsRemaining  0);
 assert(_sampleRemaining  0);
} body {
 ...
}


OK.  I'm confused by these asserts, because if I go beyond what is acceptable by 
calling select() even after I've collected a complete sample, no error is 
thrown.  e.g. if I put in place a function:


void sampling_test_simple(SamplerType, UniformRNG)
 (size_t records, size_t samples, ref UniformRNG urng)
{
  auto s = new SamplerType(records,samples,urng);
  while(s.sampleRemaining  0) {
write(\trecord selected: , s.select(urng), .);
write(\trecords remaining: , s.recordsRemaining, .);
writeln(\tstill to sample: , s.sampleRemaining, .);
  }
  // Let's see if we can bust this ...

  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);
  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);
  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);
  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);
  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);
  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);
  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);
  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);
  writeln(Selecting 1 more just for luck: , s.select(urng), , records 
remaining: , s.recordsRemaining, , still to sample: , s.sampleRemaining);

}

(... see current GitHub code: https://github.com/WebDrake/SampleD )

It doesn't make any difference if I compile with -debug enabled:

   gdmd -debug -oftest sampled.d


Re: Sampling algorithms for D

2012-04-13 Thread bearophile
Dmitry Olshansky:

 I believe it's something that reasonable people may disagree on.
 To me it's perfectly easy to see what return x++; does.

I agree that return x++; is not too bad for a human reader, but code with 
mutation inside expressions (mostly written by other people) has caused me tons 
of troubles (and the semantics of ++ and -- are as much undefined in D as in C, 
still). So I usually kit it with fire as soon as I see it.
I'd like to modify the -- and ++ to make them return void. I think Go is 
designed like that.


 or use some std.range primitives ( I think iota does a [begin, end) range)
 foreach( x ; iota(recordRemaining-1, limit+1, -1)){
   y2 *= top--/bottom--;
 }

Only if you don't care a lot for the performance of that specific loop. (DMD is 
sometimes not even able to optimize foreach() loops as well as for() loops. Go 
figure what it does on iota()).

Bye,
bearophile


Re: Sampling algorithms for D

2012-04-13 Thread bearophile
Joseph Rushton Wakeling:

  final size_t select(ref UniformRNG urng)
  in {
   assert(_recordsRemaining  0);
   assert(_sampleRemaining  0);
  } body {
   ...
  }
 
 OK.  I'm confused by these asserts,

What's confusing? I don't understand. It's contract-based programming, the code 
is essentially the same as before:
http://dlang.org/dbc.html

Bye,
bearophile


Re: Sampling algorithms for D

2012-04-13 Thread Joseph Rushton Wakeling

On 13/04/12 13:10, bearophile wrote:

What's confusing? I don't understand. It's contract-based programming, the code 
is essentially the same as before:
http://dlang.org/dbc.html


No, I understand the principle; I just don't understand why the code is running 
without errors being displayed when I violate the bounds of those asserts.


Re: Sampling algorithms for D

2012-04-13 Thread Ali Çehreli

On 04/13/2012 02:41 AM, Joseph Rushton Wakeling wrote:
 On 13/04/12 01:44, bearophile wrote:
 final size_t select(ref UniformRNG urng)
 in {
 assert(_recordsRemaining 0);
 assert(_sampleRemaining 0);
 } body {
 ...
 }

 OK. I'm confused by these asserts, because if I go beyond what is
 acceptable by calling select() even after I've collected a complete
 sample, no error is thrown.

This is a complicated issue that touches how in contracts are not 
inherited. I think your issue is because the interface does not define 
any in contracts, effectively allowing every call to select(). Please 
start reading here: :)


  http://d.puremagic.com/issues/show_bug.cgi?id=6856

As a general reminder, there is also the invariant() blocks, and 
sometimes enforce() is more appropriate than assert.


Ali




Re: Sampling algorithms for D

2012-04-13 Thread Joseph Rushton Wakeling

On 13/04/12 19:49, Ali Çehreli wrote:

This is a complicated issue that touches how in contracts are not inherited. I
think your issue is because the interface does not define any in contracts,
effectively allowing every call to select(). Please start reading here: :)


You're absolutely right.  I think I will kill the interface; it's not really 
clear that it adds anything to the whole setup.


Thanks very much!


Sampling algorithms for D

2012-04-12 Thread Joseph Rushton Wakeling

Hello all,

My programming experience is mostly in C with some C++, so much of the 
programming style of D is not familiar to me.  With that in mind, I thought I'd 
set up a little pet project to try and learn good D programming style.


What I thought I'd do is implement some clever algorithms for random sampling 
which I've already done in a C based on the GNU Scientific Library.


The D implementation is (in very primitive form) at:
https://github.com/WebDrake/SampleD

... and I'm using GDC 4.6.3 on Ubuntu 12.04 as my compiler.

The basic challenge here is that, given a collection of N records, we want to 
take a sample of size n  N such that the chance of any given record being 
picked is equal.  The algorithms implemented are based on the notion of 
generating a series of random skips across the sequence of records, picking 
the records you land on.


The idea is that one initializes the algorithm by giving the total number of 
records and the sample size required.  The sampler provides a method, select(), 
which when called picks the next entry in the sample, up to the total number 
required; if you keep calling it when the sample is complete, it throws an error.


Consequently, these algorithms don't store the entire sample, or the collection 
of records (which are just assumed to be labelled 0, 1, 2, ..., N-1).  Rather, 
it just outputs, one by one as called upon, the indices of the selected records.


So, introduction away, here are the main questions I came up with in creating 
the code.


(1) Effective handling of random numbers.

The design concept was for the sampler classes to use a specified uniform random 
number generator if provided, otherwise to use rndGen.  This turned out to be 
trickier than anticipated, and I'm not very happy with the current solution.


My first attempt just templatized class methods select() and skip().  However, 
one of the algorithms (class VitterD) requires random initialization as well, 
and trying to template this() ran into this bug: 
http://d.puremagic.com/issues/show_bug.cgi?id=435


I'm really not too happy with turning the whole class into a template just for 
the sake of the random number generator, so can anyone suggest an effective way 
to resolve the problem?


(I may be biased here by C/GSL experience, where I've got used to passing the 
RNG as a pointer, meaning runtime polymorphism.)


(2) init() function?

I was unsure whether it would be good design practice for the sampler classes to 
be throwaway (1 instance = 1 use) or to put in place an init() function that 
would reset it with new record and sample sizes.  Any thoughts?


(3) Uniform random number on (0,1)

The algorithms' specification explicitly refers to uniform random numbers on the 
open interval, which I take to mean (0,1) i.e. excluding zero.  Phobos currently 
provides only a half-open uniform distribution on [a,b).


I've implemented a simple uniform_open() function which should return a uniform 
distribution on (a,b).  The question is, are there any plans for an 
open-interval uniform distribution in Phobos?  Can I rely on this functionality 
being provided in the long term in D's standard library?


(4) Truncation

The code uses C's trunc() function to return the integer part of a floating 
point number.  D doesn't seem to like it very much when I then set a size_t as 
equal to the result of trunc(), and I have to use a type cast.  Is there a more 
elegant D-ish way of achieving the same result?  roundTo() isn't adequate, as it 
rounds to the _nearest_ integer rather than the integer part.


(5) ... any general stylistic comments? :-)

Thanks and best wishes,

-- Joe


Re: Sampling algorithms for D

2012-04-12 Thread Dmitry Olshansky

On 12.04.2012 18:45, Joseph Rushton Wakeling wrote:

Hello all,


Hello there,

[snip]



So, introduction away, here are the main questions I came up with in
creating the code.

(1) Effective handling of random numbers.

The design concept was for the sampler classes to use a specified
uniform random number generator if provided, otherwise to use rndGen.
This turned out to be trickier than anticipated, and I'm not very happy
with the current solution.

My first attempt just templatized class methods select() and skip().
However, one of the algorithms (class VitterD) requires random
initialization as well, and trying to template this() ran into this bug:
http://d.puremagic.com/issues/show_bug.cgi?id=435

I'm really not too happy with turning the whole class into a template
just for the sake of the random number generator, so can anyone suggest
an effective way to resolve the problem?

(I may be biased here by C/GSL experience, where I've got used to
passing the RNG as a pointer, meaning runtime polymorphism.)


I think that if you like function pointer thing, you'd love D's 
delegates and lambdas. If templates are too pervasive it might be better 
just use little bit of dynamic stuff.

e.g.

Mt19937 gen;
double delegate() rng = (){
auto x = gen.front;
gen.popFron();
return x;
}

use it somewhere later on:
writeln(rng(), rng()); // print 2 random numbers



(2) init() function?

I was unsure whether it would be good design practice for the sampler
classes to be throwaway (1 instance = 1 use) or to put in place an
init() function that would reset it with new record and sample sizes.
Any thoughts?


Use structs? They are cheap throwway value types allocated on stack (by 
default).




(3) Uniform random number on (0,1)

The algorithms' specification explicitly refers to uniform random
numbers on the open interval, which I take to mean (0,1) i.e. excluding
zero. Phobos currently provides only a half-open uniform distribution on
[a,b).

I've implemented a simple uniform_open() function which should return a
uniform distribution on (a,b). The question is, are there any plans for
an open-interval uniform distribution in Phobos? Can I rely on this
functionality being provided in the long term in D's standard library?

(4) Truncation

The code uses C's trunc() function to return the integer part of a
floating point number. D doesn't seem to like it very much when I then
set a size_t as equal to the result of trunc(), and I have to use a type
cast. Is there a more elegant D-ish way of achieving the same result?
roundTo() isn't adequate, as it rounds to the _nearest_ integer rather
than the integer part.



It's OK. there are no implicit conversions of FP-- integer in the 
language I think.



(5) ... any general stylistic comments? :-)



Looks fine ;)


Thanks and best wishes,

-- Joe



--
Dmitry Olshansky


Re: Sampling algorithms for D

2012-04-12 Thread James Miller
* Joseph Rushton Wakeling joseph.wakel...@webdrake.net [2012-04-12 16:45:34 
+0200]:

 (3) Uniform random number on (0,1)
 
 The algorithms' specification explicitly refers to uniform random
 numbers on the open interval, which I take to mean (0,1) i.e.
 excluding zero.  Phobos currently provides only a half-open uniform
 distribution on [a,b).
 
 I've implemented a simple uniform_open() function which should
 return a uniform distribution on (a,b).  The question is, are there
 any plans for an open-interval uniform distribution in Phobos?  Can
 I rely on this functionality being provided in the long term in D's
 standard library?

There is support for fully all 4 types of intervals using
std.random.uniform. You just specify the type of interval using a
template parameter.

The default is this: uniform!([))(a,b);
And you want this:   uniform!([])(a,b);

You can also do () and (] to obtain the other two intervals

--
James Miller


Re: Sampling algorithms for D

2012-04-12 Thread Joseph Rushton Wakeling

On 12/04/12 18:26, James Miller wrote:

There is support for fully all 4 types of intervals using
std.random.uniform. You just specify the type of interval using a
template parameter.

The default is this: uniform!([))(a,b);
And you want this:   uniform!([])(a,b);

You can also do () and (] to obtain the other two intervals


Deary me -- I did know that, once upon a time, and I'd completely forgotten. 
Thanks so much for reminding me.


It was () I wanted ... :-)


Re: Sampling algorithms for D

2012-04-12 Thread Joseph Rushton Wakeling

On 12/04/12 16:45, Joseph Rushton Wakeling wrote:

What I thought I'd do is implement some clever algorithms for random sampling
which I've already done in a C based on the GNU Scientific Library.


I noticed that there is already a randomSample class in std.random, which by the 
look of it is using Knuth's Algorithm S.  Would there be any interest in my 
adapting Vitter's Algorithms A and D (as in my library) for Phobos in order to 
make this function more efficient?


Re: Sampling algorithms for D

2012-04-12 Thread bearophile
Joseph Rushton Wakeling:

 https://github.com/WebDrake/SampleD

Some comments on the details of your code:

 import std.c.time;

In D there there is also the syntax:
 import std.c.time: foo, bar, baz;

That tells the person that reads the code what names are used.

--

 interface Sampler(UniformRandomNumberGenerator) {

The UniformRNG name induces shorter lines in your module and I think it's 
easy enough to understand.

--

 double uniform_open(UniformRandomNumberGenerator)
 (double a, double b, ref UniformRandomNumberGenerator urng)
 {

In D function/method names are pascalCased. I also suggest to indent the second 
line, and unless you plan to modifying a and b inside that, to always mark them 
as in (or const, or even immutable):

double uniformOpen(UniformRNG)
  (double a, double b, ref UniformRNG urng)
{

--

do {
x = uniform(a,b,urng);
} while (x==a);

I suggest to put a space after a comma and before and after an operator:

do {
x = uniform(a, b, urng);
} while (x == a);

--

 class Skipper(UniformRandomNumberGenerator) : 
 Sampler!(UniformRandomNumberGenerator) {

When there is only one template argument you are allowed to omit the (), and 
generally in English there is no need for a space before a ::

class Skipper(UniformRNG): Sampler!UniformRNG {

--

size_t S = skip(urng);

sampleRemaining--;
recordsRemaining -= (S+1);
currentRecord += S;

Unless you plan to modify a variable, then on default define it as immutable, 
and where that's not possible, as const, this avoids mistakes later, makes the 
code simpler to understand, and helps the compiler too:

immutable size_t S = skip(urng);

sampleRemaining--;
recordsRemaining -= S + 1;
currentRecord += S;

--

 final size_t records_remaining() { return recordsRemaining; }
 final size_t records_total() { return recordsTotal; }

Generally where possible add const, pure and nothrow attributes to 
functions/methods every where you can (and in some cases @safe too, if you 
want):

 final size_t records_remaining() const pure nothrow { return 
recordsRemaining; }
 final size_t records_total() const pure nothrow { return recordsTotal; }

--

 protected:
 size_t currentRecord = 0;
 size_t recordsRemaining;

All D variables are initialized (unless you ask them to not be initialized, 
using = void), so the first = 0 is not needed (but initializing D floating 
point values to zero is often necessary):

protected:
size_t currentRecord;
size_t recordsRemaining;

--

 y1 = pow( (uniform_open(0.0,1.0, urng) * (cast(double) 
 recordsRemaining) / qu1),
   (1.0/(sampleRemaining - 1)) );

In D there is also the ^^ operator.

--

 for( t=recordsRemaining-1; t=limit; --t)
 y2 *= top--/bottom--;

Generally packing mutation of variables inside expressions is quite bad style. 
It makes code less easy to understand and translate, and currently it's not 
defined, just as in C (it will be defined, but it will keep being bad style).

for (t = recordsRemaining - 1; t = limit; t--) {
y2 *= top / bottom;
top--;
bottom--;
}


Also in D there is (I hope it will not be deprecated) foreach_reverse(), maybe 
to replace this for().

--

 writeln(Hello!  I'm a very simple and stupid implementation of some 
 clever);
 writeln(sampling algorithms.);
 writeln();
 writeln(What I'm going to do first is just take a sample of 5 records 
 out of);
 writeln(100 using each of the algorithms.);
 writeln();

D string literals are multi-line too:

writeln(
Hello!  I'm a very simple and stupid implementation of some clever
sampling algorithms.

What I'm going to do first is just take a sample of 5 records out of
100 using each of the algorithms.
);

--

 sampling_test_simple!(VitterA!Random,Random)(100,5,urng);

Currently your code doesn't work if you want to use a Xorshift generator.

--

 sampling_test_aggregate!(VitterA!Random,Random)(100,5,urng,1000,true);
 writeln();
 sampling_test_aggregate!(VitterD!Random,Random)(100,5,urng,1000,true);
 writeln();

I suggest to define one enum inside the main(), like this, with underscores 
too, to improve readability, and use it in all the function calls:

enum int N = 10_000_000;
samplingTestAggregate!(VitterA!Random, Random)(100, 5, urng, N, true);

Bye,
bearophile


Re: Sampling algorithms for D

2012-04-12 Thread SomeDude
On Thursday, 12 April 2012 at 16:59:31 UTC, Joseph Rushton 
Wakeling wrote:

On 12/04/12 16:45, Joseph Rushton Wakeling wrote:
What I thought I'd do is implement some clever algorithms for 
random sampling
which I've already done in a C based on the GNU Scientific 
Library.


I noticed that there is already a randomSample class in 
std.random, which by the look of it is using Knuth's Algorithm 
S.  Would there be any interest in my adapting Vitter's 
Algorithms A and D (as in my library) for Phobos in order to 
make this function more efficient?


I think there is always interest for better efficiency. You may 
submit your patches to Don Clugston for review.


Re: Sampling algorithms for D

2012-04-12 Thread Dmitry Olshansky

On 12.04.2012 20:59, Joseph Rushton Wakeling wrote:

On 12/04/12 16:45, Joseph Rushton Wakeling wrote:

What I thought I'd do is implement some clever algorithms for random
sampling
which I've already done in a C based on the GNU Scientific Library.


I noticed that there is already a randomSample class in std.random,
which by the look of it is using Knuth's Algorithm S. Would there be any
interest in my adapting Vitter's Algorithms A and D (as in my library)
for Phobos in order to make this function more efficient?


Aye, and in general community does appreciate any enhancements via pull 
requests on github:

https://github.com/D-Programming-Language

--
Dmitry Olshansky


Re: Sampling algorithms for D

2012-04-12 Thread Joseph Rushton Wakeling

On 12/04/12 21:54, bearophile wrote:

Some comments on the details of your code:


Thanks ever so much for the extensive review.


import std.c.time;


In D there there is also the syntax:

import std.c.time: foo, bar, baz;


That tells the person that reads the code what names are used.


Is this advised for all D modules, or just for stuff using the C standard 
library?


Generally where possible add const, pure and nothrow attributes to 
functions/methods every where you can (and in some cases @safe too, if you want):

  final size_t records_remaining() const pure nothrow { return 
recordsRemaining; }
  final size_t records_total() const pure nothrow { return recordsTotal; }


Whether I can do this would depend on whether I put in place a re-initialization 
function to allow the sampler to be reset with new values, no?



All D variables are initialized (unless you ask them to not be initialized, using = 
void), so the first = 0 is not needed (but initializing D floating point values 
to zero is often necessary):


I want to be _certain_ that value is zero. ;-)


In D there is also the ^^ operator.


Ahh, another thing I once knew but had forgotten since the last time I looked 
at D.


 for( t=recordsRemaining-1; t=limit; --t)
 y2 *= top--/bottom--;


Generally packing mutation of variables inside expressions is quite bad style. 
It makes code less easy to understand and translate, and currently it's not 
defined, just as in C (it will be defined, but it will keep being bad style).


OK.  There's one other that should be a point of concern, where I have

   return currentRecord++;

... in one function; I've added a selectedRecord variable to get round this.

  final size_t select(ref UniformRNG urng)
  {
assert(_recordsRemaining  0);
assert(_sampleRemaining  0);

immutable size_t S = skip(urng);
immutable size_t selectedRecord = _currentRecord + S;

_sampleRemaining--;
_recordsRemaining -= (S+1);
_currentRecord += (S+1);

return selectedRecord;
  }


Also in D there is (I hope it will not be deprecated) foreach_reverse(), maybe 
to replace this for().


I'll see what I can do with that.  There may be a better way of expressing this 
anyway -- as is it's pretty much a literal transcription of the Pascal-esque 
pseudo-code in the original research paper describing the algorithm.



 sampling_test_simple!(VitterA!Random,Random)(100,5,urng);


Currently your code doesn't work if you want to use a Xorshift generator.


Ack, enabling that is precisely why I bothered templatizing it in the first 
place.  Can you suggest a fix ... ?  I don't understand why as-is it wouldn't work.



--


 sampling_test_aggregate!(VitterA!Random,Random)(100,5,urng,1000,true);
 writeln();
 sampling_test_aggregate!(VitterD!Random,Random)(100,5,urng,1000,true);
 writeln();


I suggest to define one enum inside the main(), like this, with underscores 
too, to improve readability, and use it in all the function calls:

 enum int N = 10_000_000;
 samplingTestAggregate!(VitterA!Random, Random)(100, 5, urng, N, true);


Fair enough.  I'd forgotten about the underscore notation for large numbers. 
This stuff I'm not too worried about, because it's only a stupid demo thing, not 
the important code ... :-)


Thanks again for the careful review, it's been very useful.

Best wishes,

-- Joe



Re: Sampling algorithms for D

2012-04-12 Thread Joseph Rushton Wakeling

On 12/04/12 21:54, bearophile wrote:

 sampling_test_simple!(VitterA!Random,Random)(100,5,urng);


Currently your code doesn't work if you want to use a Xorshift generator.


Ahhh, I see what you mean now -- the sampler classes are fine, but the way the 
main() function is written means you can't just tweak the RNG.


The original code I tried writing was something like,

sampling_test_simple!(VitterA, Random)(100, 5, urng);

... with the sampler object being created in the function with,

auto s = new SamplerType!UniformRNG(records, samples, urng);

... but the compiler objected to that with the error,

sampled.d:216: Error: template instance
sampling_test_simple!(VitterA,XorshiftEngine!(uint,128,11,8,19))
does not match template declaration
sampling_test_simple(SamplerType,UniformRNG)

Here's the complete (wrong) function:

void sampling_test_simple(SamplerType, UniformRNG)
 (size_t records, size_t samples, ref UniformRNG urng)
{
  auto s = new SamplerType!UniformRNG(records,samples,urng);
  while(s.sampleRemaining  0) {
write(\trecord selected: , s.select(urng), .);
write(\trecords remaining: , s.recordsRemaining, .);
writeln(\tstill to sample: , s.sampleRemaining, .);
  }
}

It's not clear to me why this fails or what I can do that would equate to this.


Re: Sampling algorithms for D

2012-04-12 Thread bearophile

Joseph Rushton Wakeling:


Thanks ever so much for the extensive review.


They are shallow comments, mostly about syntax, etc. So writing 
them requires little time.



Is this advised for all D modules, or just for stuff using the 
C standard library?


I have said there is also the syntax instead it's better to do 
this because some D programmers don't like to do that. You have 
to be aware that syntax exists, and that's it's better to write:

import std.foo: bar, baz;
Than:
import std.foo; // for bar and baz functions

then personally I sometimes like to qualify my imports (writing 
what I import from them), for various reasons, but it also 
increases the amount of work to do when you want to modify the 
code. So it's a personal choice.
Also, the more important is for you to not put bugs in a specific 
module, the more tight/strongly statically typed is better to 
write the code (like Ada language code). If for you the 
correctness of a specific module is not very important, then 
using a lazier coding style is acceptable :-) If you write code 
lot of money/hardware or even people lives will depend on, you 
have to adopt a progressively more and more error-proof style of 
coding, eventually even using software to proof the correctness 
of critical routines.




Whether I can do this would depend on


It depends on many things, including DMD bugs, D language design 
faults, Phobos incomplete coverage of those annotations, and of 
course on the semantics of your code, etc.




I want to be _certain_ that value is zero. ;-)


Then assign on more time, to be really sure :-)
int x = 0;
x = 0; // It must be ZERO, dammit.



   final size_t select(ref UniformRNG urng)
   {
 assert(_recordsRemaining  0);
 assert(_sampleRemaining  0);


Probably it's better to move those asserts in 
preconditions/postconditions or in class/struct invariants.

And I think your code needs some unittests{} too.


Ack, enabling that is precisely why I bothered templatizing it 
in the first place.  Can you suggest a fix ... ?  I don't 
understand why as-is it wouldn't work.


If you try to use a Xorshift the compiler will tell you the 
problems, I think with some work you will be able to fix them and 
create a more flexible module:


void main() {
auto urng = Xorshift(23);


This stuff I'm not too worried about, because it's only a 
stupid demo thing, not the important code ... :-)


In my opinion demo code is a bit like unittests. And unittests 
require the same coding precision as the rest of the code :-)


Bye,
bearophile


Re: Sampling algorithms for D

2012-04-12 Thread Joseph Rushton Wakeling

On 12/04/12 23:34, Dmitry Olshansky wrote:

Aye, and in general community does appreciate any enhancements via pull requests
on github:
https://github.com/D-Programming-Language


OK, I'll see what I can do.  I'd like to discuss and refine the design a bit 
further before making any pull request -- should I take things over to the 
Phobos mailing list for this ... ?




Re: Sampling algorithms for D

2012-04-12 Thread Joseph Rushton Wakeling

On 13/04/12 00:48, bearophile wrote:

final size_t select(ref UniformRNG urng)
{
assert(_recordsRemaining  0);
assert(_sampleRemaining  0);


Probably it's better to move those asserts in preconditions/postconditions or in
class/struct invariants.


Those asserts are deliberately intended for the function.  To see why it 
matters, consider setting up an instance of the class to select a sample of size 
5, and then calling select() 6 times.


Of course, there may be a better way of handling that which I'm not aware of :-)


And I think your code needs some unittests{} too.


Agreed.


Ack, enabling that is precisely why I bothered templatizing it in the first
place. Can you suggest a fix ... ? I don't understand why as-is it wouldn't 
work.


If you try to use a Xorshift the compiler will tell you the problems, I think
with some work you will be able to fix them and create a more flexible module:

void main() {
auto urng = Xorshift(23);


I realized afterwards what you were talking about, but am not sure how to fix 
(cf. my other email).  My first coding attempt was designed to be more flexible 
but the compiler didn't like it. ;-)



In my opinion demo code is a bit like unittests. And unittests require the same
coding precision as the rest of the code :-)


Agree :-)  They are imperfect tests for now, IMO, which is why I'm being lazy...



Re: Sampling algorithms for D

2012-04-12 Thread bearophile
Joseph Rushton Wakeling:

  final size_t select(ref UniformRNG urng)
  {
  assert(_recordsRemaining  0);
  assert(_sampleRemaining  0);
 
  Probably it's better to move those asserts in preconditions/postconditions 
  or in
  class/struct invariants.
 
 Those asserts are deliberately intended for the function.

Then use a function/method precondition:

final size_t select(ref UniformRNG urng)
in {
assert(_recordsRemaining  0);
assert(_sampleRemaining  0);
} body {
...
}

Bye,
bearophile