I was not only thinking of mathematical figures; I was also thinking of
graphics. Some formats may be a zip archive containing XML, for instance.

But we don't need it here, so why should we care about it too much?

I was just digressing from the main subject :-) Having some graphics in the doc would help here and there, though.

I do understand that. I'm trying to explain that "threshold" is in fact completely disconnected from min and max, as the transformation scales the data to [-1,1] like this

   2.0 * (i - min - mu + 0.5) / (max - min + 1)

and only then is the 'threshold' coefficient applied. And if I read the Box-Muller transformation correctly, it generates data with a standard normal distribution within [-threshold, threshold] and then transforms them to the right mean etc.

Yep, the threshold parameter is designed to be largely independent of the actual [min, max] range.

But maybe that's what the first sentence is trying to say? I mean this:

   For a Gaussian distribution, the interval is mapped onto a standard
   normal distribution (the classical bell-shaped Gaussian curve)
   truncated at -threshold on the left and +threshold on the right.

Yep, that looks like it.
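
To make the truncation concrete, here is a minimal Python sketch of the scheme as described in this thread. It is not the pgbench source: the function name `truncated_gaussian_int`, its parameters, and the rejection loop are assumptions chosen for illustration. It draws a standard normal value with the Box-Muller transform, rejects anything outside [-threshold, threshold), and only then maps the result onto [min, max].

```python
import math
import random


def truncated_gaussian_int(lo, hi, threshold, rng):
    """Draw an integer in [lo, hi] from a truncated Gaussian (illustrative sketch).

    Hypothetical helper, not pgbench's actual code.
    """
    while True:
        u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is always defined
        u2 = rng.random()
        # Box-Muller: turn two uniform draws into one standard normal draw.
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        # Truncate by rejection: keep only draws within the threshold.
        if -threshold <= z < threshold:
            break
    # Rescale [-threshold, threshold) to [0, 1); threshold drops out here.
    frac = (z + threshold) / (2.0 * threshold)
    # Map onto the integer range; min() guards against rounding at the edge.
    return min(hi, lo + int((hi - lo + 1) * frac))


if __name__ == "__main__":
    rng = random.Random(0)
    print([truncated_gaussian_int(1, 100, 2.5, rng) for _ in range(10)])
```

Note that min and max only enter in the very last step, which is why the threshold does not need to be tweaked for data sets with different scales.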

I'm asking about this because it wasn't immediately clear to me whether I need to tweak this for data sets with different scales, but apparently not.

Indeed, this is the idea of how the parameter is used.

After reading the docs again I think that's also clear from the last sentence, which relates threshold to 67% and 95%.

Yep.

Anyway, the references to "standard normal distribution" are a bit sloppy - "standard" usually means normal distribution with exactly mu=0 and sigma=1. So it's a bit strange to say

   standard normal distribution, with mean mu defined as (max+min)/2.0

because that's not a standard normal distribution at all. I propose to fix this by removing the "standard".

Hmmm, probably fine if it is both more precise and shorter!

[...]
 CDF2(x) = PHI(2.0 * threshold * ...) / (2.0 * PHI(threshold) - 1.0)

and then the probability of "i" is

 P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)

I agree that defining the shifted/scaled CDF and using it afterwards looks cleaner.

Which is what I meant by simplifying the equation. Not that it'd make it easier to imagine the shape, though ...

Sure. This is the part about providing the "precise" information, what is the actual probability of drawing i depending on the parameters.
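
As a rough illustration of that "precise" information, the following Python sketch evaluates the probability of drawing i. The argument elided in the CDF2 formula above is not reproduced in this message; the sketch assumes it is (x - mu) / (max - min + 1) with mu = (max + min) / 2.0, and the names `phi`, `cdf2` and `prob` are hypothetical.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (mu=0, sigma=1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf2(x, lo, hi, threshold):
    """Shifted/scaled CDF; the argument of phi is an assumption (see lead-in)."""
    mu = (lo + hi) / 2.0
    scaled = 2.0 * threshold * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(threshold) - 1.0)


def prob(i, lo, hi, threshold):
    """P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)."""
    return cdf2(i + 0.5, lo, hi, threshold) - cdf2(i - 0.5, lo, hi, threshold)


def middle_mass(lo, hi, threshold):
    """Probability mass within a relative 0.5/threshold on each side of the mean."""
    mu = (lo + hi) / 2.0
    half_width = (hi - lo + 1) / (2.0 * threshold)
    return cdf2(mu + half_width, lo, hi, threshold) - cdf2(mu - half_width, lo, hi, threshold)


if __name__ == "__main__":
    print(sum(prob(i, 1, 100, 2.5) for i in range(1, 101)))
    print(middle_mass(1, 100, 4.0), middle_mass(1, 1_000_000, 4.0))
```

Under that assumption the probabilities over [min, max] telescope to 1, and the mass in the middle 1.0 / threshold of the range does not depend on min and max.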

Maybe. Another thing is that "middle quarter" and "middle half" seem a bit strange - if you split data into 1/4s there's no middle one (sure, I understand what the sentence is meant to say).

Improvements are welcome!

Ok. I think that the fact that it relies on the Box-Muller transform is
relevant, because there are other methods to generate a Gaussian
distribution, and I would say that there is no reason to have to go to
the source code to check that. But I would not provide further details.
So I'm fine with the current status.

There are alternative methods for almost every non-trivial piece of code, and we generally don't mention that in user docs. Why should we mention it in this case? Why would the user care which particular PRNG was used to generate the numbers? Maybe there really is a reason for that, I don't know.

If it were about security, for instance: because one method has just been announced to be broken and you want to know whether you depend on it.

As a scientist, I like it when fellow scientists who achieved useful things have their name cited :-).

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
