On Jun 29, 2014, at 0:28, Kai Krakow <hurikha...@gmail.com> wrote:
> 
> Matti Nykyri <matti.nyk...@iki.fi> schrieb:
> 
>>> On Jun 27, 2014, at 0:00, Kai Krakow <hurikha...@gmail.com> wrote:
>>> 
>>> Matti Nykyri <matti.nyk...@iki.fi> schrieb:
>>> 
>>>> If you are looking a mathematically perfect solution there is a simple
>>>> one even if your list is not in the power of 2! Take 6 bits at a time of
>>>> the random data. If the result is 62 or 63 you will discard the data and
>>>> get the next 6 bits. This selectively modifies the random data but keeps
>>>> the probabilities in correct balance. Now the probability for index of
>>>> 0-61 is 1/62 because the probability to get 62-63 out of 64 if 0.
>>> 
>>> Why not do just something like this?
>>> 
>>> index = 0;
>>> while (true) {
>>> index = (index + get_6bit_random()) % 62;
>>> output << char_array[index];
>>> }
>>> 
>>> Done, no bits wasted. Should have perfect distribution also. We also
>>> don't have to throw away random data just to stay within unaligned
>>> boundaries. The unalignment is being taken over into the next loop so the
>>> "error" corrects itself over time (it becomes distributed over the whole
>>> set).
>> 
>> Distribution will not be perfect. The same original problem persists.
>> Probability for index 0 to 1 will be 2/64 and for 2 to 61 it will be 1/64.
>> Now the addition changes this so that index 0 to 1 reflects to previous
>> character and not the original index.
>> 
>> The distribution of like 10GB of data should be quite even but not on a
>> small scale. The next char will depend on previous char. It is 100% more
>> likely that the next char is the same or one index above the previous char
>> then any of the other ones in the series. So it is likely that you will
>> have long sets of same character.
> 
> I cannot follow your reasoning here - but I'd like to learn. Actually, I ran 
> this multiple times and never saw long sets of the same character, even no 
> short sets of the same character. The 0 or 1 is always rolled over into the 
> next random addition. I would only get sets of the same character if rand() 
> returned zero multiple times after each other - which wouldn't be really 
> random. ;-)

In your example that isn't true. You will get the same character if 6bit random 
number is 0 or if it is 62! This is what makes the flaw!

You will also get the next character if random number is 1 or 63.

That is why the possibility for 0 and 1 (after modulo 62) is twice as large 
compared to all other values (2-61).

By definition random means that the probability for every value should be the 
same. So if you have 62 options and even distribution of probability the 
probability for each of them is 1/62. 

> Keep in mind: The last index will be reused whenever you'd enter the 
> function - it won't reset to zero. But still that primitive implementation 
> had a flaw: It will tend to select characters beyond the current offset, if 
> it is >= 1/2 into the complete set, otherwise it will prefer selecting 
> characters before the offset.

If you modify the sequence so that if looks random it is pseudo random. 

> In my tests I counted how ofter new_index > index and new_index < index, and 
> it had a clear bias for the first. So I added swapping of the selected index 
> with offset=0 in the set. Now the characters will be swapped and start to 
> distribute that flaw. The distribution, however, didn't change.

Try counting how of often new_index = index and new_index = (index + 1) % 62 
and new_index = (index + 2) % 62. With your algorithm the last one should be 
significantly less then the first two in large sample.

> Of course I'm no mathematician, I don't know how I'd calculate the 
> probabilities for my implementation because it is sort of a recursive 
> function (for get_rand()) when looking at it over time:
> 
> int get_rand() {
>  static int index = 0;
>  return (index = (index + get_6bit_rand()) % 62);
> }
> 
> char get_char() {
>  int index = get_rand();
>  char tmp = chars[index];
>  chars[index] = chars[0];
>  return (chars[0] = tmp);
> }
> 
> However, get_char() should return evenly distributes results.
> 
> What this shows, is, that while distribution is even among the result set, 
> the implementation may still be flawed because results could be predictable 
> for a subset of results. Or in other words: Simply looking at the 
> distribution of results is not an indicator for randomness. I could change 
> get_rand() in the following way:
> 
> int get_rand() {
>  static int index = 0;
>  return (index = (index + 1) % 62);
> }
> 
> Results would be distributed even, but clearly it is not random.
> 
> -- 
> Replies to list only preferred.
> 
> 

Reply via email to