[ https://issues.apache.org/jira/browse/NUTCH-122 ]


    Chris Lambertus deleted comment on NUTCH-122:
    ---------------------------------------

was (Author: JIRAUSER284099):
The product is berry flavor, and it tastes very good

[https://provensupplement.org/|https://provensupplement.org/]

> block numbers need a better random number generator
> ---------------------------------------------------
>
>                 Key: NUTCH-122
>                 URL: https://issues.apache.org/jira/browse/NUTCH-122
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, indexer
>    Affects Versions: 0.8
>            Reporter: Paul Baclace
>            Priority: Major
>         Attachments: MersenneTwister.java, MersenneTwister.java
>
>
> In order to support billions of block numbers, a better PRNG than 
> java.util.Random is needed.  To reach billions with low probability of 
> collision, 64 bit random numbers are needed (the Birthday Problem is the 
> model for the number of bits needed; the result is that twice as many bits 
> are needed as the number of bits to count the expected number of items.) The 
> built-in java.util.Random keeps only 48 bits of state which is only 
> sufficient for 2^24 items.  Using repeated calls to or more than one instance 
> of Random does not increase its total entropy.  
>   Analysis
>     util.Random is a linear congruential generator (LCG) identical to drand48.
>       util.Random keeps 48 bits of state and gangs together 2 consecutive 
> values to return 64 bit values.
>       LCGs suffer from periodicity in the low order bits which would make 
> modular binning less than random.
>         "low order bits" could mean least significant byte.
>       LCGs have periods in the range 106 to 109 when using 32 bit words, a 
> range of poor to fair.
>       seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
>         the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of 
> time.
>       Results of the Birthday Spacings Test look good.
>      References
>       http://www.math.utah.edu/~beebe/java/random/README
>       http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
> Recommended alternative:    MersenneTwister
>       Matsumoto and Nishimura (1998).
>       Longest period of any known generator 2^19937 or about 10^6001.
>       A period that exceeds the number of unique values seems ideal; 
> obviously a shorter period than the number of unique values (like 
> util.Random)  is a problem).
>       Faster than java.util.Random (Random was recent tweaked, however).
>       Excellent result for Diehard Birthday Spacings Test.
>       Can be seeded with up to 624 32 bit integers.
> Doug Cutting wrote on nutch-dev:
> > It just occurred to me that perhaps we could simply use sequential block 
> > numbering. 
> >  All block ids are generated centrally on the namenode.  
> Response from Paul Baclace:
> I'm not sure what the advantage of sequential block numbers would be
> since long period PRNG block numbering does not even need to store
> it's state, just pick a new starting place.
> Sequential block numbering does have the downside that picking a datanode 
> based on (BlockNum % DataNodeCount) would devolve into round robin.  Any 
> attempt to pass the sequence through a hash ends up becoming a random number 
> generator.
> Sequential numbering provides contiguous numbers, but after G.C. that would 
> be lost, so no advantage there.
> When human beings eyeball block numbers, many with small differences are more 
> likely to be misread than many that are totally different.
> If block numbering is sequential, then there is a temptation to use 32 bits 
> instead of 64, but 32 bits leads to wrap-around and uh oh. 
> FSNamesystem uses Random to help pick a target datanode, but it could just 
> use the randomness of block numbers.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to