[ https://issues.apache.org/jira/browse/NUTCH-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481119#comment-17481119 ]
pankaj kumar singh commented on NUTCH-122: ------------------------------------------ It’s actually almost as sweet as real fruit juice [https://leptoconnectstore.org/|https://leptoconnectstore.org/] > block numbers need a better random number generator > --------------------------------------------------- > > Key: NUTCH-122 > URL: https://issues.apache.org/jira/browse/NUTCH-122 > Project: Nutch > Issue Type: Bug > Components: fetcher, indexer > Affects Versions: 0.8 > Reporter: Paul Baclace > Priority: Major > Attachments: MersenneTwister.java, MersenneTwister.java > > > In order to support billions of block numbers, a better PRNG than > java.util.Random is needed. To reach billions with low probability of > collision, 64 bit random numbers are needed (the Birthday Problem is the > model for the number of bits needed; the result is that twice as many bits > are needed as the number of bits to count the expected number of items.) The > built-in java.util.Random keeps only 48 bits of state which is only > sufficient for 2^24 items. Using repeated calls to or more than one instance > of Random does not increase its total entropy. > Analysis > util.Random is a linear congruential generator (LCG) identical to drand48. > util.Random keeps 48 bits of state and gangs together 2 consecutive > values to return 64 bit values. > LCGs suffer from periodicity in the low order bits which would make > modular binning less than random. > "low order bits" could mean least significant byte. > LCGs have periods in the range 106 to 109 when using 32 bit words, a > range of poor to fair. > seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1); > the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of > time. > Results of the Birthday Spacings Test look good. > References > http://www.math.utah.edu/~beebe/java/random/README > http://www.pierssen.com/arcview/upload/esoterica/randomizer.html > Recommended alternative: MersenneTwister > Matsumoto and Nishimura (1998). > Longest period of any known generator 2^19937 or about 10^6001. > A period that exceeds the number of unique values seems ideal; > obviously a shorter period than the number of unique values (like > util.Random) is a problem). > Faster than java.util.Random (Random was recent tweaked, however). > Excellent result for Diehard Birthday Spacings Test. > Can be seeded with up to 624 32 bit integers. > Doug Cutting wrote on nutch-dev: > > It just occurred to me that perhaps we could simply use sequential block > > numbering. > > All block ids are generated centrally on the namenode. > Response from Paul Baclace: > I'm not sure what the advantage of sequential block numbers would be > since long period PRNG block numbering does not even need to store > it's state, just pick a new starting place. > Sequential block numbering does have the downside that picking a datanode > based on (BlockNum % DataNodeCount) would devolve into round robin. Any > attempt to pass the sequence through a hash ends up becoming a random number > generator. > Sequential numbering provides contiguous numbers, but after G.C. that would > be lost, so no advantage there. > When human beings eyeball block numbers, many with small differences are more > likely to be misread than many that are totally different. > If block numbering is sequential, then there is a temptation to use 32 bits > instead of 64, but 32 bits leads to wrap-around and uh oh. > FSNamesystem uses Random to help pick a target datanode, but it could just > use the randomness of block numbers. -- This message was sent by Atlassian Jira (v8.20.1#820001)