On Mon, Feb 9, 2009 at 11:57 AM, Andrew Dalke <da...@dalkescientific.com> wrote: > >> If you do find yourself cursing the speed of the fingerprint >> generation, it might be worth taking a look at using the alternate RNG >> that is applied in the layered fingerprint code (line 229 of >> Fingerprints.cpp). Some profiling I did while implementing those fps >> showed that I was spending a disproportionate (and unecessary) amount >> of time in the RNG seeding process. The adjusted params for the >> layered fingerprint RNG seemed to solve that problem. > > That's another advantage of the linear paths: CDK for example > converts the path to a string (as in "C-C-O-C=N") and uses the > normal Java hashString to get the initial seed to the system > PRNG. But I suspect they could use a better hash algorithm and > skip the PRNG completely. Unlike RDKit, CDK only uses a single > number from the PRNG.
Yeah, it seems clear that it's a win to use hash function if you are only setting a single bit per path. Assuming, of course, that you have an adequate hashing function. In playing with the layered fingerprints I tried simply using the hash to set bits, but I ran into collisions in my testing code, so I abandoned that approach. There's a comment about this in the code. > I saw that RDKit uses the Mersenne Twister as its PRNG, with > special parameters which I don't understand. The comment is > > // The standard parameters (used to create boost::mt19937) > // result in an RNG that's much too computationally intensive > // to seed. > > which I'm a bit cautious of. From the Mersenne Twister Wikipedia page: >> Unlike Blum Blum Shub, the algorithm in its native form is not >> suitable for cryptography. Observing a sufficient number of >> iterates (624 in the case of MT19937) allows one to predict all >> future iterates. >> >> Another issue is that it can take a long time to turn a non-random >> initial state into output that passes randomness tests, due to its >> size. A small lagged Fibonacci generator or linear congruential >> generator gets started much quicker and usually is used to seed the >> Mersenne Twister. If only a few numbers are required and standards >> aren't high it is simpler to use the seed generator. But the >> Mersenne Twister will still work. >> >> > > and I don't know if what you've done reduces the randomness. The > result if that were the case wouldn't be wrong, just less useful. The magic that lies behind PRNGs is something I've never managed to absorb, so I admit that I was flying somewhat blind here. I'm fully willing to believe that the changes I made to the default parameters for the MT dramatically reduced the cycle length. Fortunately in this case, I'm only looking for a single number, so I think it's ok. > I've considered using a crytographic hash, like taking the same seed > as input to SHA-512. That would give up to 16 values (hash size 512 / > 32 bits per value) for the fingerprint, give good random values, and > likely be faster. But that's a thought without anything to back it up. I like the idea. -greg