On Mon, Feb 9, 2009 at 11:57 AM, Andrew Dalke <da...@dalkescientific.com> wrote:
>
>> If you do find yourself cursing the speed of the fingerprint
>> generation, it might be worth taking a look at using the alternate RNG
>> that is applied in the layered fingerprint code (line 229 of
>> Fingerprints.cpp). Some profiling I did while implementing those fps
>> showed that I was spending a disproportionate (and unecessary) amount
>> of time in the RNG seeding process. The adjusted params for the
>> layered fingerprint RNG seemed to solve that problem.
>
> That's another advantage of the linear paths: CDK for example
> converts the path to a string (as in "C-C-O-C=N") and uses the
> normal Java hashString to get the initial seed to the system
> PRNG. But I suspect they could use a better hash algorithm and
> skip the PRNG completely. Unlike RDKit, CDK only uses a single
> number from the PRNG.

Yeah, it seems clear that it's a win to use hash function if you are
only setting a single bit per path. Assuming, of course, that you have
an adequate hashing function. In playing with the layered fingerprints
I tried simply using the hash to set bits, but I ran into collisions
in my testing code, so I abandoned that approach. There's a comment
about this in the code.

> I saw that RDKit uses the Mersenne Twister as its PRNG, with
> special parameters which I don't understand. The comment is
>
>     // The standard parameters (used to create boost::mt19937)
>     // result in an RNG that's much too computationally intensive
>     // to seed.
>
> which I'm a bit cautious of. From the Mersenne Twister Wikipedia page:
>> Unlike Blum Blum Shub, the algorithm in its native form is not
>> suitable for cryptography. Observing a sufficient number of
>> iterates (624 in the case of MT19937) allows one to predict all
>> future iterates.
>>
>> Another issue is that it can take a long time to turn a non-random
>> initial state into output that passes randomness tests, due to its
>> size. A small lagged Fibonacci generator or linear congruential
>> generator gets started much quicker and usually is used to seed the
>> Mersenne Twister. If only a few numbers are required and standards
>> aren't high it is simpler to use the seed generator. But the
>> Mersenne Twister will still work.
>>
>>
>
> and I don't know if what you've done reduces the randomness. The
> result if that were the case wouldn't be wrong, just less useful.

The magic that lies behind PRNGs is something I've never managed to
absorb, so I admit that I was flying somewhat blind here. I'm fully
willing to believe that the changes I made to the default parameters
for the MT dramatically reduced the cycle length. Fortunately in this
case, I'm only looking for a single number, so I think it's ok.

> I've considered using a crytographic hash, like taking the same seed
> as input to SHA-512. That would give up to 16 values (hash size 512 /
> 32 bits per value) for the fingerprint, give good random values, and
> likely be faster. But that's a thought without anything to back it up.

I like the idea.

-greg

Reply via email to