Hi Wojtek,

Our findings are the same.  There is a Morgan fingerprint generator for 64 bits, which Python uses by default.  When you call it, the functions that actually set the bits in the 64 bit fingerprint (MorganFingerprints::getConnectivityInvariants and MorganFingerprints::getFeatureInvariants) will only ever set the first 32 bits.

So you have a 64 bit fingerprint, but only the first 32 bits are ever set.
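
One way to check this from Python is to look at the keys of the sparse fingerprint. The sketch below is only an illustration: it uses the keys from the GetNonzeroElements() output quoted further down in this thread, and the helper names fits_in_bits and values_using_upper_bits are made up for this example (they are not RDKit functions).

```python
# Keys copied from the GetNonzeroElements() output quoted later in this
# thread (Morgan fingerprint, radius 2, for Cn1cnc2n(C)c(=O)n(C)c(=O)c12).
keys = [
    10565946, 348155210, 476388586, 540046244, 553412256,
    864942730, 909857231, 1100037548, 1333761024, 1512818157,
    1981181107, 2030573601, 2041434490, 2092489639, 2246728737,
    2370996728, 2877515035, 2971716993, 2975126068, 3140581776,
    3217380708, 3218693969, 3462333187, 3657471097, 3796970912,
]


def fits_in_bits(values, bits):
    """Return True if every value is representable as an unsigned `bits`-bit int."""
    return all(0 <= v < (1 << bits) for v in values)


def values_using_upper_bits(values, low_bits=32):
    """Return the values that have any bit set above the lowest `low_bits` bits."""
    return [v for v in values if v >> low_bits]


print("largest key needs", max(keys).bit_length(), "bits")
print("all keys fit in 32 bits:", fits_in_bits(keys, 32))
print("keys using bits 32..63:", values_using_upper_bits(keys))
```

With a live fingerprint, the same helpers could be applied to the keys of fp.GetNonzeroElements().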

On 4/22/2021 12:20 PM, Wojtek Plonka wrote:
Hi Gareth,

Your findings are a bit contrary to mine, so the truth must be somewhere in between :) I downloaded the RDKit sources, and some support for 64 bit Morgan fingerprints seems to be there:

Search "getMorganGenerator<std::uint64_t>" (7 hits in 4 files of 661 searched)
  C:\RDKit\rdkit\Code\GraphMol\Fingerprints\catch_tests.cpp (1 hit)
    Line 152: MorganFingerprint::getMorganGenerator<std::uint64_t>(radius));
  C:\RDKit\rdkit\Code\GraphMol\Fingerprints\FingerprintGenerator.cpp (4 hits)
    Line 461:       generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
    Line 497:       generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
    Line 533:       generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
    Line 569:       generator = MorganFingerprint::getMorganGenerator<std::uint64_t>(2);
  C:\RDKit\rdkit\Code\GraphMol\Fingerprints\testFingerprintGenerators.cpp (1 hit)
    Line 2387: MorganFingerprint::getMorganGenerator<std::uint64_t>(2),
  C:\RDKit\rdkit\Code\GraphMol\Fingerprints\Wrap\MorganWrapper.cpp (1 hit)
    Line 78:       "GetMorganGenerator", getMorganGenerator<std::uint64_t>,

I will have a closer look at that.
I don't need to write my code in Python; C++ (with Google's help) is fine too, as long as I can compile it with Linux tools or MSVC Community Edition.
Maybe the 64 bit support is simply not complete, or not exposed to Python yet?
Thanks!

Wojtek Plonka
+48885756652
wojtekplonka.com <http://www.wojtekplonka.com>
fb.com/wojtek.plonka <https://fb.com/wojtek.plonka>



On Thu, Apr 22, 2021 at 7:17 PM Gareth Jones wrote:


    Hi Wojtek,

    From looking at the RDKit code base, my take is that it is
    currently not possible to generate 64 bit Morgan fingerprints.

    The Python fingerprint generator defaults to 64bit:

    In [36]: fp.GetLength()
    Out[36]: 18446744073709551615

    Unfortunately, the C++ Morgan fingerprint generator only ever sets
    the first 32 bits even if the fingerprint is 64bit.  If you look
    at MorganFingerprints::getConnectivityInvariants and
    MorganFingerprints::getFeatureInvariants in
    Code/GraphMol/Fingerprints/FingerprintUtil.cpp the generated
    invariants (that are used to set the fingerprint bits) are
    unsigned 32 bit ints.

    Some RDKit development would be needed to template those functions
    so that they would work with both 32 and 64 bit fingerprints.

    Cheers,

    Gareth
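
For reference, the GetLength() value quoted above is exactly 2**64 - 1, which is what suggests a 64 bit index space. The short sketch below only does that arithmetic, plus the share of the 64 bit space that 32 bit invariants can address; the names REPORTED_LENGTH, index_width_bits and reachable_fraction are illustrative and not part of RDKit.

```python
# Value printed by fp.GetLength() in the IPython session quoted above.
REPORTED_LENGTH = 18446744073709551615


def index_width_bits(length):
    """Number of bits needed to represent `length` as an unsigned integer."""
    return length.bit_length()


# If the invariants are unsigned 32 bit ints, at most 2**32 distinct
# indices can ever be set, out of 2**64 addressable positions.
reachable_fraction = 2**32 / 2**64

print("length == 2**64 - 1:", REPORTED_LENGTH == 2**64 - 1)
print("index width in bits:", index_width_bits(REPORTED_LENGTH))
print("fraction of index space reachable:", reachable_fraction)
```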


    On 4/21/2021 10:10 PM, Wojtek Plonka wrote:
    Hi Gareth,

    Thank you. I do exactly as you wrote. That's not the issue.
    Please note that all the keys in elements are below 2**32, so
    the main hash function used is definitely 32 bit.

    According to
    https://www.rdkit.org/docs/source/rdkit.Chem.rdFingerprintGenerator.html
    both class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator32
    and class rdkit.Chem.rdFingerprintGenerator.FingerprintGenerator64
    exist.

    However with my limited knowledge I don't know how to access the
    64 bit version and that is my problem.
    Kindest regards,

    Wojtek

    Wojtek Plonka
    +48885756652
    wojtekplonka.com <http://www.wojtekplonka.com>
    fb.com/wojtek.plonka <https://fb.com/wojtek.plonka>



    On Thu, Apr 22, 2021 at 1:27 AM Gareth Jones wrote:

        Wojtek,

        You can use GetNonzeroElements() to convert the sparse
        fingerprint to a Python dict mapping each hash to its count.

        Cheers,
        Gareth


        In [7]: mol = Chem.MolFromSmiles('Cn1cnc2n(C)c(=O)n(C)c(=O)c12')

        In [8]: fp = AllChem.GetMorganFingerprint(mol, 2)

        In [9]: elements = fp.GetNonzeroElements()

        In [10]: elements
        Out[10]:
        {10565946: 2,
         348155210: 1,
         476388586: 1,
         540046244: 1,
         553412256: 1,
         864942730: 2,
         909857231: 1,
         1100037548: 1,
         1333761024: 1,
         1512818157: 1,
         1981181107: 1,
         2030573601: 1,
         2041434490: 1,
         2092489639: 3,
         2246728737: 3,
         2370996728: 1,
         2877515035: 1,
         2971716993: 1,
         2975126068: 2,
         3140581776: 1,
         3217380708: 4,
         3218693969: 1,
         3462333187: 1,
         3657471097: 3,
         3796970912: 1}


        On 4/21/2021 5:44 AM, Wojtek Plonka wrote:
        Dear All

        Do any of you have a working example of getting Morgan
        fingerprints as a sparse (non-hashed) bit vector in the 64
        bit version using Python?
        I'm looking into the issue of collisions on the "main hash"
        on large (100+ million molecules) data sets.
        Thank you very much!
        Kindest regards,

        Wojtek Plonka
        +48885756652
        wojtekplonka.com <http://www.wojtekplonka.com>
        fb.com/wojtek.plonka <https://fb.com/wojtek.plonka>
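
To give a rough sense of why hash width matters for the collision question above, here is a birthday-style estimate. It is a sketch under strong assumptions: the hash is treated as uniformly random, and the count of distinct atom environments (10**6 here) is a hypothetical figure, not something measured on any data set. The function names are illustrative.

```python
import math


def expected_colliding_pairs(n_distinct, bits):
    """Expected number of colliding pairs when n_distinct items are hashed
    uniformly at random into 2**bits buckets (birthday approximation)."""
    return n_distinct * (n_distinct - 1) / 2 / 2**bits


def any_collision_probability(n_distinct, bits):
    """Approximate probability of at least one collision: 1 - exp(-expected)."""
    return -math.expm1(-expected_colliding_pairs(n_distinct, bits))


# Hypothetical number of distinct atom environments across a large corpus.
n_environments = 10**6

for bits in (32, 64):
    print(
        bits, "bit hash:",
        "expected colliding pairs =", expected_colliding_pairs(n_environments, bits),
        "P(any collision) =", any_collision_probability(n_environments, bits),
    )
```

The real number of distinct environments in a 100+ million molecule set would have to be counted; the estimate only shows how quickly a 32 bit space saturates compared with a 64 bit one.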



        _______________________________________________
        Rdkit-discuss mailing list
        [email protected]  
<mailto:[email protected]>
        https://lists.sourceforge.net/lists/listinfo/rdkit-discuss  
<https://lists.sourceforge.net/lists/listinfo/rdkit-discuss>






