Hello,
based on this article:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1
I have been trying to make what they call a 'database fingerprint'.
The first step seems to require obtaining the frequencies of each fingerprint
bit in a database of molecules.
To do that, I calculated the fingerprints of a list of molecules (much larger
than the one below; this is just an example):
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

ms = [Chem.MolFromSmiles(s) for s in ['c1ccccc1', 'CCC', 'CCCO']]
fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts=False) for m in ms]
My first attempt to obtain the database fingerprint was to loop through the
fps and sum them (+=), as that is reported to be an allowed operation for these
fingerprints.
This worked, but was very slow.
My next attempt was to convert each fingerprint to a dictionary, and build the
dictionary corresponding to the database fingerprint:
database_fp_new = dict()
for fp in fps:
    # count each bit once per molecule in which it is set
    for fpbit in fp.GetNonzeroElements():
        if fpbit in database_fp_new:
            database_fp_new[fpbit] += 1
        else:
            database_fp_new[fpbit] = 1
This worked, too, gave the same result as the '+=' approach, and was much
faster:
{98513984: 1,
2763854213: 1,
3218693969: 1,
3741631696: 1,
2068133184: 1,
2245384272: 2,
2246728737: 2,
3542456614: 2,
864662311: 1,
1173125914: 1,
1365892349: 1,
1535166686: 1,
4023654873: 1}
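The same per-bit counting can also be written with collections.Counter. This is
only a minimal sketch: plain dictionaries with made-up bit ids stand in for the
output of fp.GetNonzeroElements(), so that it runs without RDKit, and the
frequency step (count divided by the number of molecules) is my own reading of
"frequencies", so it is an assumption.

```python
from collections import Counter

# Stand-ins for fp.GetNonzeroElements(): one {bit_id: count} dict per molecule.
# The bit ids are invented for illustration; they are not real Morgan bits.
nonzero_elements = [
    {101: 6, 202: 6},
    {303: 2, 404: 1, 505: 2},
    {303: 1, 404: 1, 505: 1, 606: 1},
]

database_counts = Counter()
for elements in nonzero_elements:
    # Passing only the keys adds 1 per bit, i.e. one count per molecule
    # in which the bit is set, regardless of the count inside the molecule.
    database_counts.update(elements.keys())

# Assumed definition of frequency: fraction of molecules that set the bit.
n_molecules = len(nonzero_elements)
frequencies = {bit: n / n_molecules for bit, n in database_counts.items()}
```

With real fingerprints, the inner dictionaries would presumably be replaced by
fp.GetNonzeroElements() for each fp in fps.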
However, that leaves me with a dictionary, whereas I need a fingerprint, because
I want to do operations like similarity calculations (e.g.
https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity
).
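To make the goal concrete, here is a minimal sketch of the kind of similarity I
mean, computed directly on dictionaries. It uses the usual count-based Tanimoto
definition (sum of per-bit minima over sum of per-bit maxima); the function name
tanimoto_counts and the input values are hypothetical, and I have not checked
that it agrees numerically with RDKit in every case.

```python
def tanimoto_counts(a, b):
    """Count-based Tanimoto: sum of per-bit minima over sum of per-bit maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Two empty fingerprints have no bits to compare; return 0.0 by convention.
    return numerator / denominator if denominator else 0.0


# Hypothetical inputs: a database count dictionary and one molecule's bits.
database_fp = {101: 1, 303: 2, 404: 2, 505: 2, 606: 1}
molecule_fp = {303: 1, 404: 1, 505: 1}

similarity = tanimoto_counts(database_fp, molecule_fp)
```

This avoids the conversion altogether, but it would be far slower than a bulk
RDKit call on a large database, which is why I would still prefer a real
fingerprint object.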
Would anyone be able to suggest if and how the dictionary can be turned back
into a fingerprint, or perhaps advise how to make the database fingerprint in a
different way, if the one I figured out is not optimal?
Thank you