On 09/09/2020 09:35, Lewis Martin wrote:
Hi RDKit,

Looking for advice on an RDKit-adjacent problem please. Ultimately I'd
like to fit an approximate-nearest-neighbors index on a dataset of 100
million ligands, featurized by Morgan fingerprint. The text file of
the SMILES is ~6 GB, but memory use blows out well beyond that when it
is loaded with pandas.read_csv() or f.readlines(), due to memory
allocation overhead.

It would take about 45 hours to process the file serially (i.e. read
line, create mol, fingerprint, convert to np.array or sparse array) in
a streaming manner, so now I'd like to parallelize the job with joblib,
which would multiply the memory requirements by the number of
processes running at a time.
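
One way to keep memory bounded while parallelizing is to stream the file in fixed-size chunks rather than loading it whole. The sketch below is only an illustration: `iter_chunks`, the chunk size, and the in-memory stand-in file are hypothetical names and values, not anything from the thread.

```python
import io
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size stripped, non-empty lines."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    while True:
        chunk = list(islice(non_empty, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for the real SMILES file (hypothetical contents).
fake_file = io.StringIO("CCO\nc1ccccc1\nCC(=O)O\n\nCCN\nCOC\n")

# Only one chunk is materialized at a time; the rest stays on disk.
chunk_sizes = [len(chunk) for chunk in iter_chunks(fake_file, chunk_size=2)]
print(chunk_sizes)
```

Each chunk could then be handed to a worker. With joblib, the `pre_dispatch` argument of `Parallel` limits how many tasks are pulled from the generator ahead of time, so peak memory should scale with roughly chunk size times tasks in flight rather than with the whole file.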

So: what is the smallest possible representation for a binary
fingerprint? Using `sys.getsizeof` on an
rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is
96 bytes, but I'm not sure whether to believe that since, as with
csr_matrix, the result is only accurate if the object reports the size
of its underlying data. Here's an example demonstrating this:

from sys import getsizeof
from scipy import sparse
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
mol = Chem.MolFromSmiles(smi)
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)

Obviously, if you ask for fpSize = 512, the smallest uncompressed
representation of the fingerprint will be 512 bits (64 bytes).

100M such fingerprints, if no overhead is added by the programming language,
would take about 6.4 GB of RAM.
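
As a rough illustration of that lower bound, here is a minimal sketch that packs on-bit indices into exactly 64 bytes and stores the results back to back in one flat buffer. It uses only the standard library; the helper names and the example indices are made up, and in practice the indices would come from something like `fp.GetOnBits()`.

```python
FP_BITS = 512
FP_BYTES = FP_BITS // 8  # 64 bytes per fingerprint


def pack_on_bits(on_bits, n_bits=FP_BITS):
    """Pack a collection of on-bit indices into n_bits // 8 raw bytes."""
    value = 0
    for idx in on_bits:
        value |= 1 << idx
    return value.to_bytes(n_bits // 8, "little")


def unpack_on_bits(packed):
    """Recover the sorted on-bit indices from the packed bytes."""
    value = int.from_bytes(packed, "little")
    return [i for i in range(len(packed) * 8) if (value >> i) & 1]


# Hypothetical on-bit indices standing in for a real Morgan fingerprint.
example_bits = [3, 17, 64, 200, 511]
packed = pack_on_bits(example_bits)

# One contiguous buffer: fingerprint i lives at bytes [i * 64, (i + 1) * 64).
n_fps = 1000
store = bytearray(n_fps * FP_BYTES)
store[0:FP_BYTES] = packed

# Back-of-envelope size for the full dataset, in GB (10**9 bytes).
total_gb = 100_000_000 * FP_BYTES / 1e9
print(len(packed), total_gb)
```

A flat buffer like this avoids the per-object header that every separate Python object carries, which is where most of the "overhead added by the programming language" would come from.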

But the really fun part starts when you want to search quickly across that many molecules. :) There are many published methods, some open-source software (like Dalke's chemfp), and even some commercial tools that claim to be lightning fast (even reaching real-time search speed!).

e.g.
https://chemaxon.com/products/madfast
https://www.nextmovesoftware.com/arthor.html
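
For orientation only, this is what a brute-force baseline looks like: a linear Tanimoto scan over packed fingerprints. It is not how the tools above work (they are far more sophisticated), and every name and value in it is hypothetical.

```python
import heapq

FP_BYTES = 64  # 512-bit fingerprints


def pack(on_bits, n_bytes=FP_BYTES):
    """Pack on-bit indices into n_bytes raw bytes."""
    value = 0
    for idx in on_bits:
        value |= 1 << idx
    return value.to_bytes(n_bytes, "little")


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two equally sized packed fingerprints."""
    ia = int.from_bytes(a, "little")
    ib = int.from_bytes(b, "little")
    union = popcount(ia | ib)
    return popcount(ia & ib) / union if union else 0.0


def top_k(query, store, k, fp_bytes=FP_BYTES):
    """Linear scan: return the k best (score, index) pairs, best first."""
    n = len(store) // fp_bytes
    scored = (
        (tanimoto(query, bytes(store[i * fp_bytes:(i + 1) * fp_bytes])), i)
        for i in range(n)
    )
    return heapq.nlargest(k, scored)


# Three made-up fingerprints stored back to back.
store = bytearray()
for bits in ([1, 2, 3, 4], [1, 2, 9, 10], [100, 200]):
    store += pack(bits)

hits = top_k(pack([1, 2, 3, 4]), store, k=2)
print(hits)
```

A scan like this touches every fingerprint for every query, so at 100 million molecules it mainly serves as a correctness check for whatever approximate index is built on top.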

Regards,
F.

fp = gen_mo.GetFingerprint(mol)
sparse_fp = sparse.csr_matrix(fp)

# getsizeof reports only what the object itself declares as its size
print('ExplicitBitVect object size:', getsizeof(fp))
print('Sparse matrix size (naive):', getsizeof(sparse_fp))
# summing the three underlying arrays gives the actual CSR footprint
print('Sparse matrix size (real):',
      sparse_fp.data.nbytes + sparse_fp.indices.nbytes + sparse_fp.indptr.nbytes)
print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))


ExplicitBitVect object size: 96
Sparse matrix size (naive): 64
Sparse matrix size (real): 476
fp.ToBinary size: 85
fp.ToBase64 size: 121

Note that even the smallest of these multiplied by 100 million would
be about 8 GB, still larger than the text file storing the SMILES
strings - not sure if that is to be expected or not?
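
Part of the gap is likely per-object overhead: `sys.getsizeof` on a `bytes` object counts a fixed header on top of the payload. The sketch below, which assumes CPython and uses made-up sizes, separates the two and compares many small objects against one flat buffer.

```python
import sys

FP_BYTES = 64  # payload of one 512-bit fingerprint

# Fixed cost of any bytes object, independent of its contents.
header = sys.getsizeof(b"")
one_fp = bytes(FP_BYTES)
payload = sys.getsizeof(one_fp) - header

# 1000 separate bytes objects versus one contiguous bytearray.
n = 1000
as_objects = sum(sys.getsizeof(bytes(FP_BYTES)) for _ in range(n))
as_buffer = sys.getsizeof(bytearray(n * FP_BYTES))

print('header bytes per object:', header)
print('payload bytes:', payload)
print('separate objects:', as_objects, 'flat buffer:', as_buffer)
```

The exact header size is an implementation detail, but it is paid once per object, so storing 100 million fingerprints as individual Python objects costs noticeably more than the raw bits alone.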

Thanks for your time!
Lewis
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

