Hi RDKit,

Looking for advice on an RDKit-adjacent problem, please. Ultimately I'd like to fit an approximate nearest-neighbor index on a dataset of 100 million ligands, featurized by Morgan fingerprint. The text file of SMILES is ~6 GB, but memory use blows up when it is loaded with pandas.read_csv() or f.readlines(), due to what look like odd memory allocation issues.
It would take about 45 hours to process the file serially in a streaming manner (read a line, create the mol, compute the fingerprint, convert to a NumPy or sparse array), so I'd now like to parallelize the job with joblib. That would multiply the memory requirement by the number of processes running at a time.

So: what is the smallest possible representation for a binary fingerprint?

Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is 96 bytes, but I'm not sure whether to believe that: as with csr_matrix, getsizeof is only accurate if the object reports the size of its underlying data. Here's an example demonstrating this:

    from sys import getsizeof

    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator
    from scipy import sparse

    smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
    mol = Chem.MolFromSmiles(smi)
    gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
    fp = gen_mo.GetFingerprint(mol)
    sparse_fp = sparse.csr_matrix(fp)

    # getsizeof only counts what each object reports about itself
    print('ExplicitBitVect object size:', getsizeof(fp))
    print('Sparse matrix size (naive):', getsizeof(sparse_fp))
    # summing the underlying buffers gives the real CSR footprint
    print('Sparse matrix size (real):', sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
    print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
    print('fp.ToBase64 size:', getsizeof(fp.ToBase64()))

Output:

    ExplicitBitVect object size: 96
    Sparse matrix size (naive): 64
    Sparse matrix size (real): 476
    fp.ToBinary size: 85
    fp.ToBase64 size: 121

Note that even the smallest of these multiplied by 100 million would be about 8 GB, still larger than the text file storing the SMILES strings. I'm not sure whether that is to be expected or not.

Thanks for your time!
Lewis
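
P.S. For reference, the lower bound I have in mind is one bit per fingerprint bit, i.e. 64 bytes for a 512-bit fingerprint, stored in one contiguous buffer so there is no per-object Python overhead. Below is a minimal sketch of that idea using only the standard library. The helper names (pack_on_bits, unpack_on_bits, tanimoto) and the example bit indices are made up for illustration; in practice I'd assume the indices come from something like list(fp.GetOnBits()). This is not an RDKit API, just a way to make the arithmetic concrete.

```python
FP_SIZE = 512                 # bits per fingerprint, matching fpSize=512 above
BYTES_PER_FP = FP_SIZE // 8   # 64 bytes when every bit is stored as one bit


def pack_on_bits(on_bits, fp_size=FP_SIZE):
    """Pack a collection of on-bit indices into fp_size // 8 raw bytes."""
    value = 0
    for idx in on_bits:
        if not 0 <= idx < fp_size:
            raise ValueError(f"bit index {idx} out of range for fp_size={fp_size}")
        value |= 1 << idx
    return value.to_bytes(fp_size // 8, "little")


def unpack_on_bits(packed):
    """Recover the sorted on-bit indices from a packed fingerprint."""
    value = int.from_bytes(packed, "little")
    return [i for i in range(len(packed) * 8) if (value >> i) & 1]


def tanimoto(packed_a, packed_b):
    """Tanimoto similarity computed directly on the packed bytes."""
    a = int.from_bytes(packed_a, "little")
    b = int.from_bytes(packed_b, "little")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


# Hypothetical on-bit indices standing in for list(fp.GetOnBits())
example_bits = [1, 33, 80, 294, 511]
packed = pack_on_bits(example_bits)

# One preallocated buffer holds many fingerprints back to back
n_fps = 1000
store = bytearray(n_fps * BYTES_PER_FP)
store[0:BYTES_PER_FP] = packed

# Raw payload for 100 million packed fingerprints, in GB
payload_gb = 100_000_000 * BYTES_PER_FP / 1e9
print(len(packed), payload_gb)
```

By this estimate the raw payload for 100 million 512-bit fingerprints is about 6.4 GB, roughly the size of the SMILES file itself, so anything much smaller would need compression or a shorter fingerprint.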
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss