After a couple of hours' work comparing molecules with RDKit, I
noticed that the similarity calculation apparently struggles with
structures that consist of a repeated substructure, such as linear alkyl
chains (azelaic acid C(CCCC(=O)O)CCCC(=O)O being the worst offender) or
molecules like gentian violet
CN(C)C1=CC=C(C=C1)C(=C2C=CC(=[N+](C)C)C=C2)C3=CC=C(C=C3)N(C)C and
malachite green CN(C)c1ccc(cc1)C(c2ccccc2)=C3C=CC(C=C3)=[N+](C)C. In
those cases the Tanimoto similarity of the Daylight-like fingerprints is
1.0, although malachite green, for example, lacks an entire dimethylamino
group compared with gentian violet. I assume this is because only unique
topological paths are hashed, so repeated occurrences of the same path
are not counted. Similarity using atom pair descriptors (SparseIntVect)
returns more reasonable results. Maybe someone can shed light on this
matter.
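
To illustrate the effect I suspect, here is a minimal toy sketch in plain
Python. It is not RDKit's fingerprinting code: the helper names, the
string representation of a carbon chain, and the maximum path length of 7
atoms are all assumptions made for illustration. It compares a Tanimoto
score over the set of unique paths with one over path counts.

```python
from collections import Counter

# Toy model (assumption): a linear chain is a string of atom symbols,
# and a "path" is any contiguous sub-chain of 1..max_len atoms.
def linear_paths(chain, max_len=7):
    return [chain[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(chain) - n + 1)]

def binary_tanimoto(a, b):
    # Only presence/absence of each unique path matters.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def count_tanimoto(a, b):
    # Count-based generalization: sum of minima over sum of maxima.
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    shared = sum(min(ca[k], cb[k]) for k in keys)
    total = sum(max(ca[k], cb[k]) for k in keys)
    return shared / total

short_chain = "C" * 9    # hypothetical 9-carbon chain
long_chain = "C" * 12    # hypothetical 12-carbon chain

binary_score = binary_tanimoto(linear_paths(short_chain), linear_paths(long_chain))
count_score = count_tanimoto(linear_paths(short_chain), linear_paths(long_chain))
print(binary_score, count_score)
```

In this toy model both chains contain the same unique paths, so the
set-based score cannot tell them apart, while the count-based score can.
In RDKit terms the analogous comparison would presumably be between the
bit vector from Chem.RDKFingerprint and the count vector from
rdMolDescriptors.GetAtomPairFingerprint.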

Adrian
