After a couple of hours' work comparing molecules with RDKit, I noticed that the algorithm apparently struggles with structures built from a repeating substructure, such as linear alkyl chains (azelaic acid, C(CCCC(=O)O)CCCC(=O)O, being the worst offender) or molecules like gentian violet, CN(C)C1=CC=C(C=C1)C(=C2C=CC(=[N+](C)C)C=C2)C3=CC=C(C=C3)N(C)C, and malachite green, CN(C)c1ccc(cc1)C(c2ccccc2)=C3C=CC(C=C3)=[N+](C)C. In those cases the Tanimoto similarity of the Daylight-type fingerprints is 1.0, although malachite green, for example, lacks an entire dimethylamino group. I assume this is because only the unique topological paths are hashed, so repeated occurrences of the same path do not change the fingerprint; similarity based on atom pair descriptors (SparseIntVect) returns more reasonable results. Maybe someone can shed light on this matter.
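
To make the suspected cause concrete, here is a toy sketch in plain Python. It is not RDKit's fingerprinting code: it models an unbranched chain as a string of atom symbols, skips hashing altogether, and the cap of 7 atoms per path is an arbitrary choice for the illustration. The helper names (linear_paths, set_tanimoto, count_tanimoto) are made up for this example. It only shows that a presence/absence view of the paths cannot tell two chains of different length apart, while a count-based view can.

```python
from collections import Counter


def linear_paths(atoms, max_len=7):
    """Count every contiguous path of 1..max_len atoms in an unbranched chain.

    `atoms` is a string of atom symbols, e.g. "CCCC" (toy model, no bond orders).
    """
    counts = Counter()
    for start in range(len(atoms)):
        for length in range(1, max_len + 1):
            if start + length > len(atoms):
                break
            path = atoms[start:start + length]
            # A path and its reverse are the same path; keep one canonical form.
            counts[min(path, path[::-1])] += 1
    return counts


def set_tanimoto(a, b):
    """Tanimoto on presence/absence of paths (analogous to a bit vector)."""
    keys_a, keys_b = set(a), set(b)
    return len(keys_a & keys_b) / len(keys_a | keys_b)


def count_tanimoto(a, b):
    """Tanimoto on path counts (analogous to a sparse count vector)."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total


# Two carbon chains of different length (hydrogens ignored).
nonane = linear_paths("C" * 9)
dodecane = linear_paths("C" * 12)

print("unique paths only:", set_tanimoto(nonane, dodecane))
print("with path counts: ", count_tanimoto(nonane, dodecane))
```

In this toy model both chains contain exactly the same set of unique paths, so the presence/absence similarity is 1.0, whereas the count-based value drops below 1 because the longer chain contains each path more often. That would be consistent with the atom pair / SparseIntVect results looking more reasonable, since those keep counts.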
Adrian

