Hi Greg,

Thanks for looking into this.

I think, but of course cannot prove, that Rogers chose to include only chirality that can be disambiguated within the fragment itself. This ensures that each fingerprint feature describes a real sub-fragment of the molecule, independent of any information outside its radius. If such a fragment is achiral, even though it is derived from a chiral molecule, how can the chirality information be set in a way that ensures consistency and alignment independence? In your current implementation, how does the chirality information get set when the substituents cannot be disambiguated within the Morgan radius?

On the question of molecules that are truly different but cannot be distinguished by Morgan fingerprints: that effect kicks in at a certain alkyl chain length anyway. From CCCCCCO on, the chain homologues can no longer be distinguished by Morgan-2 (without counts, that is), so I do not find it surprising that side chains lying outside the radius are not distinguished in fragments. The answer is that you sometimes need to increase the radius to disambiguate longer repeats, just as in genomic sequence assembly, where longer reads are needed to assemble repeat-rich genomes.

I agree with your idea to make the original implementation a flag rather than changing the default, even if only for inter-version compatibility reasons.

Best regards
Ansgar

Ansgar Schuffenhauer
Senior Investigator I
T +41 79 608 9063
ansgar.schuffenha...@novartis.com
Novartis Pharma AG
NIBR

From: Greg Landrum <greg.land...@gmail.com>
Sent: Monday, 2 December 2019 10:25
To: Schuffenhauer, Ansgar <ansgar.schuffenha...@novartis.com>
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] FW: rdkit Chiral Moragn Fingerprint unexpected behaviour

This is a really good question.

I must admit that I find the ECFP behavior as published to be somewhat weird. It doesn't make sense to me that the chiral versions of the Morgan-2 fingerprints for CCC[CH](C)CCO, CCC[C@@H](C)CCO, and CCC[C@H](C)CCO would be identical. However, as you point out, we have tried to reproduce the details of the published algorithm, and the way chirality is currently handled does not do that.

I don't think "fixing" the current behavior would be a great idea, but it would make sense to add an additional option to use the original chirality rules (along with some documentation explaining them).

Here's the GitHub issue: https://github.com/rdkit/rdkit/issues/2818

I didn't notice this discrepancy when I did the original comparison of similarities between RDKit's MorganFP and Pipeline Pilot's ECFP implementation many years ago, because I ran both of them without chirality turned on.

Thanks for pointing this out, Ansgar!
-greg

On Mon, Nov 25, 2019 at 1:09 PM Schuffenhauer, Ansgar <ansgar.schuffenha...@novartis.com> wrote:

Dear all,

I have observed some unexpected behaviour with the chiral version of the Morgan fingerprints in RDKit.

In the Rogers paper (http://doi.org/10.1021/ci100050t) I find:

“If the atom is a possible stereoatom but is not yet disambiguated, and all attachment atoms have different identifiers, then the atom is marked as disambiguated, and a stereochemical flag is appended to the array, depending on the marked stereochemistry. (Step 4 is only performed if stereochemical fingerprints are requested.)”

In this respect I believe that the RDKit implementation does not exactly follow the ECFP paper. As a test I calculated the pairwise similarity between the enantiomers of butan-2-ol, hexan-3-ol, octan-4-ol, decan-5-ol, ...
Eventually both alkyl chains should grow too long to be disambiguated within the fingerprint radius. The chirality at the stereocentre should then no longer be recognised, and the similarity between the fingerprints of the enantiomers should become 1.0 once the chains outgrow the fingerprint radius. Strangely, that doesn't happen: as can be seen in the attached notebook, all fingerprints with radius > 0 always give similarities < 1.0 for the enantiomer pairs. This contrasts with the Pipeline Pilot implementation, where the similarity of the enantiomers does indeed become 1.0 once the chains outgrow the fingerprint radius. For your reference I have also added fingerprints and similarity values obtained at different ECFP diameters.

Is this difference in behaviour intentional? Until now I had always assumed that RDKit Morgan and Pipeline Pilot ECFP would give identical similarity results.

With best regards
Ansgar Schuffenhauer

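To make the expectation in the original message concrete, here is a minimal pure-Python sketch of one reading of the quoted stereo rule. It is not RDKit or Pipeline Pilot code: the helper names (`secondary_alcohol`, `first_flag_iteration`), the atom invariants, and the use of nested tuples instead of hashes are all assumptions made for illustration. For each alcohol in the series it reports the first iteration at which all heavy-atom neighbours of the stereocentre carry different identifiers, which under this reading is when a stereo flag could first be appended.

```python
# Toy model only: hypothetical helpers, not the RDKit or Pipeline Pilot API.
VALENCE = {"C": 4, "O": 2}


def secondary_alcohol(left, right):
    """Build an alkan-ol whose carbinol carbon carries two unbranched chains.

    Returns (elements, adjacency, centre); atom 0 is the carbinol carbon
    and atom 1 is the hydroxyl oxygen. All bonds are single bonds.
    """
    elements = ["C", "O"]
    adjacency = {0: [1], 1: [0]}
    for length in (left, right):
        prev = 0
        for _ in range(length):
            new = len(elements)
            elements.append("C")
            adjacency[new] = [prev]
            adjacency[prev].append(new)
            prev = new
    return elements, adjacency, 0


def first_flag_iteration(elements, adjacency, centre, max_iter=6):
    """First iteration at which all neighbours of `centre` differ, else None.

    Identifiers are nested tuples rather than hashes; bond orders are
    ignored because every bond in these molecules is single.
    """
    # Iteration-0 identifiers: (element, heavy-atom degree, implicit H count).
    ids = [
        (el, len(adjacency[i]), VALENCE[el] - len(adjacency[i]))
        for i, el in enumerate(elements)
    ]
    for iteration in range(1, max_iter + 1):
        # The check at iteration k uses neighbour identifiers from k - 1.
        neighbour_ids = [ids[j] for j in adjacency[centre]]
        if len(set(neighbour_ids)) == len(neighbour_ids):
            return iteration
        # Grow every identifier by one bond: own id plus sorted neighbour ids.
        ids = [
            (ids[i], tuple(sorted(ids[j] for j in adjacency[i])))
            for i in range(len(elements))
        ]
    return None


series = {
    "butan-2-ol": (1, 2),
    "hexan-3-ol": (2, 3),
    "octan-4-ol": (3, 4),
    "decan-5-ol": (4, 5),
}
results = {
    name: first_flag_iteration(*secondary_alcohol(left, right))
    for name, (left, right) in series.items()
}
for name, iteration in results.items():
    note = "flag within radius 2" if iteration <= 2 else "no flag within radius 2"
    print(f"{name}: iteration {iteration} ({note})")
```

In this toy model the stereocentre is disambiguated at iterations 1, 2, 3 and 4 for the four alcohols, so at radius 2 only butan-2-ol and hexan-3-ol would receive a stereo flag. That matches the expectation that the enantiomer similarity reaches 1.0 once both chains outgrow the radius, but it says nothing about what either real implementation does internally.
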
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss