Hi Greg,

thanks a lot for the fast and comprehensive answer. I will give it a try,
this looks simple enough.
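
A minimal sketch of such a trial, applied to the phenyl, cyclohexenyl and
toluene examples from the original mail quoted below. The helper names
simple_invariants and rooted_fp are made up for illustration, and the
aromatic offset is added rather than OR-ed, which is an assumption about the
intent of the gist and not a copy of it:

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors


def simple_invariants(mol, include_aromaticity=False):
    """Atom invariants: atomic number, plus 1000 for aromatic atoms if requested."""
    if include_aromaticity:
        return [a.GetAtomicNum() + 1000 * int(a.GetIsAromatic())
                for a in mol.GetAtoms()]
    return [a.GetAtomicNum() for a in mol.GetAtoms()]


def rooted_fp(smiles, radius, include_aromaticity=False):
    """Count-based Morgan fingerprint rooted at atom 0, using the invariants above."""
    mol = Chem.MolFromSmiles(smiles)
    invars = simple_invariants(mol, include_aromaticity)
    return rdMolDescriptors.GetMorganFingerprint(
        mol, radius, invariants=invars, fromAtoms=[0])


phenyl = "[*:1]c1ccccc1"
cyclohexenyl = "[*:1]C1=CCCCC1"
toluene = "[*:1]c1ccccc1C"

# Example 1: radius 1, without and with aromaticity in the invariants.
same_without = rooted_fp(phenyl, 1) == rooted_fp(cyclohexenyl, 1)
same_with = rooted_fp(phenyl, 1, True) == rooted_fp(cyclohexenyl, 1, True)

# Example 2: radius 2, phenyl vs. toluene; the methyl C is 3 bonds away.
same_radius2 = rooted_fp(phenyl, 2, True) == rooted_fp(toluene, 2, True)

print(same_without, same_with, same_radius2)
```

If the invariants behave as intended, this should print True False True:
aromaticity separates phenyl from cyclohexenyl, and the methyl group outside
the radius no longer changes the rooted fingerprint.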

To your point about local vs nonlocal information: You are of course
absolutely right about the apparent contradiction. My goal is to build a
system where I can visualize the Fingerprints using SMILES. In my first
attempts, I got weird results where the SMILES did not fully correspond to
the molecular subgraphs that were really used for the Fingerprints.
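
For the visualization, one possible way to get a SMILES for each environment
is to read the bitInfo of the fingerprint and extract the matching fragment
with Chem.FindAtomEnvironmentOfRadiusN and Chem.MolFragmentToSmiles. This is
only a sketch: environment_smiles is a hypothetical helper name, ethanol is
an arbitrary test molecule, and the default atom invariants are used:

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors


def environment_smiles(mol, radius):
    """Map each Morgan feature id to fragment SMILES of the environments setting it."""
    bit_info = {}
    rdMolDescriptors.GetMorganFingerprint(mol, radius, bitInfo=bit_info)
    result = {}
    for bit_id, environments in bit_info.items():
        fragments = []
        for atom_idx, env_radius in environments:
            if env_radius == 0:
                # A radius-0 environment is just the central atom.
                smi = Chem.MolFragmentToSmiles(mol, atomsToUse=[atom_idx])
            else:
                # Indices of all bonds within env_radius bonds of the central atom.
                bond_ids = list(
                    Chem.FindAtomEnvironmentOfRadiusN(mol, env_radius, atom_idx))
                atom_ids = {atom_idx}
                for b in bond_ids:
                    bond = mol.GetBondWithIdx(b)
                    atom_ids.update((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
                smi = Chem.MolFragmentToSmiles(
                    mol, atomsToUse=sorted(atom_ids), bondsToUse=bond_ids,
                    rootedAtAtom=atom_idx)
            fragments.append(smi)
        result[bit_id] = fragments
    return result


mol = Chem.MolFromSmiles("CCO")
env_map = environment_smiles(mol, 1)
for bit_id, fragments in env_map.items():
    print(bit_id, fragments)
```

With the default invariants, the fragment SMILES only covers the atoms inside
the radius, while the hashed identifier also encodes things like the number
of heavy-atom neighbors of those atoms. That is one way the SMILES and the
fingerprint can get out of sync.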

Bests,
Christian

*Dr. Christian Kramer*

Computer-Aided Drug Design (CADD)


F. Hoffmann-La Roche Ltd

Pharma Research and Early Development
Bldg. 092/4.92

CH-4070 Basel


Phone +41 61 682 2471

mailto: christian.kra...@roche.com


*Confidentiality Note: *This message is intended only for the use of the
named recipient(s) and may contain confidential and/or proprietary
information. If you are not the intended recipient, please contact the
sender and delete this message. Any unauthorized use of the information
contained in this message is prohibited.


On Thu, Jan 9, 2020 at 2:13 PM Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Christian,
>
> The topic of how to specify atom invariants came up recently on the list
> here:
>
> https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg09400.html
>
> Here's a gist that shows how to specify your own atom invariants based
> solely upon atomic number and, optionally, aromaticity:
> https://gist.github.com/greglandrum/d31ae7618cc5b7322a7121a529bf8190
> The key function is here:
>
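> from rdkit.Chem import rdMolDescriptors
>
> def get_simple_morgan(m,radius,includeAromaticity=False,**kwargs):
>     if not includeAromaticity:
>         invars = [x.GetAtomicNum() for x in m.GetAtoms()]
>     else:
>         # add 1000 for aromatic atoms; a bitwise OR here could give
>         # different (element, aromaticity) pairs the same invariant
>         invars = [x.GetAtomicNum() + 1000*x.GetIsAromatic()
>                   for x in m.GetAtoms()]
>     return rdMolDescriptors.GetMorganFingerprint(
>         m,radius,invariants=invars,**kwargs)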
>
>
>
> The gist also shows how to use the SMARTS for each atom as its atom
> invariant:
>
> import hashlib
> def get_smiles_morgan(m,radius,**kwargs):
>     invars = []
>     for x in m.GetAtoms():
>         # there's almost certainly a more performant way to do this, but....
>         # hash the atom's SMARTS and keep 4 bytes as a 32-bit invariant
>         h = hashlib.md5()
>         h.update(x.GetSmarts().encode())
>         invars.append(int.from_bytes(h.digest()[:4],'little'))
>     return rdMolDescriptors.GetMorganFingerprint(
>         m,radius,invariants=invars,**kwargs)
>
>
> Note that this is sensitive to things like atom map numbers (as shown in
> the gist).
>
> I am compelled to point out that, at least based on the way you phrase the
> question, you are asking for two mutually contradictory things here:
> The first question asks about including information about aromaticity,
> which is determined by the properties of an entire ring system and is thus
> definitely *not* local. The second question wants things to be super local
> and not affected by atoms that aren't included in the radius being
> considered.
>
> -greg
>
>
>
>
> On Thu, Jan 9, 2020 at 11:06 AM Kramer, Christian via Rdkit-discuss <
> rdkit-discuss@lists.sourceforge.net> wrote:
>
>> Dear RDKit community,
>>
>> Happy new year!
>>
>> I am looking for a way to make the circular Morgan Fingerprints more
>> SMILES like. The background is that with the default definition of atom
>> invariants in the RDKit implementation, Morgan Fingerprints do not
>> explicitly take into account aromaticity, and use more information from
>> higher radii than what would be expected when sketching the substructures
>> indexed by the fingerprint. This becomes an issue when drawing the
>> substructures, or encoding them as SMILES. Here are two examples that
>> illustrate the points:
>>
>> 1.) Aromaticity:
>> At radius 1, the atoms in phenyl and the sp2 atoms in cyclohexene yield
>> exactly the same fingerprint, whereas the SMILES for those atoms is
>> different:
>>
>> In [1]: import rdkit
>> In [2]: from rdkit import Chem
>> In [3]: from rdkit.Chem import AllChem
>> In [4]: phenyl = "[*:1]c1ccccc1"
>> In [5]: cyclohexyl = "[*:1]C1=CCCCC1"
>> In [6]: mol1 = Chem.MolFromSmiles(phenyl)
>> In [7]: mol2 = Chem.MolFromSmiles(cyclohexyl)
>> In [8]: fp1 = AllChem.GetMorganFingerprint(mol1, 1, fromAtoms=[0])
>> In [9]: fp2 = AllChem.GetMorganFingerprint(mol2, 1, fromAtoms=[0])
>> In [10]: fp1==fp2
>> Out[10]: True
>>
>> Now in many cases there probably is a good reason why those atoms can be
>> considered identical, but then there are still other cases when aromaticity
>> makes a difference. For example, when encoding the substructure as a
>> SMILES, the two atoms are different ("c" and "C"), which can create
>> confusion when comparing to the fingerprint.
>>
>>
>> 2.) Information from higher radii
>> The Morgan Fingerprint has the concept of radius. For a radius of 2, I
>> would naively expect that only atom environments up to 2 atoms away from
>> the rooted atom are taken into account. However, this is not fully true, as
>> shown below:
>>
>> In [11]: toluene = "[*:1]c1ccccc1C"
>> In [12]: mol3 = Chem.MolFromSmiles(toluene)
>> In [13]: fp1 = AllChem.GetMorganFingerprint(mol1, 2, fromAtoms=[0])
>> In [14]: fp3 = AllChem.GetMorganFingerprint(mol3, 2, fromAtoms=[0])
>> In [15]: fp1==fp3
>> Out[15]: False
>>
>> Toluene and phenyl differ only in the methyl C attached ortho to the
>> star atom's attachment point. This C is 3 bonds away from the star atom.
>> Therefore, when calculating the Morgan Fingerprint with radius 2 rooted on
>> the star atom, I would expect the two fingerprints derived from phenyl and
>> toluene to be the same. I assume this is not the case because the
>> connectivity of the ortho atom, which is within the radius, distinguishes
>> between a bond to a heavy atom and a bond to a hydrogen.
>>
>>
>> It would be very helpful to get suggestions or even code snippets for how
>> to change the default behaviour of the Morgan Fingerprinter such that the
>> representation is closer to what one draws or encodes in SMILES for the
>> atoms in a given radius. The documentation says that atom invariants can be
>> defined, which I hope will help here. If someone did this before, it would be
>> cool if you could share how to do it exactly.
>>
>> Thanks a lot,
>> Christian
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
