Hi Greg,

Thanks a lot for the fast and comprehensive answer. I will give it a try,
this looks simple enough.

To your point about local vs nonlocal information: You are of course
absolutely right about the apparent contradiction. My goal is to build a
system where I can visualize the Fingerprints using SMILES. In my first
attempts, I got weird results where the SMILES did not fully correspond to
the molecular subgraphs that were really used for the Fingerprints.
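
For reference, the hashing step in the second function quoted below can be tried without RDKit. This is only a minimal sketch using the standard library: label_to_invariant is a made-up helper name, and the strings "c", "C", "*" and "[*:1]" are illustrative stand-ins for whatever x.GetSmarts() returns for a given atom.

```python
import hashlib


def label_to_invariant(label: str) -> int:
    """Map a per-atom label (e.g. an atom SMARTS string) to a 32-bit integer."""
    digest = hashlib.md5(label.encode()).digest()
    # Keep only the first 4 bytes so the value fits into an unsigned 32-bit int.
    return int.from_bytes(digest[:4], "little")


if __name__ == "__main__":
    # Hypothetical labels: aromatic vs. aliphatic carbon, and a dummy atom
    # with and without an atom map number.
    for label in ["c", "C", "*", "[*:1]"]:
        print(label, label_to_invariant(label))
```

Because the label text itself is hashed, anything that changes the string, such as an atom map number, also changes the invariant. That is the sensitivity Greg mentions below.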


On Thu, Jan 9, 2020 at 2:13 PM Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Christian,
> The topic of how to specify atom invariants came up recently on the list
> here:
> https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg09400.html
> Here's a gist that shows how to specify your own atom invariants based
> solely upon atomic number and, optionally, aromaticity:
> https://gist.github.com/greglandrum/d31ae7618cc5b7322a7121a529bf8190
> The key function is here:
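> from rdkit.Chem import rdMolDescriptors
> def get_simple_morgan(m,radius,includeAromaticity=False,**kwargs):
>     if not includeAromaticity:
>         invars = [x.GetAtomicNum() for x in m.GetAtoms()]
>     else:
>         # add 1000 for aromatic atoms; addition (rather than a bitwise OR)
>         # keeps the values distinct for different elements
>         invars = [x.GetAtomicNum() + 1000*x.GetIsAromatic()
>                   for x in m.GetAtoms()]
>     return rdMolDescriptors.GetMorganFingerprint(
>         m, radius, invariants=invars, **kwargs)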
> The gist also shows how to use the SMARTS for each atom as its atom
> invariant:
> import hashlib
> def get_smiles_morgan(m,radius,**kwargs):
>     invars = []
>     for x in m.GetAtoms():
>         # there's almost certainly a more performant way to do this,
>         # but....
>         # hash the atom's SMARTS and keep 4 bytes as a 32-bit invariant
>         h = hashlib.md5()
>         h.update(x.GetSmarts().encode())
>         invars.append(int.from_bytes(h.digest()[:4],'little'))
>     return rdMolDescriptors.GetMorganFingerprint(
>         m, radius, invariants=invars, **kwargs)
> Note that this is sensitive to things like atom map numbers (as shown in
> the gist).
> I am compelled to point out that, at least based on the way you phrase the
> question, you are asking for two mutually contradictory things here:
> The first question asks about including information about aromaticity,
> which is determined by the properties of an entire ring system and is thus
> definitely *not* local. The second question wants things to be super local
> and not affected by atoms that aren't included in the radius being
> considered.
> -greg
> On Thu, Jan 9, 2020 at 11:06 AM Kramer, Christian via Rdkit-discuss <
> rdkit-discuss@lists.sourceforge.net> wrote:
>> Dear RDKit community,
>> Happy new year!
>> I am looking for a way to make the circular Morgan Fingerprints more
>> SMILES-like. The background is that with the default definition of atom
>> invariants in the RDKit implementation, Morgan Fingerprints do not
>> explicitly take into account aromaticity, and use more information from
>> higher radii than what would be expected when sketching the substructures
>> indexed by the fingerprint. This becomes an issue when drawing the
>> substructures, or encoding them as SMILES. Here are two examples that
>> illustrate the points:
>> 1.) Aromaticity:
>> At radius 1, the atoms in phenyl and the sp2 atoms in cyclohexene yield
>> exactly the same fingerprint, whereas the SMILES for those atoms is
>> different:
>> In [1]: import rdkit
>> In [2]: from rdkit import Chem
>> In [3]: from rdkit.Chem import AllChem
>> In [4]: phenyl = "[*:1]c1ccccc1"
>> In [5]: cyclohexyl = "[*:1]C1=CCCCC1"
>> In [6]: mol1 = Chem.MolFromSmiles(phenyl)
>> In [7]: mol2 = Chem.MolFromSmiles(cyclohexyl)
>> In [8]: fp1 = AllChem.GetMorganFingerprint(mol1, 1, fromAtoms=[0])
>> In [9]: fp2 = AllChem.GetMorganFingerprint(mol2, 1, fromAtoms=[0])
>> In [10]: fp1==fp2
>> Out[10]: True
>> Now in many cases there probably is a good reason why those atoms can be
>> considered identical, but then there are still other cases when aromaticity
>> makes a difference. For example, when encoding the substructure as a
>> SMILES, the two atoms are different ("c" and "C"), which can create
>> confusion when comparing to the fingerprint.
>> 2.) Information from higher radii
>> The Morgan Fingerprint has the concept of radius. For a radius of 2, I
>> would naively expect that only atom environments up to 2 atoms away from
>> the rooted atom are taken into account. However, this is not fully true, as
>> shown below:
>> In [11]: toluene = "[*:1]c1ccccc1C"
>> In [12]: mol3 = Chem.MolFromSmiles(toluene)
>> In [13]: fp1 = AllChem.GetMorganFingerprint(mol1, 2, fromAtoms=[0])
>> In [14]: fp3 = AllChem.GetMorganFingerprint(mol3, 2, fromAtoms=[0])
>> In [15]: fp1==fp3
>> Out[15]: False
>> Toluene and phenyl differ only in the methyl C attached at the position
>> ortho to the star atom. This C is 3 bonds away from the star atom.
>> Therefore, when calculating the
>> MorganFingerprint with radius 2 rooted on the star atom, I would expect the
>> two fingerprints derived from phenyl and toluene to be the same. I assume
>> this is not the case because the connectivity makes a difference between a
>> bond to a heavy atom and to a hydrogen.
>> It would be very helpful to get suggestions or even code snippets for how
>> to change the default behaviour of the Morgan Fingerprinter such that the
>> representation is closer to what one draws or encodes in SMILES for the
>> atoms in a given radius. The documentation says that atom invariants can be
>> defined, which I hope helps here. If someone has done this before, it would be
>> cool if you could share how to do it exactly.
>> Thanks a lot,
>> Christian
