Oh tempora o mores. Didn't we try for ages to make our SMILES canonical and
now, all of sudden, the opposite is hip :-)

On Mon, Aug 6, 2018 at 1:38 PM Chris Earnshaw <cgearns...@gmail.com> wrote:

> Hi
>
> The question 'what do you mean by ALL?' springs to mind. None of the
> discussion includes dot-disconnected SMILES, which are also perfectly valid
> representations. For example C(C1C2)C.C12 is yet another SMILES (of many
> possible) for the example structure.
>
> I've no idea whether this is of any relevance to you, but you should
> probably consider these representations and decide whether they are
> important or not.
>
> Best regards,
> Chris
>
> On 6 August 2018 at 11:27, Jan Halborg Jensen <jhjen...@chem.ku.dk> wrote:
>
>> This blogpost links to two other ones that may have done that (haven’t
>> read them carefully):
>> https://baoilleach.blogspot.com/2018/06/cheminformatics-for-deep-learners.html
>>
>> Best regards, Jan
>>
>> On 06 Aug 2018, at 11:57, Guillaume GODIN <guillaume.go...@firmenich.com>
>> wrote:
>>
>> Dear Greg,
>>
>> Fantastic, thank you to give both explanation and solution to this
>> “simple question”, I know this is not so simple & it’s fundamental for data
>> augmentation in deep learning.
>>
>> If I may, I have another question related, do you know if someone has
>> worked on a generator of all unique smiles independently of RDKit ?
>>
>> Thanks again,
>>
>> Guillaume
>>
>> *De : *Greg Landrum <greg.land...@gmail.com>
>> *Date : *lundi, 6 août 2018 à 11:40
>> *À : *Guillaume GODIN <guillaume.go...@firmenich.com>
>> *Cc : *RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
>> *Objet : *Re: [Rdkit-discuss] enumeration of smiles question
>>
>>
>> On Thu, Aug 2, 2018 at 8:59 AM Guillaume GODIN <
>> guillaume.go...@firmenich.com> wrote:
>>
>>
>> I have a simple question about generating all possible smiles of a given
>> molecule:
>>
>>
>> It's a simple question, but the answer is somewhat complicated. :-)
>>
>>
>>
>> RDKit provides only 4 differents smiles for my molecule “CCC1CC1“:
>> C1C(CC)C1
>> CCC1CC1
>> C1(CC)CC1
>> C(C)C1CC1
>>
>> While by hand we can write those 7 smiles:
>> CCC1CC1
>> C(C)C1CC1
>> C(C1CC1)C
>> C1CC(CC)1
>> C1C(CC)C1
>> C1CC1CC
>> C(CC)1CC1
>>
>> I use this function for the enumeration:
>>
>> def allsmiles(smil):
>>     m = Chem.MolFromSmiles(smil) # Construct a molecule from a SMILES
>> string.
>>     if m is None:
>>         return smil
>>     N = m.GetNumAtoms()
>>     if N==0:
>>         return smil
>>     try:
>>         n= np.random.randint(0,high=N)
>>         t= Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
>>     except :
>>         return smil
>>     return t
>>
>> n= 50
>> SMILES = [“CCC1CC1”]
>> SMILES_mult = [allsmiles(S) for S in SMILES for i in range(n)]
>>
>> Why we cannot generate all the 7 smiles ?
>>
>>
>> The RDKit has rules that it uses to decide which atom to branch to when
>> generating a SMILES. These are used regardless of whether you are
>> generating canonical SMILES or not.
>> The upshot of this is that it will never generate a SMILES where there's
>> a branch before a ring closure.
>> The other important factor here is that atom rank is determined by the
>> index of the atom in the molecule when you aren't using canonicalization.
>> So changing the atom order on input can help:
>>
>> In [12]: set(allsmiles('CCC1CC1') for i in range(50))
>> Out[12]: {'C(C)C1CC1', 'C1(CC)CC1', 'C1C(CC)C1', 'CCC1CC1'}
>>
>> In [13]: set(allsmiles('C1CC1CC') for i in range(50))
>> Out[13]: {'C(C1CC1)C', 'C1(CC)CC1', 'C1CC1CC', 'CCC1CC1'}
>>
>> You can do this all at once as follows:
>>
>> ```
>> In [20]: def allsmiles(smil):
>>     ...:     m = Chem.MolFromSmiles(smil) # Construct a molecule from a
>> SMILES string.
>>     ...:     if m is None:
>>     ...:         return smil
>>     ...:     N = m.GetNumAtoms()
>>     ...:     if N==0:
>>     ...:         return smil
>>     ...:     aids = list(range(N))
>>     ...:     random.shuffle(aids)
>>     ...:     m = Chem.RenumberAtoms(m,aids)
>>     ...:     try:
>>     ...:         n= random.randint(0,N-1)
>>     ...:         t= Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
>>     ...:     except :
>>     ...:         return smil
>>     ...:     return t
>>     ...:
>>     ...:
>>     ...:
>>
>> In [21]:
>>
>> In [21]: set(allsmiles('C1CC1CC') for i in range(50))
>> Out[21]: {'C(C)C1CC1', 'C(C1CC1)C', 'C1(CC)CC1', 'C1C(CC)C1', 'C1CC1CC',
>> 'CCC1CC1'}
>> ```
>> Note that I switched to using python's built in random module instead of
>> using the one in numpy.
>>
>> -greg
>>
>>
>>
>>
>>
>> Thanks guys,
>>
>> Best regards,
>>
>> Guillaume
>>
>> ***********************************************************************************
>> DISCLAIMER
>> This email and any files transmitted with it, including replies and
>> forwarded copies (which may contain alterations) subsequently transmitted
>> from Firmenich, are confidential and solely for the use of the intended
>> recipient. The contents do not represent the opinion of Firmenich except to
>> the extent that it relates to their official business.
>>
>> ***********************************************************************************
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org <http://slashdot.org/>!
>> http://sdm.link/slashdot_______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>> ***********************************************************************************
>> DISCLAIMER
>> This email and any files transmitted with it, including replies and
>> forwarded copies (which may contain alterations) subsequently transmitted
>> from Firmenich, are confidential and solely for the use of the intended
>> recipient. The contents do not represent the opinion of Firmenich except to
>> the extent that it relates to their official business.
>>
>> ***********************************************************************************
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org <http://slashdot.org/>!
>> http://sdm.link/slashdot_______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to