Re: [Rdkit-discuss] Extracting SMILES from text

David Cosgrove Mon, 05 Dec 2016 06:49:19 -0800

Hi Alexis,

While you're wrestling with the difference between (CCCC) and CC(C)C, you
could also consider that '.' in a SMILES is valid and denotes a mixture, for
example CCO.O.O (for vodka, maybe).  You might get those in FDA documents
that discuss formulations.  In a well-scanned and well-punctuated
document, you should be able to distinguish '. ' for the end of a sentence
from '.' for a mixture, but I don't think you'd have to be too unlucky for
some to creep through.
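
A minimal sketch of that distinction, assuming tokens are split on whitespace so that a sentence-ending '.' shows up as a trailing character while a mixture '.' sits inside the token. The helper names are hypothetical and not part of RDKit.

```python
def strip_sentence_punctuation(token, trailing=".,;:!?"):
    """Remove trailing sentence punctuation, keeping dots inside the token."""
    return token.rstrip(trailing)


def tokens_from_text(text):
    """Split on whitespace and drop trailing punctuation from each token."""
    cleaned = (strip_sentence_punctuation(tok) for tok in text.split())
    return [tok for tok in cleaned if tok]


print(tokens_from_text("Vodka is roughly CCO.O.O. It contains ethanol, CCO."))
```

A sentence boundary with no space after the full stop (for example "CCO.It") would stay fused in this sketch, which is the kind of case that can still creep through.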


Regards,
Dave


On Mon, Dec 5, 2016 at 2:28 PM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Oops! Thanks Brian and Igor! At first I did not understand the punctuation
> issues Andrew raised yesterday: a SMILES can be quoted inside parentheses,
> or sit at the end of a sentence next to a full stop or a semicolon.
> I see it now. I should remove the punctuation filter.
>
>
> For the parenthesis issue, the difficulty is telling a SMILES of the form
> (xxx)xxxx(xxx) apart from one that is merely wrapped in parentheses,
> (xxxxxxxxxxx)… I will try to address that using something like:
>
>
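> from rdkit import Chem
>
> mol = Chem.MolFromSmiles(smiles)
> # Retry without the wrapping brackets or quotes only if the raw token fails.
> if mol is None and smiles[0] in '({[\'"' and smiles[-1] in ')}]\'"':
>     mol = Chem.MolFromSmiles(smiles[1:-1])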
>
>
> Anything better?
>
>
>
> Andrew, no, Alice’s Adventures in Wonderland is not really representative
> of the text I need to extract my SMILES from (FDA regulatory documents!).
> I’ll see how it performs on the real stuff and might adjust the script
> further if needed.
>
> Thanks Andrew for the generator comprehension example (I know they exist
> and are faster than typical loops, but I can never figure out how they
> work…) I am still on the learning curve… I’ll add it to the final version.
>
>
> Markus, the valid SMILES found in Alice’s Adventures in Wonderland is the
> following: “*************”, which is the linear structure
> "Any-Any-Any-Any-Any-Any-Any..." !!! Not a company secret I’m afraid!
>
>
> Thanks again
>
>
> On 5 December 2016 at 14:23, Brian Kelley <fustiga...@gmail.com> wrote:
>
>> Cool!  Btw-  try sanitize=False
>>
>> Also, Andrew is right that you will miss parenthetical phrases.  I.e.
>> Benzene(c1ccccc1) and the like, just reasserting that this is a hard
>> problem!
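
One possible way to handle both the wrapped case, (CCCC), and the parenthetical case, Benzene(c1ccccc1), is to generate several candidate strings per token and try them in order. This is only a sketch: `candidate_strings` is a hypothetical helper, and every candidate it yields would still need to be checked with the SMILES parser.

```python
import re

# A name immediately followed by one bracketed group at the end of the token,
# e.g. "Benzene(c1ccccc1)". The name part must not contain parentheses.
_TRAILING_GROUP = re.compile(r"^[^()]+\((.+)\)$")


def candidate_strings(token):
    """Yield the token itself, then any plausible inner SMILES candidates."""
    yield token
    # Whole token wrapped in brackets or quotes: "(CCCC)" -> "CCCC"
    if len(token) > 2 and token[0] in "({['\"" and token[-1] in ")}]'\"":
        yield token[1:-1]
    # Name followed by a parenthetical: "Benzene(c1ccccc1)" -> "c1ccccc1"
    match = _TRAILING_GROUP.match(token)
    if match:
        yield match.group(1)


for tok in ["Benzene(c1ccccc1)", "(CCCC)", "CC(C)C"]:
    print(tok, list(candidate_strings(tok)))
```

Trying the unmodified token first keeps a branched SMILES such as CC(C)C intact; the stripped variants only matter when the raw token fails to parse.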
>>
>> ----
>> Brian Kelley
>>
>> On Dec 5, 2016, at 5:35 AM, Alexis Parenty <alexis.parenty.h...@gmail.com>
>> wrote:
>>
>> Dear All,
>>
>> Many thanks to everyone for your participation in that discussion. It was 
>> very interesting and useful. I have written a small script that took on 
>> board everyone’s input:
>>
>> The script applies a few "text filters" before calling the RDKit function.
>> First, I build a dictionary with every word in the text as a key and the
>> number of times it appears as the value. Then I drop every word that
>> occurs more than once (because I know that my SMILES appear only once in
>> each document). Next I remove all words shorter than 5 characters, because
>> I know that all my structures contain more than 5 atoms and I want to
>> avoid possible FPs coming from “I” or “CC”, for example. Finally, with
>> regexes, I remove the remaining words that contain letters not found in
>> the symbols of the main elements of the periodic table, and the words that
>> contain the common English punctuation marks that never occur in SMILES.
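
The filter chain described above could be sketched as follows. This is not the attached script: the function name, the allowed character set and the example text are all assumptions made for illustration, and the surviving words would still have to be passed to Chem.MolFromSmiles.

```python
import re
from collections import Counter

# Assumption: the characters a candidate token may contain. Letters are
# limited to a small organic subset (B, C, N, O, P, S, F, I, H, Cl, Br and
# the aromatic lowercase forms); widen the class for other elements.
_SMILES_CHARS = re.compile(r"[BCNOPSFIHbcnopslr0-9@+\-\[\]()=#%/\\.*]+")


def filter_candidates(text, min_length=5):
    """Return words that survive the frequency, length and character filters."""
    counts = Counter(text.split())
    return [
        word
        for word, n in counts.items()
        if n == 1                          # SMILES are assumed to occur once
        and len(word) >= min_length        # drop short words such as "I" or "CC"
        and _SMILES_CHARS.fullmatch(word)  # drop words with other characters
    ]


text = "the cat sat on the mat and Alice said hello and CC(=O)Oc1ccccc1C(=O)O"
print(filter_candidates(text))
```

Counting with `Counter` gives the word-frequency dictionary in one pass, and the three conditions mirror the three filters in the order they are described.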
>>
>> Applied one after the other, those filters take the 26836 words of the
>> book "Alice's Adventures in Wonderland" down to 780 words (97% of words
>> filtered out).
>>
>>
>> TEST RESULTS
>>
>> I have tested my script on:
>> •    7900 unique SMILES for “drug-like molecules”
>> •    Alice’s Adventures in Wonderland (I have never read the book, but I
>>      assumed it contains no SMILES!)
>> •    A shuffled mixture of Alice’s Adventures in Wonderland and the 7900
>>      unique SMILES
>>
>> The performance is as follows:
>>
>>
>> For Alice’s Adventures in Wonderland:
>> 26836 words
>> 26835 TN
>> 0 TP
>> 1 FP: “*****************************************************************” 
>> (actually a valid SMILES…)
>> 0 FN
>>
>> ==> Accuracy of 0.99996, in 0:00:00.112000
>>
>>
>>
>> For 7900 unique SMILES from unique drug-like molecules:
>> 7900 TP
>> 0 TN
>> 0 FP
>> 0 FN
>> ==> Accuracy of 0.99996, in 0:00:04.200000
>>
>>
>>
>>
>> For 7900 unique SMILES from unique drug-like molecules shuffled within
>> Alice's Adventures in Wonderland, 26836 words (34736 words in total):
>>
>> 7900 TP
>> 26835 TN
>> 1 FP: “*****************************************************************”
>> 0 FN
>>
>> ==> Accuracy of 0.99997 in 0:00:04.949000
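
For reference, the accuracy figures for the book alone and for the shuffled mixture are consistent with the usual definition, (TP + TN) / (TP + TN + FP + FN). The sketch below assumes that definition; the function name is illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of words classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts reported above for the book alone and for the shuffled mixture.
book_only = accuracy(tp=0, tn=26835, fp=1, fn=0)
mixture = accuracy(tp=7900, tn=26835, fp=1, fn=0)
print(round(book_only, 5), round(mixture, 5))
```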
>>
>>
>> Then I reprocessed the text mixture above without the text filters
>> (feeding every word from the text directly into the RDKit function) and
>> got the following result:
>>
>> 7900 TP
>> 26835 TN
>> 339 FP
>> 0 FN
>> ==> Accuracy of 0.97 in 0:00:07.893
>>
>>
>> Therefore, as Brian pointed out, the function Chem.MolFromSmiles(SMILES) is
>> crazy fast at detecting invalid SMILES, i.e. at returning None (about
>> 240K/s on my computer). What takes the longest is the processing of valid
>> SMILES into valid Mol objects (2K/s, i.e. 120 times slower).
>>
>> My conclusion is that the filters are mainly useful to prevent FPs from
>> occurring, but they bring no noticeable gain in processing time. The
>> function Chem.MolFromSmiles is very quick to discard invalid SMILES but
>> can let through a number of FPs if used without text filtering.
>>
>>
>> The script is attached; comments are again welcome!
>>
>> Thanks again,
>>
>> Alexis
>>
>>
>>
>>
>>
>>
>>
>> On 4 December 2016 at 02:52, Andrew Dalke <da...@dalkescientific.com>
>> wrote:
>>
>>> On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote:
>>> > I hacked a version of RDKit's smiles parser to compute heavy atom
>>> > count, perhaps some version of this could be used to check smiles validity
>>> > without making the actual molecule.
>>>
>>> FWIW, here's my regex code for it, which makes the assumption that only
>>> "[H]" and anything with a "*" are not heavy.
>>>
>>> import re
>>>
>>> _atom_pat = re.compile(r"""
>>> (
>>>  Cl? |
>>>  Br? |
>>>  [NOSPFIbcnosp] |
>>>  \[[^]]*\]
>>> )
>>> """, re.X)
>>>
>>> def get_num_heavies(smiles):
>>>     num_atoms = 0
>>>     for m in _atom_pat.finditer(smiles):
>>>         text = m.group()
>>>         if text == "[H]" or "*" in text:
>>>             continue
>>>         num_atoms += 1
>>>     return num_atoms
>>>
>>> This turns out to be a quite handy piece of functionality.
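
As a usage sketch, the counter above could serve as a cheap pre-filter before any molecule is built. The block repeats the pattern so that it runs on its own; `has_enough_heavies` and its default threshold of 5 are assumptions for illustration, not something from the thread.

```python
import re

# Same pattern as above: C/Cl, B/Br, single-letter organic atoms, or a
# bracket atom. Bare "*" outside brackets is never matched.
_atom_pat = re.compile(r"""
(
 Cl? |
 Br? |
 [NOSPFIbcnosp] |
 \[[^]]*\]
)
""", re.X)


def get_num_heavies(smiles):
    """Count atom tokens, skipping explicit hydrogens and wildcards."""
    num_atoms = 0
    for m in _atom_pat.finditer(smiles):
        text = m.group()
        if text == "[H]" or "*" in text:
            continue
        num_atoms += 1
    return num_atoms


def has_enough_heavies(smiles, minimum=5):
    """Hypothetical pre-filter: keep tokens with at least `minimum` heavy atoms."""
    return get_num_heavies(smiles) >= minimum


for s in ["CCO", "c1ccccc1", "[H]O[H]", "CCl", "*****"]:
    print(s, get_num_heavies(s), has_enough_heavies(s))
```

Because the regex only looks at atom-like tokens, it also returns a nonzero count for ordinary words that happen to contain those letters, so it complements the parser rather than replacing it.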
>>>
>>>
>>>                                 Andrew
>>>                                 da...@dalkescientific.com
>>>
>>>
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>> <SMILES_from_english_text_parser.txt>
>>
>
