Hi Alexis,

While you're wrestling with the difference between (CCCC) and CC(C)C, you could also consider that '.' is valid in a SMILES and denotes a mixture, for example CCO.O.O (for vodka, maybe). You might get those in FDA documents that discuss formulations. In a well-scanned and well-punctuated document, you should be able to distinguish '. ' at the end of a sentence from '.' in a mixture, but I don't think you'd have to be too unlucky for some to creep through.
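A minimal sketch of that distinction, assuming the text is tokenised on whitespace: trim only the trailing sentence punctuation and leave any '.' inside a token alone, so that CCO.O.O survives intact. The function name `candidate_tokens` and the punctuation set are illustrative choices, not taken from Alexis's script, and the surviving tokens would still have to go through Chem.MolFromSmiles to be validated.

```python
# Characters that usually belong to the sentence rather than to a SMILES
# when they appear at the very end of a token (an assumption for this sketch).
TRAILING_PUNCTUATION = ".,;:!?"


def candidate_tokens(text):
    """Yield whitespace-separated tokens with trailing punctuation trimmed.

    A '.' inside a token is kept, because in SMILES it separates the
    components of a mixture (e.g. CCO.O.O). Only characters at the end
    of the token are removed.
    """
    for token in text.split():
        trimmed = token.rstrip(TRAILING_PUNCTUATION)
        if trimmed:  # skip tokens that were nothing but punctuation
            yield trimmed


if __name__ == "__main__":
    for token in candidate_tokens("Vodka is CCO.O.O. Ethanol (CCO); water."):
        print(token)
```

This does nothing for the unlucky cases, such as a missing space after a full stop, which would still glue two words together with a '.' in between.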

Regards,
Dave

On Mon, Dec 5, 2016 at 2:28 PM, Alexis Parenty <alexis.parenty.h...@gmail.com> wrote:

> Oops! Thanks Brian and Igor! I did not understand at first the punctuation
> issues raised yesterday by Andrew, with SMILES that could be quoted inside
> parentheses or at the end of a sentence next to a full stop or a semicolon.
> I see it now. I should remove the punctuation filter.
>
> For the parenthesis issue, the difficulty is to differentiate the SMILES
> format (xxx)xxxx(xxx) from this one: (xxxxxxxxxxx)… I will try to address
> that using something like:
>
>     from rdkit import Chem
>
>     mol = Chem.MolFromSmiles(smiles)
>     # If parsing fails and the token is wrapped, retry without the wrapper.
>     if mol is None and smiles[0] in "({['\"" and smiles[-1] in ")}]'\"":
>         mol = Chem.MolFromSmiles(smiles[1:-1])
>
> Anything better?
>
> Andrew, no, Alice's Adventures in Wonderland is not really representative
> of the text I need to extract my SMILES from (FDA regulatory documents!).
> I'll see how it performs on the real stuff and might adjust the script
> further if needed.
>
> Thanks Andrew for the generator comprehension example (I know they exist
> and are faster than typical loops, but I can never figure out how they
> work…). I am still on the learning curve… I'll add it to the final version.
>
> Markus, the valid SMILES found in Alice's Adventures in Wonderland is the
> following: "*************", which is the linear structure
> "Any-Any-Any-Any-Any-Any-Any..." !!! Not a company secret, I'm afraid!
>
> Thanks again
>
> On 5 December 2016 at 14:23, Brian Kelley <fustiga...@gmail.com> wrote:
>
>> Cool! Btw, try sanitize=False.
>>
>> Also, Andrew is right that you will miss parenthetical phrases, i.e.
>> Benzene(c1ccccc1) and the like. Just reasserting that this is a hard
>> problem!
>>
>> ----
>> Brian Kelley
>>
>> On Dec 5, 2016, at 5:35 AM, Alexis Parenty <alexis.parenty.h...@gmail.com>
>> wrote:
>>
>> Dear All,
>>
>> Many thanks to everyone for your participation in that discussion. It was
>> very interesting and useful.
>> I have written a small script that takes on board everyone's input.
>>
>> It applies a few "text filters" before the RDKit function. First of all, I
>> made a dictionary with every word present in the text as a key and the
>> number of times it appears in the text as the value. Then I removed from
>> the list of unique keys (words) all the ones that appear more than once
>> (because I know that my SMILES appear only once in each document). Then I
>> removed all the words that are shorter than 5 letters, because I know that
>> all my structures contain more than 5 atoms and I want to remove possible
>> FPs coming from "I" or "CC", for example. Then, with a regex, I removed all
>> unique words that contain letters that are not in the main periodic table
>> of elements, and removed the words that contain the main English
>> punctuation signs that never occur in SMILES.
>>
>> Placed one after the other, those filters take the 26 836 words of the book
>> "Alice's Adventures in Wonderland" down to 780 words (97% of words
>> filtered out).
>>
>> TEST RESULTS
>>
>> I have tested my script on:
>> • 7900 unique SMILES for "drug-like molecules"
>> • Alice's Adventures in Wonderland (I never read the book but I assumed
>>   there is no SMILES!)
>> • A shuffled mixture of Alice's Adventures in Wonderland and the 7900
>>   unique SMILES
>>
>> The performance is as follows:
>>
>> For Alice's Adventures in Wonderland:
>> 26836 words
>> 26835 TN
>> 0 TP
>> 1 FP: "*****************************************************************"
>> (actually a valid SMILES…)
>> 0 FN
>> ==> Accuracy of 0.99996, in 0:00:00.112000
>>
>> For 7900 unique SMILES from unique drug-like molecules:
>> 7900 TP
>> 0 TN
>> 0 FP
>> 0 FN
>> ==> Accuracy of 0.99996, in 0:00:04.200000
>>
>> 7900 unique SMILES from unique drug-like molecules shuffled within Alice's
>> Adventures in Wonderland, 26836 words (34736 words in total):
>> 7900 TP
>> 26835 TN
>> 1 FP: "*****************************************************************"
>> 0 FN
>> ==> Accuracy of 0.99997, in 0:00:04.949000
>>
>> Then I reprocessed the text mixture above without the text filters
>> (directly feeding every word from the text into the RDKit function) and got
>> the following result:
>> 7900 TP
>> 26835 TN
>> 339 FP
>> 0 FN
>> ==> Accuracy of 0.97, in 0:00:07.893
>>
>> Therefore, as Brian pointed out, the function Chem.MolFromSmiles(SMILES) is
>> crazy fast at detecting invalid SMILES, i.e. at returning None (about
>> 240K/s on my computer). What takes the longest is the processing of valid
>> SMILES into valid Mol objects (2K/s, i.e. 120 times slower).
>>
>> My conclusion is that the filters are mainly useful to prevent FPs from
>> occurring, but there is no noticeable gain in processing time. The function
>> Chem.MolFromSmiles is very quick to discard invalid SMILES but can let
>> through a number of FPs if used without text filtering.
>>
>> The script is attached; comments are again welcome!
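For readers without the attachment, the chain of text filters described above could be sketched roughly as follows. This is not the attached script: the function name `smiles_candidates`, the `min_length` parameter and the character whitelist (organic-subset letters plus H, digits, '.' and the usual SMILES symbols) are assumptions made for illustration. Note that '.' and parentheses are allowed, in line with dropping the punctuation filter. The tokens that survive would still be passed to Chem.MolFromSmiles.

```python
import re
from collections import Counter

# Whitelist of characters allowed in a candidate token. The letter set is an
# assumption (organic subset plus H); a real document with bracket atoms such
# as [Na+] would need a wider set.
SMILES_LIKE = re.compile(r"[BCNOPSFIHbcnopslr0-9@+\-\[\]()\\/%=#.*]+")


def smiles_candidates(text, min_length=5):
    """Return the tokens of `text` that survive the cheap text filters."""
    words = text.split()
    counts = Counter(words)
    return [
        word
        for word in words
        if counts[word] == 1             # filter 1: word occurs exactly once
        and len(word) >= min_length      # filter 2: drop short tokens
        and SMILES_LIKE.fullmatch(word)  # filter 3: SMILES-like characters only
    ]


if __name__ == "__main__":
    sample = "Alice saw c1ccccc1 and CC(=O)Oc1ccccc1C(=O)O but CC and CC again"
    print(smiles_candidates(sample))
```

English words built only from whitelisted letters (for example "loops") pass these filters, so the parser remains the final judge.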
>>
>> Thanks again,
>>
>> Alexis
>>
>> On 4 December 2016 at 02:52, Andrew Dalke <da...@dalkescientific.com>
>> wrote:
>>
>>> On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote:
>>> > I hacked a version of RDKit's smiles parser to compute heavy atom
>>> > count, perhaps some version of this could be used to check smiles
>>> > validity without making the actual molecule.
>>>
>>> FWIW, here's my regex code for it, which makes the assumption that only
>>> "[H]" and anything with a "*" are not heavy.
>>>
>>>   import re
>>>
>>>   _atom_pat = re.compile(r"""
>>>   (
>>>    Cl? |
>>>    Br? |
>>>    [NOSPFIbcnosp] |
>>>    \[[^]]*\]
>>>   )
>>>   """, re.X)
>>>
>>>   def get_num_heavies(smiles):
>>>       num_atoms = 0
>>>       for m in _atom_pat.finditer(smiles):
>>>           text = m.group()
>>>           if text == "[H]" or "*" in text:
>>>               continue
>>>           num_atoms += 1
>>>       return num_atoms
>>>
>>> This turns out to be a quite handy piece of functionality.
>>>
>>> Andrew
>>> da...@dalkescientific.com
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>> <SMILES_from_english_text_parser.txt>