subject:"\[Rdkit\-discuss\] Extracting SMILES from text"

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Ling Chan

Thank you for sharing your results, Alexis. This is indeed an interesting problem. I just wonder what the 339 FPs are. Are they all English words with fewer than 6 characters? If RDKit can construct a molecule out of them, I suppose in theory they could be valid SMILES? Looks like the problem with

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Andrew Dalke

On Dec 5, 2016, at 3:28 PM, Alexis Parenty wrote: > For the parenthesis issue, the difficulty is to differentiate the SMILES > formats (xxx)(xxx) from this one (xxx)… I will try and address > that using something like: I do not understand. The first one is not a SMILES format. Can y

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread David Cosgrove

Hi Alexis, While you're wrestling with the difference between () and CC(C)C you could also consider that . in a SMILES is valid, and denotes a mixture, for example CCO.O.O (for vodka, maybe). You might get those in FDA documents that discuss formulations, for example. In a well scanned and p

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Alexis Parenty

Oops! Thanks Brian and Igor! At first I did not understand the punctuation issues Andrew raised yesterday, with SMILES that could be quoted inside parentheses or sit at the end of a sentence next to a full stop or a semicolon. I see it now. I should remove the punctuation filter. For the parenthes

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Brian Kelley

Cool! Btw- try sanitize=False Also, Andrew is right that you will miss parenthetical phrases. I.e. Benzene(c1c1) and the like, just reasserting that this is a hard problem! Brian Kelley > On Dec 5, 2016, at 5:35 AM, Alexis Parenty > wrote: > > Dear All, > Many thanks to everyon

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Andrew Dalke

On Dec 5, 2016, at 11:35 AM, Alexis Parenty wrote: > I have tested my script on: > • 7900 unique SMILES for “drug-like molecules” > • Alice’s adventure in wonderland (I never read the book but I assumed > there is no SMILES!) > • A shuffled mixture of Alice’s in wonderland and 7900 uni

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Igor Filippov

Alexis, Nice, but it doesn't seem to take into account Andrew Dalke's comment that valid SMILES may be adjacent to a punctuation sign (e.g. period or parenthesis). Perhaps it is not an issue for your specific project, but maybe instead of simple "split()" it is worthwhile to use something more sop
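One way to go beyond a bare split() is to keep each whitespace-separated token but also try variants with sentence punctuation and wrapping parentheses or quotes removed. The sketch below is only an illustration of that idea, not anything posted in the thread; the names `TRAILING`, `WRAPPERS` and `candidates` are made up for the example. Each variant would then be handed to whatever validity check is in use.

```python
from typing import List

# Hypothetical helper names; punctuation that may trail a SMILES in running text.
TRAILING = ".,;:!?"
# Opening wrapper -> closing wrapper that may enclose a quoted SMILES.
WRAPPERS = {"(": ")", '"': '"', "'": "'"}


def candidates(token: str) -> List[str]:
    """Return the token plus variants with trailing punctuation and wrappers removed."""
    variants = [token]

    # "c1ccccc1;" -> "c1ccccc1". Internal dots (mixtures such as CCO.O.O) are untouched.
    stripped = token.rstrip(TRAILING)
    if stripped and stripped not in variants:
        variants.append(stripped)

    # "(CCO)" -> "CCO". Branches such as CC(C)C do not start with "(", so they are kept whole.
    for variant in list(variants):
        if len(variant) > 2 and variant[0] in WRAPPERS and variant[-1] == WRAPPERS[variant[0]]:
            inner = variant[1:-1]
            if inner not in variants:
                variants.append(inner)
    return variants


if __name__ == "__main__":
    for tok in "The solvent (CCO). was mixed with CC(C)C and CCO.O.O.".split():
        print(tok, "->", candidates(tok))
```

Because the original token is always kept as the first candidate, nothing that the plain split() approach would have found is lost. A form like Name(SMILES) with no space before the parenthesis is not handled here and would need an extra splitting rule.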

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Alexis Parenty

Dear All, Many thanks to everyone for your participation in that discussion. It was very interesting and useful. I have written a small script that took on board everyone’s input: This incorporates a few "text filters" before the RDKit function: First of all I made a dictionary of all the words p

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-03 Thread Andrew Dalke

On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote: > I hacked a version of RDKit's smiles parser to compute heavy atom count, > perhaps some version of this could be used to check smiles validity without > making the actual molecule. FWIW, here's my regex code for it, which makes the assumption tha

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-03 Thread Andrew Dalke

On Dec 3, 2016, at 3:02 PM, Brian Kelley wrote: > If I had to pick, I would just use the normal MolFromSmiles, if you don't > expect many actual smiles strings in your corpus, it's plenty fast. I didn't follow from your timings what you used to see if something was a SMILES candidate? Was it wo

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-03 Thread Brian Kelley

Note: I turned logging off, otherwise a lot of time was spent spewing to stderr:

from rdkit import Chem, rdBase
rdBase.DisableLog("rdApp.*")

On Sat, Dec 3, 2016 at 9:02 AM, Brian Kelley wrote: > Here are some number from my laptop for parsing: > > Normal Smiles parser: > = > P

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-03 Thread Brian Kelley

Here are some numbers from my laptop for parsing:

Normal SMILES parser:
  Proper SMILES: 11K/s
  Non-SMILES words: 94K/s

Don't make molecules (n.b. accepts some 'bad' SMILES like C1CCC3):
  Proper SMILES: 110K/s
  Non-SMILES words: 130K/s

If I had to pick, I would just

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread George Papadatos

:) George. Sent from my giPhone > On 2 Dec 2016, at 22:11, Dimitri Maziuk wrote: > >> On 12/02/2016 03:12 PM, George Papadatos wrote: >> Here's a pragmatic idea: > ... would it not be safe to >> assume that *any *word containing more than 4 'C' or 'c' characters would >> only be a SMILES stri

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Dimitri Maziuk

On 12/02/2016 03:12 PM, George Papadatos wrote: > Here's a pragmatic idea: ... would it not be safe to > assume that *any *word containing more than 4 'C' or 'c' characters would > only be a SMILES string? pneumonoultramicroscopicsilicovolcanoconiosis -- Dimitri Maziuk Programmer/sysadmin BioMa

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke

On Dec 2, 2016, at 10:05 PM, Brian Kelley wrote: > Here is a very old version of Andrew's parser in code form: ... It was fairly > well tested on the sigma catalog back in the day. It might be fun to > resurrect and use it in some form. There's also my OpenSMILES parser written for Ragel: https:/

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke

On Dec 2, 2016, at 10:12 PM, George Papadatos wrote: > If Alexis wants to search for valid SMILES strings representing typical > organic molecules among text of plain English words, would it not be safe to > assume that any word containing more than 4 'C' or 'c' characters would only > be a SMIL

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread George Papadatos

Here's a pragmatic idea: If Alexis wants to search for valid SMILES strings representing typical *organic *molecules among text of plain English words, would it not be safe to assume that *any *word containing more than 4 'C' or 'c' characters would only be a SMILES string? This simple filter (wor
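As a rough sketch of this heuristic (the function name `many_carbons` and the default threshold are assumptions made for illustration), counting 'C' and 'c' characters takes one line. The example also shows both failure modes: small molecules are missed, and long English words can still pass, as the counterexample posted elsewhere in this thread demonstrates.

```python
def many_carbons(word: str, threshold: int = 4) -> bool:
    """True if the word contains more than `threshold` 'C' or 'c' characters.

    Note that the 'C' in 'Cl' is counted too; this is a crude prefilter only.
    """
    return sum(ch in "Cc" for ch in word) > threshold


if __name__ == "__main__":
    print(many_carbons("c1ccccc1"))   # benzene: six 'c' characters
    print(many_carbons("CCO"))        # ethanol: only two, so it is filtered out
    # Dimitri's counterexample from the thread also has six 'c' characters.
    print(many_carbons("pneumonoultramicroscopicsilicovolcanoconiosis"))
```

A filter like this is probably best used to prioritise words for a real parse, not to decide on its own that something is a SMILES.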

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Brian Kelley

Here is a very old version of Andrew's parser in code form: http://frowns.cvs.sourceforge.net/viewvc/frowns/frowns/smiles_parsers/Smiles.py?revision=1.1.1.1&content-type=text%2Fplain that I used in frowns more than a decade ago. It was fairly well tested on the sigma catalog back in the day. It

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Andrew Dalke

On Dec 2, 2016, at 11:11 AM, Greg Landrum wrote: > An initial start on some regexps that match SMILES is here: > https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb > > that may also be useful I've put together a more gnarly regular expression to find possible SMILES

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Brian Kelley

George, My point was that actually parsing the words as IUPAC/SMILES is surprisingly effective, as opposed to an AI or rule-based system. Without sanitization, RDKit parses about 60,000 SMILES/second on my laptop. It is much faster when not making molecules, but I don't have the number h

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread George Papadatos

I think Alexis was referring to converting actual SMILES strings found in random text. Chemical entity recognition and name to structure conversion is another story altogether and nowadays one can quickly go a long way with open tools such as OSCAR + OPSIN in KNIME or with something like this: http

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Brian Kelley

This was why they started using the dictionary lookup as I recall :). The iupac system they ended up using was Roger's when at OpenEye. Brian Kelley > On Dec 2, 2016, at 12:33 PM, Igor Filippov wrote: > > I could be wrong but I believe IBM system had a preprocessing step which > removed

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Igor Filippov

I could be wrong, but I believe the IBM system had a preprocessing step which removed all known dictionary words - which would get rid of "submarine" etc. I also believe this problem has been solved multiple times in the past, NextMove software comes to mind, chemical tagger - http://chemicaltagger.ch.c
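A dictionary preprocessing step of this kind can be sketched in a few lines. This is not the IBM system, whose details are not given here; the function name and the tiny word set are placeholders. A real run would load a full English word list.

```python
from typing import Iterable, List, Set


def drop_dictionary_words(words: Iterable[str], dictionary: Set[str]) -> List[str]:
    """Remove words that look like ordinary dictionary words.

    `dictionary` holds lowercase English words. A word is dropped only when it
    is written in lowercase or Title case, so an all-caps token such as "ON"
    (which could be a SMILES) survives even though "on" is in the dictionary.
    """
    kept = []
    for word in words:
        looks_like_prose = word.islower() or word.istitle()
        if looks_like_prose and word.lower() in dictionary:
            continue
        kept.append(word)
    return kept


if __name__ == "__main__":
    english = {"submarine", "on", "i", "the"}  # placeholder word list
    print(drop_dictionary_words(["The", "submarine", "CCO", "On", "ON", "I"], english))
```

The case rule is a design choice for the example: it trades a few extra candidates for not throwing away upper-case tokens that happen to spell a word.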

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Brian Kelley

I hacked a version of RDKit's SMILES parser to compute heavy atom count; perhaps some version of this could be used to check SMILES validity without making the actual molecule. From a fun historical perspective: IBM had an expert system to find IUPAC names in documents. They ended up finding th

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Peter Gedeck

Hello Alexis, Depending on the size of your document, you could consider limiting the storage of already-tested strings by word length and memoizing only shorter words. SMILES tend to be longer, so everything above a given number of characters has a higher probability of being a SMILES. Large words probab
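A minimal sketch of length-limited memoization, assuming some validity check `is_smiles` already exists (the class name and the cutoff of 10 characters are illustrative choices, not values from the thread):

```python
from typing import Callable, Dict


class ShortWordCache:
    """Wrap a validity check and remember results only for short words."""

    def __init__(self, is_smiles: Callable[[str], bool], max_cached_length: int = 10):
        self._is_smiles = is_smiles
        self._max_cached_length = max_cached_length
        self._cache: Dict[str, bool] = {}

    def __call__(self, word: str) -> bool:
        # Long words are rarely repeated, so test them directly and store nothing.
        if len(word) > self._max_cached_length:
            return self._is_smiles(word)
        # Short words ("the", "and", ...) repeat constantly, so test each once.
        if word not in self._cache:
            self._cache[word] = self._is_smiles(word)
        return self._cache[word]


if __name__ == "__main__":
    calls = []

    def fake_check(word: str) -> bool:  # stand-in for a real SMILES parser
        calls.append(word)
        return word == "CCO"

    check = ShortWordCache(fake_check)
    results = [check(w) for w in ["the", "CCO", "the", "the", "CCO"]]
    print(results, "parser calls:", len(calls))
```

This keeps the cache small while still avoiding repeated parses of common short words.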

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Alexis Parenty

Dear Pavel and Greg, Thanks Greg for the regexps link. I’ll use that too. Pavel, I need to track which document the SMILES come from, but I will indeed make a set of unique words for each document before looping. Thanks! Best, Alexis On 2 December 2016 at 11:21, Pavel wrote: > Hi,

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Pavel

Hi, Alexis, if you do not need to track which document the SMILES come from, you may just combine all words from all documents in a list, take only the unique words and test them. Then you would not need to store and check for valid/non-valid strings. That would reduce problem complexity as well. Pave
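The unique-word idea can be combined with per-document bookkeeping by building an index from each word to the documents that contain it. The following is a sketch under that assumption; the function names and the `is_smiles` callback are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


def index_words(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every unique word to the ids of the documents it appears in."""
    word_to_docs: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in set(text.split()):
            word_to_docs[word].add(doc_id)
    return dict(word_to_docs)


def smiles_by_document(
    documents: Dict[str, str], is_smiles: Callable[[str], bool]
) -> Dict[str, Set[str]]:
    """Test each unique word once, then report hits per document."""
    hits: Dict[str, Set[str]] = defaultdict(set)
    for word, doc_ids in index_words(documents).items():
        if is_smiles(word):
            for doc_id in doc_ids:
                hits[doc_id].add(word)
    return dict(hits)


if __name__ == "__main__":
    docs = {"a.txt": "ethanol is CCO", "b.txt": "CCO is ethanol and CC(C)C too"}
    fake_check = lambda w: w in {"CCO", "CC(C)C"}  # stand-in for a real parser
    print(smiles_by_document(docs, fake_check))
```

With this layout no separate list of already-rejected strings is needed, since every word reaches the validity check exactly once.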

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Greg Landrum

An initial start on some regexps that match SMILES is here: https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb that may also be useful On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty < alexis.parenty.h...@gmail.com> wrote: > Hi Markus, > > > Yes! I might discover nov

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Alexis Parenty

Hi Markus, Yes! I might discover novel compounds that way!! It would be interesting to see what they look like… Good suggestion to also store the words that were correctly identified as SMILES. I’ll add that to the script. I also like your “distribution of word” idea. I could safely skip any word

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Markus Sitzmann

Hi Alexis, you may also find some "novel" compounds by this approach :-). Whether your tuple solution improves performance strongly depends on the content of your text documents and how often they repeat the same words again - but my guess would be it will help. Probably the best way is even t

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Alexis Parenty

Hi Maciek, Thanks for your quick response. Excellent suggestions. I could filter out a lot of crap that way... Maybe I could also add a filter on word length to avoid having a lot of Ethane and Iodide false positives! This also made me think that I could transform the text into a set to avoid s

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Maciek Wójcikowski

Hi Alexis, You may want to use a regex to filter out strings containing invalid characters (i.e. only a small subset of atoms may appear without brackets). See the "Atoms" section: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html The set might grow pretty quick and may be inefficient,
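A character-level prefilter along these lines might look like the sketch below. It is a deliberately loose pattern written for illustration, not one of the regexes posted in this thread: it only checks that a word is built from organic-subset atom symbols, bracket atoms, bond symbols, branches, ring-closure digits and dots. It does not check that rings close or that parentheses balance.

```python
import re

# Atoms that may appear without brackets (organic subset plus aromatic forms).
ATOM = r"Cl|Br|[BCNOPSFI]|[bcnops]"
# Anything inside square brackets is accepted as a bracket atom at this stage.
BRACKET = r"\[[^\[\]]+\]"
# Bonds, branches, dot disconnection, ring-closure digits, '%' and the '*' wildcard.
OTHER = r"[-=#$:/\\().%0-9*]"

CANDIDATE = re.compile(rf"(?:{ATOM}|{BRACKET}|{OTHER})+")


def could_be_smiles(word: str) -> bool:
    """Cheap prefilter: True if the word uses only SMILES-like characters."""
    return CANDIDATE.fullmatch(word) is not None


if __name__ == "__main__":
    for w in ["CC(C)C", "c1ccccc1", "[Na+].[Cl-]", "Alice", "submarine", "I"]:
        print(w, could_be_smiles(w))
```

Words that pass still need a real parse; short words such as "I" or "In"-like tokens made only of allowed letters will get through this stage.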

[Rdkit-discuss] Extracting SMILES from text

2016-12-02 Thread Alexis Parenty

Dear all, I am looking for a way to extract SMILES scattered in many text documents (thousands of documents of several pages each). At the moment, I am thinking of scanning each word from the text and trying to make a mol object from it using Chem.MolFromSmiles(), then store the words if they return a m
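The word-by-word approach described here could be sketched as follows. The helper names are invented for the example; the only RDKit calls used are Chem.MolFromSmiles() and rdBase.DisableLog("rdApp.*"), both of which appear in this thread. The RDKit import is wrapped so the sketch still runs where RDKit is not installed.

```python
from typing import Callable, List, Optional


def rdkit_check() -> Optional[Callable[[str], bool]]:
    """Return a validity check backed by RDKit, or None if RDKit is unavailable."""
    try:
        from rdkit import Chem, rdBase
    except ImportError:
        return None
    # Silence the parse errors that every non-SMILES word would otherwise log.
    rdBase.DisableLog("rdApp.*")
    return lambda word: Chem.MolFromSmiles(word) is not None


def extract_smiles(text: str, is_smiles: Callable[[str], bool]) -> List[str]:
    """Keep every whitespace-separated word that the check accepts."""
    return [word for word in text.split() if is_smiles(word)]


if __name__ == "__main__":
    check = rdkit_check()
    if check is None:
        print("RDKit not installed; plug in another validity check.")
    else:
        print(extract_smiles("Ethanol is CCO and benzene is c1ccccc1", check))
```

Passing the check in as a function keeps the scanning loop independent of the parser, so prefilters or caches can be layered on later without changing it.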

33 matches
