Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Ling Chan
Thank you for sharing your results, Alexis. This is indeed an interesting problem. I just wonder what the 339 FPs are. Are they all English words with fewer than 6 characters? If RDKit can construct a molecule out of them, I suppose in theory they could be valid SMILES? Looks like the problem with

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Andrew Dalke
On Dec 5, 2016, at 3:28 PM, Alexis Parenty wrote:
> For the parenthesis issue, the difficulty is to differentiate the SMILES
> formats (xxx)(xxx) from this one (xxx)… I will try and address
> that using something like:

I do not understand. The first one is not a SMILES format. Can y

Re: [Rdkit-discuss] File Conversion?

2016-12-05 Thread Rocco Moretti
It's not something that RDKit can do - RDKit is focused more on small organic molecules, rather than biomacromolecules. For DNA, if all you want is an idealized B-form double helix, there are a number of programs out there which can take in a sequence and make an ideal (or almost-ideal) structure fr

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread David Cosgrove
Hi Alexis, While you're wrestling with the difference between () and CC(C)C you could also consider that "." in a SMILES is valid and separates disconnected components, as in a mixture, for example CCO.O.O (for vodka, maybe). You might get those in FDA documents that discuss formulations, for example. In a well scanned and p
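To illustrate the point about dots, here is a minimal Python sketch (not taken from the thread) that drops a sentence-final period but keeps the internal dots that separate components. The function name split_components is made up for this example.

```python
def split_components(token):
    """Split a dot-separated SMILES candidate into its components.

    Trailing dots are treated as sentence punctuation and dropped;
    internal dots are kept as component separators.
    """
    core = token.rstrip(".")
    return [part for part in core.split(".") if part]


# The vodka example, written at the end of a sentence.
print(split_components("CCO.O.O."))
```

Each component can then be checked on its own, or the stripped string can be passed to a SMILES parser as a whole.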

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Alexis Parenty
Oops! Thanks Brian and Igor! I did not understand at first the punctuation issues raised yesterday by Andrew, with SMILES that could be quoted inside parentheses or at the end of a sentence next to a full stop or a semicolon. I see it now. I should remove the punctuation filter. For the parenthes

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Brian Kelley
Cool! Btw, try sanitize=False.

Also, Andrew is right that you will miss parenthetical phrases, i.e. Benzene(c1c1) and the like. Just reasserting that this is a hard problem!

Brian Kelley

> On Dec 5, 2016, at 5:35 AM, Alexis Parenty wrote:
>
> Dear All,
> Many thanks to everyon
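For readers unfamiliar with the sanitize flag, a minimal sketch of how it might be used as a token check, assuming RDKit is installed. The wrapper name parses_as_smiles is hypothetical; Chem.MolFromSmiles and its sanitize argument are RDKit's own.

```python
def parses_as_smiles(token, sanitize=False):
    """Return True if RDKit can build a molecule from token.

    With sanitize=False only the SMILES syntax is checked; the valence,
    aromaticity and kekulization steps of sanitization are skipped.
    """
    # Imported lazily so this module loads even where RDKit is missing.
    from rdkit import Chem, RDLogger

    # Ordinary words produce many parse errors; keep them off the console.
    RDLogger.DisableLog("rdApp.error")
    return Chem.MolFromSmiles(token, sanitize=sanitize) is not None
```

With sanitize=False a syntactically correct but chemically impossible string such as C(C)(C)(C)(C)C (a five-bonded carbon) is accepted, whereas sanitize=True rejects it. Whether the looser check helps or hurts on ordinary text is not something this sketch measures.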

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Andrew Dalke
On Dec 5, 2016, at 11:35 AM, Alexis Parenty wrote:
> I have tested my script on:
> • 7900 unique SMILES for “drug-like molecules”
> • Alice’s adventure in wonderland (I never read the book but I assumed
>   there is no SMILES!)
> • A shuffled mixture of Alice’s in wonderland and 7900 uni

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Igor Filippov
Alexis, Nice, but it doesn't seem to take into account Andrew Dalke's comment that valid SMILES may be adjacent to a punctuation mark (e.g. period or parenthesis). Perhaps it is not an issue for your specific project, but maybe instead of simple "split()" it is worthwhile to use something more sop
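One way to go beyond a plain split(), sketched here in Python as an illustration rather than as anyone's actual script: split on whitespace, strip quotes and sentence punctuation from each token, and offer an extra candidate when the whole token is wrapped in one pair of parentheses. The names candidate_tokens and _balanced are invented for this example, and the "Name(SMILES)" case with no space is deliberately not handled.

```python
def _balanced(token):
    """Return True if () and [] in token are properly nested."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in token:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def candidate_tokens(text):
    """Return candidate strings to hand to a SMILES parser."""
    candidates = []
    for raw in text.split():
        # Drop surrounding quotes, then trailing sentence punctuation,
        # then a closing quote that sat in front of that punctuation.
        token = raw.strip("\"'").rstrip(".,;:!?").strip("\"'")
        variants = [token]
        # "(CCO)" also yields "CCO"; "(C)(C)" does not, because removing
        # the outer characters would leave unbalanced parentheses.
        if len(token) > 2 and token[0] == "(" and token[-1] == ")":
            inner = token[1:-1]
            if _balanced(inner):
                variants.append(inner)
        candidates.extend(v for v in variants if v)
    return candidates


print(candidate_tokens("Ethanol (CCO) and isobutane CC(C)C."))
```

Both the wrapped and unwrapped forms are kept, so the parser makes the final decision instead of the tokenizer.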

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Alexis Parenty
Dear All, Many thanks to everyone for your participation in this discussion. It was very interesting and useful. I have written a small script that took on board everyone’s input: This incorporates a few "text filters" before the RDKit function: First of all I made a dictionary of all the words p
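The script itself is not shown in this excerpt. As a rough idea of what a word-list filter placed in front of a parser could look like, here is a Python sketch in which every name (prefilter, looks_like_smiles, SMILES_CHARS) is hypothetical. The character check is only a cheap stand-in; in practice the validator would be a real parser call such as RDKit's Chem.MolFromSmiles.

```python
import re

# Characters that can appear in a SMILES string (letters, digits, bonds,
# brackets, charges, ring-closure and stereo symbols). This set is an
# assumption made for the sketch, not a full grammar.
SMILES_CHARS = re.compile(r"[A-Za-z0-9@+\-\[\]()\\/%=#$:.*]+")


def looks_like_smiles(token):
    """Cheap stand-in validator: only checks the character set."""
    return SMILES_CHARS.fullmatch(token) is not None


def prefilter(tokens, english_words, validator=looks_like_smiles):
    """Drop known English words, then keep tokens the validator accepts."""
    known = {word.lower() for word in english_words}
    kept = []
    for token in tokens:
        if token.lower() in known:
            continue  # ordinary word, never reaches the validator
        if validator(token):
            kept.append(token)
    return kept


words = ["alice", "was"]
print(prefilter(["Alice", "CCO", "was", "c1ccccc1", "rabbit-hole!"], words))
```

The case-insensitive lookup is a design choice with a cost: a short SMILES that happens to spell a listed word in a different case would be discarded too.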