Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Ling Chan
Thank you for sharing your results, Alexis. This is indeed an interesting problem. I just wonder what the 339 FPs are. Are they all English words with fewer than 6 characters? If RDKit can construct a molecule out of them, I suppose in theory they could be valid SMILES? Looks like the problem

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Andrew Dalke
On Dec 5, 2016, at 3:28 PM, Alexis Parenty wrote:
> For the parenthesis issue, the difficulty is to differentiate the SMILES
> formats (xxx)(xxx) from this one (xxx)… I will try and address
> that using something like:

I do not understand. The first one is not a SMILES format. Can

Re: [Rdkit-discuss] File Conversion?

2016-12-05 Thread Rocco Moretti
It's not something that RDKit can do - RDKit is focused more on small organic molecules, rather than biomacromolecules. For DNA, if all you want is an idealized B-form double helix, there are a number of programs out there which can take in a sequence and make an ideal (or almost-ideal) structure

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread David Cosgrove
Hi Alexis, While you're wrestling with the difference between () and CC(C)C you could also consider that . in a SMILES is valid, and denotes a mixture, for example CCO.O.O (for vodka, maybe). You might get those in FDA documents that discuss formulations, for example. In a well scanned and
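
To illustrate the point about "." in the message above, here is a minimal Python sketch, not code from the thread. It assumes a cheap character-set pre-check (the pattern SMILES_CHARS and both function names are hypothetical) that deliberately allows "." so that dot-separated SMILES such as CCO.O.O are not discarded before they reach a real parser. The pre-check only looks at characters, so ordinary words without punctuation would still pass it; it is a filter, not a validator.

```python
import re

# Assumed character set for a cheap pre-check: atom letters, digits, bonds,
# branches, brackets, charges, stereo marks, and "." as component separator.
SMILES_CHARS = re.compile(r"^[A-Za-z0-9@+\-\[\]()=#$:/\\%.*]+$")


def looks_like_smiles(token):
    """Return True if every character of the token may occur in a SMILES."""
    return bool(SMILES_CHARS.match(token))


def split_components(smiles):
    """Split a dot-separated SMILES into its individual components."""
    return smiles.split(".")


print(looks_like_smiles("CCO.O.O"))
print(split_components("CCO.O.O"))
```

A token that passes this pre-check would still need to be handed to a toolkit parser, for example RDKit's Chem.MolFromSmiles, to decide whether it is a valid SMILES.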

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Alexis Parenty
Oops! Thanks Brian and Igor! At first I did not understand the punctuation issues raised yesterday by Andrew, with SMILES that could be quoted inside parentheses or at the end of a sentence next to a full stop or a semicolon. I see it now. I should remove the punctuation filter. For the
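
The punctuation issue described above can be sketched in Python as follows. This is an illustrative assumption, not the script discussed in the thread: instead of rejecting tokens that carry punctuation, it trims sentence punctuation and unmatched or fully wrapping parentheses from the ends of a token, leaving internal branches untouched. The function names are hypothetical.

```python
def _is_wrapped(token):
    """True if the first "(" is closed only by the very last ")"."""
    if len(token) < 2 or not (token.startswith("(") and token.endswith(")")):
        return False
    depth = 0
    for i, ch in enumerate(token):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            # The leading "(" closed before the end: not a wrapping pair.
            if depth == 0 and i < len(token) - 1:
                return False
    return depth == 0


def strip_sentence_punctuation(token):
    """Trim punctuation that belongs to the sentence, not to the SMILES."""
    while True:
        before = token
        # Trailing full stop, comma, semicolon, or colon.
        token = token.rstrip(".,;:")
        opens, closes = token.count("("), token.count(")")
        if token.startswith("(") and opens > closes:
            token = token[1:]        # unmatched opening parenthesis
        elif token.endswith(")") and closes > opens:
            token = token[:-1]       # unmatched closing parenthesis
        elif _is_wrapped(token):
            token = token[1:-1]      # whole token quoted in parentheses
        if token == before:
            return token


for raw in ["CCO.", "(CC(C)C).", "CC(C)C);", "CCO.O.O."]:
    print(raw, "->", strip_sentence_punctuation(raw))
```

Branch parentheses inside the token, as in CC(C)C, are left alone because they are balanced and do not wrap the whole token.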

Re: [Rdkit-discuss] Extracting SMILES from text

2016-12-05 Thread Alexis Parenty
Dear All, Many thanks to everyone for your participation in this discussion. It was very interesting and useful. I have written a small script that takes on board everyone’s input. It incorporates a few "text filters" before the RDKit function. First of all, I made a dictionary of all the words
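
The idea of running text filters before the toolkit check might look roughly like the Python sketch below. It is not the script from the message, whose details are cut off here. It assumes the dictionary is a set of ordinary words to skip, and it takes the validity check as a parameter (is_valid, a hypothetical name) so that a real parser such as RDKit's Chem.MolFromSmiles can be plugged in; the stand-in check used in the example is purely illustrative.

```python
def candidate_smiles(text, known_words, is_valid):
    """Return whitespace-separated tokens that pass both filters.

    known_words: assumed set of lowercase dictionary words to skip.
    is_valid:    callable deciding whether a token is a valid SMILES.
    """
    candidates = []
    for token in text.split():
        # Text filter: skip dictionary words before the costlier check.
        if token.lower() in known_words:
            continue
        if is_valid(token):
            candidates.append(token)
    return candidates


# Stand-in validity check for the example only: a tiny character whitelist.
def toy_check(token):
    return set(token) <= set("CNOcno()=#123")


words = {"add", "to", "on"}  # hypothetical dictionary
print(candidate_smiles("Add CCO to CC(C)C on", words, toy_check))
```

Lowercasing the token before the dictionary lookup is a design choice of this sketch: it catches capitalised words at the start of a sentence, at the cost of also skipping any SMILES that happens to spell a dictionary word.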