Thank you for sharing your results, Alexis. This is indeed an interesting
problem.
I just wonder what the 339 FPs are. Are they all English words with fewer
than 6 characters? If RDKit can construct a molecule out of them, I suppose
in theory they could be valid SMILES?
Looks like the problem
On Dec 5, 2016, at 3:28 PM, Alexis Parenty wrote:
> For the parenthesis issue, the difficulty is to differentiate the SMILES
> formats (xxx)(xxx) from this one (xxx)… I will try and address
> that using something like:
I do not understand. The first one is not a SMILES format.
Can
It's not something that RDKit can do - RDKit is focused more on small
organic molecules, rather than biomacromolecules.
For DNA, if all you want is an idealized B-form double helix, there are a
number of programs out there which can take in a sequence and make an ideal
(or almost-ideal) structure
Hi Alexis,
While you're wrestling with the difference between () and CC(C)C you
could also consider that a "." in a SMILES is valid and denotes a mixture, for
example CCO.O.O (for vodka, maybe). You might get those in FDA documents
that discuss formulations, for example. In a well scanned and
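To illustrate the point about dots, here is a minimal Python sketch that splits a dot-separated SMILES into its components. The function name `split_components` is hypothetical and not part of RDKit; the sketch only does string splitting and does not check that each component is chemically valid.

```python
def split_components(smiles: str) -> list[str]:
    """Split a dot-separated SMILES string into its component strings.

    Purely textual: no validity check is performed on the components.
    """
    # Empty pieces (e.g. from a stray trailing dot) are dropped.
    return [part for part in smiles.split(".") if part]


components = split_components("CCO.O.O")
print(components)
```

Each component could then be passed separately to whatever validity check is in use.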
Oops! Thanks Brian and Igor! I did not understand at first the punctuation
issues raised yesterday by Andrew, with SMILES that could be quoted inside
parentheses or at the end of a sentence next to a full stop or a semicolon.
I see it now. I should remove the punctuation filter.
For the
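One possible way to handle the punctuation cases described above, instead of rejecting any token that contains punctuation, is to generate a few cleaned-up variants of each token and test each in turn. The sketch below is only an illustration under stated assumptions: `candidate_tokens` and the `TRAILING` character set are hypothetical names and choices, not taken from the script discussed in this thread.

```python
# Characters treated as sentence punctuation when they trail a token.
# This set is an assumption. Only trailing characters are removed, so the
# internal dots of a mixture such as "CCO.O.O" are preserved.
TRAILING = ".,;:!?"


def candidate_tokens(token: str) -> list[str]:
    """Return variants of a raw text token to try as SMILES, most literal first."""
    candidates = [token]

    # Variant 1: drop trailing sentence punctuation.
    stripped = token.rstrip(TRAILING)
    if stripped and stripped not in candidates:
        candidates.append(stripped)

    # Variant 2: drop one enclosing pair of parentheses, as in "(CC(C)C)".
    # For a token such as "(C)(C)" this yields an unbalanced string, which
    # the downstream validity check is expected to reject.
    if len(stripped) > 2 and stripped.startswith("(") and stripped.endswith(")"):
        inner = stripped[1:-1]
        if inner not in candidates:
            candidates.append(inner)

    return candidates


print(candidate_tokens("(CC(C)C);"))
```

Each candidate would then go to the validity check, for example RDKit's `Chem.MolFromSmiles`, which returns `None` when parsing fails; the first candidate that parses could be kept.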
Dear All,
Many thanks to everyone for your participation in this discussion. It
was very interesting and useful. I have written a small script that
takes on board everyone’s input:
This incorporates a few "text filters" before the RDKit function:
First of all I made a dictionary of all the words