On Fri, Dec 2, 2016 at 6:29 PM, Tim Dudgeon wrote:
> But since I've been working with the Release_2016_09_2 release my Docker
> image builds on Docker Hub [1] are timing out as they sometimes exceed
> the 2 hour limit. If I try at a quiet time I can sometimes get them to
>
:)
George.
Sent from my giPhone
> On 2 Dec 2016, at 22:11, Dimitri Maziuk wrote:
>
>> On 12/02/2016 03:12 PM, George Papadatos wrote:
>> Here's a pragmatic idea:
> ... would it not be safe to
>> assume that *any* word containing more than 4 'C' or 'c' characters would
On 12/02/2016 03:12 PM, George Papadatos wrote:
> Here's a pragmatic idea:
... would it not be safe to
> assume that *any* word containing more than 4 'C' or 'c' characters would
> only be a SMILES string?
pneumonoultramicroscopicsilicovolcanoconiosis
--
Dimitri Maziuk
Programmer/sysadmin
On Dec 2, 2016, at 10:05 PM, Brian Kelley wrote:
> Here is a very old version of Andrew's parser in code form: ... It was fairly
> well tested on the sigma catalog back in the day. It might be fun to
> resurrect and use it in some form.
There's also my OpenSMILES parser written for Ragel:
On Dec 2, 2016, at 10:12 PM, George Papadatos wrote:
> If Alexis wants to search for valid SMILES strings representing typical
> organic molecules among text of plain English words, would it not be safe to
> assume that any word containing more than 4 'C' or 'c' characters would only
> be a
Here's a pragmatic idea:
If Alexis wants to search for valid SMILES strings representing
typical *organic* molecules among text of plain English words, would it not be safe to
assume that *any* word containing more than 4 'C' or 'c' characters would
only be a SMILES string?
This simple filter
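The counting heuristic proposed above can be sketched in a few lines of plain Python. This is only an illustration: the function name `passes_carbon_filter` and the `min_count` parameter are hypothetical, and "more than 4" is read here as "at least 5".

```python
def passes_carbon_filter(word: str, min_count: int = 5) -> bool:
    """Return True if the word contains at least min_count 'C' or 'c' characters.

    Hypothetical helper: it only counts characters and does not parse SMILES.
    """
    return word.lower().count("c") >= min_count


if __name__ == "__main__":
    for word in ["c1ccccc1", "CCO", "submarine",
                 "pneumonoultramicroscopicsilicovolcanoconiosis"]:
        print(word, passes_carbon_filter(word))
```

The filter cuts both ways: a short but valid SMILES such as CCO is rejected, while the long English word quoted elsewhere in this thread is accepted, so it is probably best treated as a cheap pre-screen before a real parser rather than as a decision on its own.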
Here is a very old version of Andrew's parser in code form:
http://frowns.cvs.sourceforge.net/viewvc/frowns/frowns/smiles_parsers/Smiles.py?revision=1.1.1.1=text%2Fplain
that I used in frowns more than a decade ago. It was fairly well tested on the
sigma catalog back in the day. It might be
On Dec 2, 2016, at 11:11 AM, Greg Landrum wrote:
> An initial start on some regexps that match SMILES is here:
> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>
> that may also be useful
I've put together a more gnarly regular expression to find possible
George,
My point was that actually parsing the words as IUPAC/SMILES is surprisingly
effective compared with an AI or rule-based system. Without sanitization,
RDKit is about 60,000/second for SMILES parsing on my laptop. This is much
faster when not making molecules, but I don't have the number
I think Alexis was referring to converting actual SMILES strings found in
random text. Chemical entity recognition and name to structure conversion
is another story altogether and nowadays one can quickly go a long way with
open tools such as OSCAR + OPSIN in KNIME or with something like this:
This was why they started using the dictionary lookup, as I recall :). The IUPAC
system they ended up using was Roger's when at OpenEye.
Brian Kelley
> On Dec 2, 2016, at 12:33 PM, Igor Filippov wrote:
>
> I could be wrong but I believe the IBM system had a
I could be wrong but I believe the IBM system had a preprocessing step which
removed all known dictionary words - which would get rid of "submarine" etc.
I also believe this problem has been solved multiple times in the past,
NextMove software comes to mind, chemical tagger -
Of course builds from source are never fast enough, and the RDKit one is
pretty big.
So far I've lived with this and made cups of coffee.
But since I've been working with the Release_2016_09_2 release my Docker
image builds on Docker Hub [1] are timing out as they sometimes exceed
the 2 hour
I hacked a version of RDKit's SMILES parser to compute heavy atom count;
perhaps some version of this could be used to check SMILES validity without
making the actual molecule.
From a fun historical perspective: IBM had an expert system to find IUPAC
names in documents. They ended up finding
Hello Alexis,
Depending on the size of your document, you could consider limiting the storage
of already tested strings by word length and memoizing only the shorter words.
SMILES tend to be longer, so everything above a given number of characters
has a higher probability of being a SMILES. Large words
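A length-limited cache along these lines might look like the sketch below. The names `make_checker`, `is_smiles`, and `max_cached_len`, and the default threshold of 10, are all assumptions made for illustration; `is_smiles` stands in for whatever validator is actually used.

```python
from typing import Callable, Dict


def make_checker(is_smiles: Callable[[str], bool], max_cached_len: int = 10):
    """Wrap a validator so that only results for short words are remembered.

    Hypothetical helper: words longer than max_cached_len are always
    re-validated and never stored, which keeps the cache small.
    """
    cache: Dict[str, bool] = {}

    def check(word: str) -> bool:
        if len(word) > max_cached_len:
            return is_smiles(word)          # long word: validate, do not store
        if word not in cache:
            cache[word] = is_smiles(word)   # short word: validate once
        return cache[word]

    check.cache = cache  # exposed only so the cache can be inspected
    return check
```

The reasoning behind the design is that short words such as "the" or "and" repeat constantly in ordinary text, so caching them saves the most calls, while long strings are rarer and would mostly just make the cache grow.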
Dear Pavel And Greg,
Thanks Greg for the regexps link. I’ll use that too.
Pavel, I need to track which document each SMILES comes from, but I
will indeed make a set of unique words for each document before looping.
Thanks!
Best,
Alexis
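One way to keep the document of origin while still testing each distinct word only once is to build a word-to-documents index first. The sketch below is a variant of the per-document sets described above, not the script itself; `find_smiles_by_document` and `is_smiles` are hypothetical names, and `is_smiles` could for instance wrap a check that `Chem.MolFromSmiles()` does not return `None`.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


def find_smiles_by_document(
    documents: Dict[str, str],
    is_smiles: Callable[[str], bool],
) -> Dict[str, Set[str]]:
    """Map each word accepted by is_smiles to the ids of the documents containing it.

    documents maps a document id (for example a file name) to its text.
    Words are split on whitespace only; punctuation stripping is left out.
    """
    word_to_docs: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in set(text.split()):      # unique words within this document
            word_to_docs[word].add(doc_id)

    # Each distinct word is validated exactly once, however often it occurs.
    return {word: docs for word, docs in word_to_docs.items() if is_smiles(word)}
```

With this layout the number of validator calls depends on the vocabulary size rather than on the total word count, and no separate list of rejected words is needed.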
On 2 December 2016 at 11:21, Pavel
Hi, Alexis,
if you do not need to track which document the SMILES come from, you may just
combine all words from all documents in a list, take only the unique words
and try to test them. That way, you would not need to store and check
valid/non-valid strings. That would reduce problem complexity as well.
An initial start on some regexps that match SMILES is here:
https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
that may also be useful
On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:
> Hi Markus,
>
>
> Yes! I might discover
Hi Markus,
Yes! I might discover novel compounds that way!! It would be interesting to
see what they look like…
Good suggestion to also store the words that were correctly identified as
SMILES. I’ll add that to the script.
I also like your “distribution of word” idea. I could safely skip any
Hi Alexis,
you may also find some "novel" compounds by this approach :-).
Whether your tuple solution improves performance strongly depends on the
content of your text documents and how often they repeat the same words
again - but my guess would be it will help. Probably the best way is even
Hi Maciek,
Thanks for your quick response. Excellent suggestions. I could filter out a
lot of crap that way... Maybe I could also add a filter on word length to
avoid having a lot of Ethane and Iodide false positives!
This also made me think that I could transform the text into a set to avoid
Hi Alexis,
You may want to use a regex to filter out strings containing invalid
characters (i.e. only a small subset of atoms may appear without
brackets). See "Atoms" section:
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
The set might grow pretty quick and may be inefficient,
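A character-level pre-filter of this kind could be sketched as follows. The pattern is a simplification written for illustration and is not taken from the Daylight documentation or from any regex linked in this thread: it accepts bracket atoms with arbitrary contents, the organic-subset atoms B, C, N, O, P, S, F, Cl, Br, I, the lowercase aromatic atoms, bond symbols, ring-closure digits, dots, and parentheses. The name `looks_like_smiles` is hypothetical.

```python
import re

# One SMILES-like token (simplified, see the assumptions above).
TOKEN = (
    r"(?:\[[^\[\]\s]+\]"      # bracket atom, contents not checked
    r"|Cl|Br|[BCNOPSFI]"      # organic-subset atoms allowed without brackets
    r"|[bcnops]"              # aromatic atoms
    r"|[-=#$:/\\.()]"         # bonds, dot, branches
    r"|%\d{2}|\d)"            # ring closures
)
CANDIDATE = re.compile("(?:" + TOKEN + ")+")


def looks_like_smiles(word: str) -> bool:
    """Return True if the word consists only of SMILES-like tokens.

    This checks characters only; balanced parentheses, ring closures and
    valence are not verified, so a real parser is still needed afterwards.
    """
    return CANDIDATE.fullmatch(word) is not None
```

Words such as "Ethane" and "Iodide" are rejected because they contain letters outside the allowed set, but very short words made only of allowed letters (for example "I" or "so") still pass, which is where an additional length filter could help.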
Dear all,
I am looking for a way to extract SMILES scattered in many text documents
(thousands of documents of several pages each).
At the moment, I am thinking of scanning each word from the text and trying to
make a mol object from it using Chem.MolFromSmiles(), then storing the words
if they return a
23 matches