I hacked a version of RDKit's SMILES parser to compute the heavy atom count.
Perhaps some version of this could be used to check SMILES validity without
making the actual molecule.
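
To give a rough idea of what counting heavy atoms without building a molecule
could look like, here is a minimal pure-Python sketch. It is not the hacked
RDKit parser; the token grammar, the names _TOKEN and _SYMBOL, and the
function heavy_atom_count are all assumptions made up for illustration. It
only tokenizes and checks parenthesis balance, so it does not verify ring
closures, valence or aromaticity.

```python
import re

# Simplified token grammar; an assumption for illustration, not RDKit's parser.
_TOKEN = re.compile(
    r"(?P<bracket>\[[^\[\]]+\])"          # bracket atom, e.g. [NH4+]
    r"|(?P<organic>Cl|Br|[BCNOPSFI])"     # aliphatic organic-subset atoms
    r"|(?P<aromatic>[bcnops])"            # aromatic organic-subset atoms
    r"|(?P<other>%[0-9]{2}|[0-9()=#:/\\.-])"  # bonds, branches, ring closures
)
# Element symbol inside a bracket atom, after an optional isotope number.
_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_count(candidate):
    """Return the number of non-hydrogen atoms, or None if the string
    does not tokenize under the simplified grammar above."""
    if not candidate:
        return None
    pos, count, depth = 0, 0, 0
    while pos < len(candidate):
        match = _TOKEN.match(candidate, pos)
        if match is None:
            return None  # character that cannot start any token
        token = match.group()
        if match.lastgroup == "bracket":
            symbol = _SYMBOL.match(token)
            if symbol is None:
                return None
            if symbol.group(1) != "H":
                count += 1
        elif match.lastgroup in ("organic", "aromatic"):
            count += 1
        elif token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth < 0:
                return None  # closing branch without an opening one
        pos = match.end()
    return count if depth == 0 else None
```

A string that returns None here can be skipped outright; anything else would
still need a real parser to confirm it.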

From a fun historical perspective:  IBM had an expert system to find IUPAC
names in documents.  It ended up finding things like "submarine", which
was amusing.  It turned out that just parsing all words with the IUPAC
parser was by far the fastest and best solution.  I expect the same will be
true for finding SMILES.
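
The "just parse every word" approach can be sketched as follows. The function
name find_smiles, the punctuation set and the parse argument are assumptions
for illustration; parse is any callable that returns None for strings it
cannot parse.

```python
# Punctuation stripped from word ends; an assumed set, tune for your documents.
# Parentheses and brackets are left alone because they can end a valid SMILES.
_PUNCT = ".,;:!?\"'"


def find_smiles(text, parse):
    """Return the unique words of `text`, in first-seen order, for which
    `parse(word)` is not None."""
    words = (word.strip(_PUNCT) for word in text.split())
    # dict.fromkeys de-duplicates while preserving order.
    unique = dict.fromkeys(word for word in words if word)
    return [word for word in unique if parse(word) is not None]
```

With RDKit installed, Chem.MolFromSmiles (from rdkit import Chem) can be
passed as parse, since it returns None when it cannot parse a string.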

It would be interesting to put the common OCR errors into the parser as
well (l's and 1's are hard for instance).
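
Short of changing the parser itself, one way to try this is to generate
OCR-confusion variants of each word and feed those to the parser as well.
This is only a sketch: the confusion table OCR_CONFUSIONS and the function
ocr_variants are hypothetical, and a real table would have to come from the
OCR engine's actual error profile.

```python
from itertools import product

# Hypothetical confusion table: each character maps to itself first,
# followed by the characters it is commonly mistaken for.
OCR_CONFUSIONS = {
    "l": ("l", "1"),
    "1": ("1", "l"),
    "O": ("O", "0"),
    "0": ("0", "O"),
}


def ocr_variants(word, max_variants=64):
    """Return `word` followed by its OCR-confusion variants, capped at
    `max_variants` to keep long strings from exploding combinatorially."""
    choices = [OCR_CONFUSIONS.get(char, (char,)) for char in word]
    variants = []
    for combo in product(*choices):
        variants.append("".join(combo))
        if len(variants) >= max_variants:
            break
    return variants
```

The original word always comes first, so a caller can try it before falling
back to the variants.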


On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck <peter.ged...@gmail.com>
wrote:

> Hello Alexis,
>
> Depending on the size of your document, you could limit the cache of
> already-tested strings by word length and only memoize shorter words.
> SMILES tend to be longer, so anything above a given number of characters
> has a higher probability of being a SMILES. Long words probably also
> include a lot of chemical names. Those often contain commas (,), so they are
> easy to remove quickly.
>
> Best,
>
> Peter
>
>
> On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty <
> alexis.parenty.h...@gmail.com> wrote:
>
>> Dear Pavel And Greg,
>>
>>
>>
>> Thanks Greg for the regexps link. I’ll use that too.
>>
>>
>> Pavel, I need to track which document each SMILES comes from, but
>> I will indeed make a set of unique words for each document before looping.
>> Thanks!
>>
>> Best,
>>
>> Alexis
>>
>> On 2 December 2016 at 11:21, Pavel <pavel_polishc...@ukr.net> wrote:
>>
>> Hi, Alexis,
>>
>>   if you do not need to track which document each SMILES comes from, you can
>> just combine all words from all documents into a list, keep only the unique
>> words, and test them. That way you do not need to store and check
>> valid/invalid strings. That would reduce the problem's complexity as well.
>>
>> Pavel.
>> On 12/02/2016 11:11 AM, Greg Landrum wrote:
>>
>> An initial start on some regexps that match SMILES is here:
>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>>
>> that may also be useful
>>
>> On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <
>> alexis.parenty.h...@gmail.com> wrote:
>>
>> Hi Markus,
>>
>>
>> Yes! I might discover novel compounds that way!! It would be interesting to
>> see what they look like…
>>
>>
>> Good suggestion to also store the words that were correctly identified as
>> SMILES. I’ll add that to the script.
>>
>>
>> I also like your “distribution of words” idea. I could safely skip any
>> word that accounts for more than 1% of all words and could play around with
>> the threshold to find an optimum.
>>
>>
>> I will try every suggestion and time each one to see what is best. I’ll
>> keep everyone in the loop and will share the script and results.
>>
>>
>> Thanks,
>>
>>
>> Alexis
>>
>> On 2 December 2016 at 10:47, Markus Sitzmann <markus.sitzm...@gmail.com>
>> wrote:
>>
>> Hi Alexis,
>>
>> you may also find some "novel" compounds by this approach :-).
>>
>> Whether your tuple solution improves performance depends strongly on the
>> content of your text documents and how often they repeat the same words,
>> but my guess is that it will help. Probably the best approach is
>> to look at the distribution of words before you feed them to RDKit. You
>> should also "memorize" the ones that successfully generated a structure;
>> it doesn't make sense to parse them again.
>>
>> Markus
>>
>> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski <
>> mac...@wojcikowski.pl> wrote:
>>
>> Hi Alexis,
>>
>> You may want to use a regex to filter out strings containing invalid
>> characters (i.e. only a small subset of atoms may be written without
>> brackets). See the "Atoms" section:
>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
>>
>> The set might grow pretty quickly and become inefficient, so I'd parse all
>> strings that pass the above filter, although there will be some false
>> positives like "CC", which may occur in text (emails especially).
>>
>> ----
>> Pozdrawiam,  |  Best regards,
>> Maciek Wójcikowski
>> mac...@wojcikowski.pl
>>
>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty <alexis.parenty.h...@gmail.com>:
>>
>> Dear all,
>>
>>
>> I am looking for a way to extract SMILES scattered in many text documents
>> (thousands of documents of several pages each).
>>
>> At the moment, I am thinking of scanning each word from the text and trying
>> to make a mol object from it using Chem.MolFromSmiles(), then storing the
>> word if it returns a mol object that is not None.
>>
>> Can anyone think of a better/quicker way?
>>
>>
>> Would it be worth storing in a tuple any words that do not return a mol
>> object from Chem.MolFromSmiles() and excluding them from subsequent searches?
>>
>>
>> Something along those lines
>>
>>
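>> from rdkit import Chem
>>
>> excluded_set = set()  # words that already failed to parse
>> smiles_list = []      # words that parsed as SMILES
>>
>> # text is assumed to be an iterable of words, e.g. document.split()
>> for each_word in text:
>>     if each_word not in excluded_set:
>>         each_word_mol = Chem.MolFromSmiles(each_word)
>>         if each_word_mol is not None:
>>             smiles_list.append(each_word)
>>         else:
>>             # cache the failed word itself, not the None result
>>             excluded_set.add(each_word)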
>>
>>
>> Would searching that growing tuple not actually take more time than
>> blindly trying to make a mol object for every word?
>>
>>
>>
>> Any suggestion?
>>
>>
>> Many thanks and regards,
>>
>>
>> Alexis
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
