An initial start on some regexps that match SMILES is here:
https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb

That may also be useful.
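
For illustration, here is a rough sketch of using such a pattern as a pre-filter
over raw text before handing candidates to RDKit (the character class below is
my own over-matching approximation, not the gist's pattern, and the minimum
length of 3 is arbitrary):

import re
from rdkit import Chem

# Coarse, deliberately over-matching pattern: runs of typical SMILES characters.
candidate_pattern = re.compile(r"[A-Za-z0-9@+\-\[\]\(\)\\/=#$%.]{3,}")

def extract_smiles(text):
    # The regex only proposes candidates; RDKit has the final word.
    hits = []
    for token in candidate_pattern.findall(text):
        if Chem.MolFromSmiles(token) is not None:
            hits.append(token)
    return hits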

On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Hi Markus,
>
>
> Yes! I might discover novel compounds that way!! It would be interesting to
> see what they look like…
>
>
> Good suggestion to also store the words that were correctly identified as
> SMILES. I’ll add that to the script.
>
>
> I also like your “distribution of words” idea. I could safely skip any
> word that occurs more than 1% of the time and play around with the
> threshold to find an optimum.
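>
> As a rough sketch of that idea (assuming the text is already split into
> words; the 0.01 cut-off is just the 1% figure above):
>
> from collections import Counter
>
> def infrequent_words(words, max_fraction=0.01):
>     # Drop any word that makes up more than max_fraction of the text,
>     # and return the unique survivors so each is parsed at most once.
>     counts = Counter(words)
>     total = len(words)
>     return {w for w, c in counts.items() if c / total <= max_fraction}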
>
>
> I will try every suggestion and time each one to see which works best. I’ll
> keep everyone in the loop and share the script and results.
>
>
> Thanks,
>
>
> Alexis
>
> On 2 December 2016 at 10:47, Markus Sitzmann <markus.sitzm...@gmail.com>
> wrote:
>
>> Hi Alexis,
>>
>> You may also find some "novel" compounds with this approach :-).
>>
>> Whether your set-based solution improves performance depends strongly on the
>> content of your text documents and how often they repeat the same words -
>> but my guess is that it will help. Probably the best approach is to look at
>> the distribution of words before you feed them to RDKit. You should also
>> "memorize" the words that successfully generated a structure; there is no
>> point in parsing them again.
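>>
>> A minimal sketch of that kind of memoization, assuming a plain module-level
>> dict as the cache (the names are just placeholders):
>>
>> from rdkit import Chem
>>
>> parse_cache = {}  # word -> bool: did this word parse as SMILES before?
>>
>> def looks_like_smiles(word):
>>     # Each distinct word is parsed at most once; repeats hit the cache.
>>     if word not in parse_cache:
>>         parse_cache[word] = Chem.MolFromSmiles(word) is not None
>>     return parse_cache[word]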
>>
>> Markus
>>
>> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski <
>> mac...@wojcikowski.pl> wrote:
>>
>>> Hi Alexis,
>>>
>>> You may want to pre-filter with a regex, dropping strings that contain
>>> characters that are not valid in SMILES (note that only a small subset of
>>> atoms may be written without brackets). See the "Atoms" section:
>>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
>>>
>>> The excluded set might grow pretty quickly and become inefficient, so I'd
>>> simply parse all strings that pass the above filter. There will still be
>>> some false positives, like "CC", which may occur in text (emails especially).
>>>
>>> ----
>>> Pozdrawiam,  |  Best regards,
>>> Maciek Wójcikowski
>>> mac...@wojcikowski.pl
>>>
>>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty <alexis.parenty.h...@gmail.com>:
>>>
>>>> Dear all,
>>>>
>>>>
>>>> I am looking for a way to extract SMILES scattered across many text
>>>> documents (thousands of documents, several pages each).
>>>>
>>>> At the moment, I am thinking of scanning each word in the text, trying to
>>>> make a mol object from it using Chem.MolFromSmiles(), and then storing the
>>>> word if it returns a mol object that is not None.
>>>>
>>>> Can anyone think of a better/quicker way?
>>>>
>>>>
>>>> Would it be worth storing in a set every word that does not return a mol
>>>> object from Chem.MolFromSmiles(), and excluding it from subsequent searches?
>>>>
>>>>
>>>> Something along these lines:
>>>>
>>>>
>>>> from rdkit import Chem
>>>>
>>>> # `text` is assumed to be an iterable of the words in a document.
>>>> excluded_set = set()  # words already known not to parse as SMILES
>>>> smiles_list = []      # words that parsed successfully
>>>>
>>>> for each_word in text:
>>>>     if each_word not in excluded_set:
>>>>         each_word_mol = Chem.MolFromSmiles(each_word)
>>>>         if each_word_mol is not None:
>>>>             smiles_list.append(each_word)
>>>>         else:
>>>>             excluded_set.add(each_word)  # add the word, not the (None) mol
>>>>
>>>>
>>>> Wouldn't searching that growing set actually take more time than just
>>>> blindly trying to make a mol object from every word?
>>>>
>>>>
>>>>
>>>> Any suggestions?
>>>>
>>>>
>>>> Many thanks and regards,
>>>>
>>>>
>>>> Alexis
>>>>